commit 64bb3cd6b3a3724dbca4352a0cb17e8cb694a0f2 Author: Ludvig Strigeus Date: Wed Aug 8 13:12:38 2018 +0200 TunSafe open source (Same as 1.3-rc3 version) diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..fc80387 --- /dev/null +++ b/.gitignore @@ -0,0 +1,18 @@ +/Debug/ +/Release/ +/ipzip2/Debug/ +/Build +/Win32/ +/TunSafe.aps +/ipch +/*.sdf +/*vcxproj.user +/*.opensdf +/*.suo +/.vs/ +/x64/ +/Azire.conf +/*.psess +/*.vspx +/installer/*.zip +/config/ \ No newline at end of file diff --git a/LICENSE.AGPL.TXT b/LICENSE.AGPL.TXT new file mode 100644 index 0000000..a38b98c --- /dev/null +++ b/LICENSE.AGPL.TXT @@ -0,0 +1,76 @@ +AFFERO GENERAL PUBLIC LICENSE +Version 1, March 2002 + +Copyright © 2002 Affero Inc. +510 Third Street - Suite 225, San Francisco, CA 94107, USA + +This license is a modified version of the GNU General Public License copyright (C) 1989, 1991 Free Software Foundation, Inc. made with their permission. Section 2(d) has been added to cover use of software over a computer network. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. + +Preamble + +The licenses for most software are designed to take away your freedom to share and change it. By contrast, the Affero General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This Public License applies to most of Affero's software and to any other program whose authors commit to using it. (Some other Affero software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. + +When we speak of free software, we are referring to freedom, not price. This General Public License is designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. + +To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. + +For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. + +We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. + +Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. + +Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. 
+ +The precise terms and conditions for copying, distribution and modification follow. + +TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + +0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this Affero General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". +Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. + +1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. +You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. + +2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: +a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. +b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. +c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) 
+d) If the Program as you received it is intended to interact with users through a computer network and if, in the version you received, any user interacting with the Program was given the opportunity to request transmission to that user of the Program's complete source code, you must not remove that facility from your modified version of the Program or work based on the Program, and must offer an equivalent opportunity for all users interacting with your Program through a computer network to request immediate transmission by HTTP of the complete source code of your modified version or other derivative work. +These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. + +3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: +a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, +b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, +c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) +The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. + +4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. +5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. +6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. +7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. +If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. + +This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. + +8. 
If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. +9. Affero Inc. may publish revised and/or new versions of the Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. +Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by Affero, Inc. If the Program does not specify a version number of this License, you may choose any version ever published by Affero, Inc. + +You may also choose to redistribute modified versions of this program under any version of the Free Software Foundation's GNU General Public License version 3 or higher, so long as that version of the GNU GPL includes terms and conditions substantially equivalent to those of this license. + +10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by Affero, Inc., write to us; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. +NO WARRANTY + +11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. +12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. diff --git a/README.md b/README.md new file mode 100644 index 0000000..c679487 --- /dev/null +++ b/README.md @@ -0,0 +1,11 @@ +# TunSafe +Source code of the TunSafe client. + +This open sourced TunSafe code is AGPL-1.0 licensed. Do note that the repository contains BSD and OpenSSL licensed files, so if you want to release a version based off of this repository you need to take that into account. 
+ +To build on Windows, open TunSafe.sln and build, or run build.py. + +To build on Linux, run build_linux.sh + +To build on FreeBSD, run build_freebsd.sh + diff --git a/TunSafe.conf b/TunSafe.conf new file mode 100644 index 0000000..073c9e5 --- /dev/null +++ b/TunSafe.conf @@ -0,0 +1,16 @@ +[Interface] +PrivateKey = KMakx+0sYjWKnkY2pO8+CFZ0Sp+Gzzp/GfxwlR+WgXQ= +ListenPort = 51820 +Address = 192.168.2.2/24 +MTU = 1420 + + +[Peer] +PublicKey = 2m1BdGW9AwwF5dqaGm0NgMggdDZDUPFAL4JxCySdgBw= +#AllowedIPs = 0.0.0.0/0, fc00::2/64 +AllowedIPs = 192.168.2.0/24 +Endpoint = 192.168.1.4:8040 +#Endpoint = [fe80::6825:68f4:7c6f:42d4]:8040 +PersistentKeepalive = 25 + + diff --git a/TunSafe.rc b/TunSafe.rc new file mode 100644 index 0000000..7139e02 Binary files /dev/null and b/TunSafe.rc differ diff --git a/TunSafe.sln b/TunSafe.sln new file mode 100644 index 0000000..cc929b1 --- /dev/null +++ b/TunSafe.sln @@ -0,0 +1,46 @@ + +Microsoft Visual Studio Solution File, Format Version 12.00 +# Visual Studio 15 +VisualStudioVersion = 15.0.26403.7 +MinimumVisualStudioVersion = 10.0.40219.1 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "TunSafe", "TunSafe.vcxproj", "{626FBC16-64C6-407D-BC2B-6C087794E0D0}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|Win32 = Debug|Win32 + Debug|x64 = Debug|x64 + Release|Win32 = Release|Win32 + Release|x64 = Release|x64 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.ActiveCfg = Debug|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.Build.0 = Debug|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.ActiveCfg = Debug|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.Build.0 = Debug|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.ActiveCfg = Release|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.Build.0 = Release|Win32 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.ActiveCfg = Release|x64 + {626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.Build.0 = Release|x64 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection + GlobalSection(Performance) = preSolution + HasPerformanceSessions = true + EndGlobalSection +EndGlobal diff --git a/TunSafe.vcxproj b/TunSafe.vcxproj new file mode 100644 index 0000000..f9118c6 --- /dev/null +++ b/TunSafe.vcxproj @@ -0,0 +1,268 @@ + + + + + Debug + Win32 + + + Debug + x64 + + + Release + Win32 + + + Release + x64 + + + + {626FBC16-64C6-407D-BC2B-6C087794E0D0} + Win32Proj + TunSafe + 10.0.15063.0 + TunSafe + + + + Application + true + v141 + MultiByte + + + Application + true + v141 + MultiByte + + + Application + false + v141 + true + MultiByte + + + Application + false + v141 + true + MultiByte + + + + + + + + + + + + + + + + + + + + true + TunSafe + $(SolutionDir)$(Platform)\$(Configuration)\ + $(Platform)\$(Configuration)\ + + + true + 
$(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm + TunSafe + + + false + TunSafe + $(SolutionDir)$(Platform)\$(Configuration)\ + $(Platform)\$(Configuration)\ + + + false + $(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm + TunSafe + + + + Use + Level3 + Disabled + WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS + . + + + Windows + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + Use + Level3 + Disabled + WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS=1 + + + . + + + Windows + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib;Comctl32.lib + + + RequireAdministrator + + + + + Level3 + Use + MaxSpeed + true + true + WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS + MultiThreaded + . + + + Windows + true + true + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + Level3 + Use + MinSpace + true + true + WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS=1 + MultiThreaded + Size + + + AnySuitable + true + . + + + Windows + true + true + true + kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib + RequireAdministrator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + NotUsing + NotUsing + + + NotUsing + NotUsing + + + NotUsing + NotUsing + NotUsing + NotUsing + + + Create + Create + Create + Create + + + + + + + + + + + + + + + + + true + true + + + true + true + + + true + true + + + Document + true + true + + + true + true + + + true + true + + + + + + + \ No newline at end of file diff --git a/TunSafe.vcxproj.filters b/TunSafe.vcxproj.filters new file mode 100644 index 0000000..220b7f6 --- /dev/null +++ b/TunSafe.vcxproj.filters @@ -0,0 +1,154 @@ + + + + + {4FC737F1-C7A5-4376-A066-2A32D752A2FF} + cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx + + + {cfa17b4c-1bee-434e-81b4-ba780c3f7e2d} + + + {49ba9478-f871-449f-a410-b401e993893f} + + + {d31b1b9f-4a2e-42d4-a26c-7c3daa4ccbe3} + + + + + Source Files + + + Source Files + + + + Source Files + + + Source Files + + + Source Files + + + Source Files\Win32 + + + Source Files\Win32 + + + Source Files\Win32 + + + crypto + + + crypto + + + Source Files + + + Source Files + + + crypto + + + Source Files + + + crypto\aesgcm + + + Source Files + + + Source Files + + + Source Files + + + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files\Win32 + + + Source Files\Win32 + + + Source Files\Win32 + + + crypto + + + crypto + + + crypto + + + crypto + + + Source Files + + + crypto + + + crypto\aesgcm + + + Source Files + + + Source Files + + + + + + + + + + + + + + crypto + + + crypto + + + crypto + + + crypto\aesgcm + + + crypto\aesgcm + + + 
crypto\aesgcm + + \ No newline at end of file diff --git a/benchmark.cpp b/benchmark.cpp new file mode 100644 index 0000000..4d30d80 --- /dev/null +++ b/benchmark.cpp @@ -0,0 +1,94 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus. All Rights Reserved.
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "crypto/chacha20poly1305.h"
+#include "crypto/aesgcm/aes.h"
+#include "tunsafe_cpu.h"
+
+#include <functional>
+#include <string.h>
+
+#if defined(OS_FREEBSD) || defined(OS_LINUX)
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+// Minimal POSIX stand-ins for the Win32 QueryPerformanceCounter API,
+// backed by the monotonic clock; the "frequency" is nanoseconds per second.
+typedef uint64 LARGE_INTEGER;
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  struct timespec ts;
+  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
+    fprintf(stderr, "clock_gettime failed\n");
+    exit(1);
+  }
+  *x = (uint64)ts.tv_sec * 1000000000 + ts.tv_nsec;
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  *x = 1000000000;
+}
+#elif defined(OS_MACOSX)
+#include <mach/mach_time.h>
+#include <stdio.h>
+#include <stdlib.h>
+typedef uint64 LARGE_INTEGER;
+
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  *x = mach_absolute_time();
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  mach_timebase_info_data_t timebase = { 0, 0 };
+  if (mach_timebase_info(&timebase) != 0)
+    abort();
+  printf("numer/denom: %d %d\n", timebase.numer, timebase.denom);
+  // Assumes timebase.numer == 1, which holds on Intel Macs.
+  *x = timebase.denom * 1000000000;
+}
+
+#endif
+
+int gcm_self_test();
+
+
+
+void *fake_glb;
+void Benchmark() {
+  int64 a, b, f, t1 = 0, t2 = 0;
+
+#if WITH_AESGCM
+  gcm_self_test();
+#endif // WITH_AESGCM
+
+  PrintCpuFeatures();
+
+  QueryPerformanceFrequency((LARGE_INTEGER*)&f);
+
+  uint8 dst[1500 + 16];
+  uint8 key[32] = {0, 1, 2, 3, 4, 5, 6};
+  uint8 mac[16];
+
+  fake_glb = dst;
+
+  auto RunOneBenchmark = [&](const char *name, const std::function<uint64(size_t)> &ff) {
+    uint64 bytes = 0;
+    QueryPerformanceCounter((LARGE_INTEGER*)&b);
+    size_t i;
+    for (i = 0; bytes < 1000000000; i++)
+      bytes += ff(i);
+    QueryPerformanceCounter((LARGE_INTEGER*)&a);
+    RINFO("%s: %f MB/s", name, (double)bytes * 0.000001 / (a - b) * f);
+  };
+
+  memset(dst, 0, 1500);
+  RunOneBenchmark("chacha20-encrypt", [&](size_t i) -> uint64 { chacha20poly1305_encrypt(dst, dst, 1460, NULL, 0, i, key); return 1460; });
+  RunOneBenchmark("chacha20-decrypt", [&](size_t i) -> uint64 { chacha20poly1305_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+  RunOneBenchmark("poly1305-only", [&](size_t i) -> uint64 { poly1305_get_mac(dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+#if WITH_AESGCM
+  if (X86_PCAP_AES) {
+    AesGcm128StaticContext sctx;
+    CRYPTO_gcm128_init(&sctx, key, 128);
+
+    RunOneBenchmark("aes128-gcm-encrypt", [&](size_t i) -> uint64 { aesgcm_encrypt(dst, dst, 1460, NULL, 0, i, &sctx); return 1460; });
+    RunOneBenchmark("aes128-gcm-decrypt", [&](size_t i) -> uint64 { aesgcm_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, &sctx, mac); return 1460; });
+  }
+#endif // WITH_AESGCM
+} diff --git a/bit_ops.h b/bit_ops.h new file mode 100644 index 0000000..1e22032 --- /dev/null +++ b/bit_ops.h @@ -0,0 +1,49 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus. All Rights Reserved.
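+//
+// Portable bit-scan helpers: MSVC-style _BitScanReverse wrappers for
+// GCC/Clang (and a 64-bit variant for 32-bit MSVC), plus
+// FindHighestSetBit32/64/128, which return 1 + the index of the highest
+// set bit, or 0 when the input is 0.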
+#pragma once + +#include "tunsafe_types.h" +#include "tunsafe_endian.h" + +#if !defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) +static inline int _BitScanReverse64(unsigned long *index, uint64 x) { + if (_BitScanReverse(index, x >> 32)) { + (*index) += 32; + return true; + } + return _BitScanReverse(index, (uint32)x); +} +#endif + +#if !defined(COMPILER_MSVC) +static inline int _BitScanReverse64(unsigned long *index, uint64 x) { + *index = 63 - __builtin_clzll(x); + return (x != 0); +} + +static inline int _BitScanReverse(unsigned long *index, uint32 x) { + *index = 31 - __builtin_clz(x); + return (x != 0); +} + +#endif + +static inline int FindHighestSetBit32(uint32 x) { + unsigned long index; + return _BitScanReverse(&index, x) ? (int)(index + 1) : 0; +} + +static inline int FindLastSetBit32(uint32 x) { + unsigned long index; + _BitScanReverse(&index, x); + return index; +} + +static inline int FindHighestSetBit64(uint64 x) { + unsigned long index; + return _BitScanReverse64(&index, x) ? (int)(index + 1) : 0; +} + +static inline int FindHighestSetBit128(uint64 hi, uint64 lo) { + return hi ? 64 + FindHighestSetBit64(hi) : FindHighestSetBit64(lo); +} diff --git a/build.py b/build.py new file mode 100644 index 0000000..934e577 --- /dev/null +++ b/build.py @@ -0,0 +1,95 @@ +# SPDX-License-Identifier: AGPL-1.0-only +# Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +import os +import shutil +import win32crypt +import base64 +import sys +import zipfile +import re + +MSBUILD_PATH = r"C:\Dev\VS2017\MSBuild\15.0\Bin\MSBuild.exe" +NSIS_PATH = r'C:\Dev\NSIS\makeNSIS.EXE' + +SIGNTOOL_PATH = r'c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe' +SIGNTOOL_KEY_PATH = '' # put key here +SIGNTOOL_PASS = '' # put key pass here + +def RmTree(path): + try: + print ('Deleting %s' % path) + shutil.rmtree(path) + except FileNotFoundError: + pass + +def Run(s): + print ('Running %s' % s) + x = os.system(s) + if x: + raise Exception('Command failed (%d) : %s' % (x, s)) + +def CopyFile(src, dst): + shutil.copyfile(src, dst) + +def SignExe(src): + print ('Signing %s' % src) + cmd = r'""c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe" sign /f "%s" /p %s /t http://timestamp.verisign.com/scripts/timstamp.dll "%s"' % (SIGNTOOL_KEY_PATH, SIGNTOOL_PASS, src) + #cmd = r'""c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe" sign %s ' % (SIGNTOOL_KEY_PATH, ) + x = os.system(cmd) + if x: + raise Exception('Signing failed (%d) : %s' % (x, cmd)) + +def GetVersion(): + for line in open(BASE + '/tunsafe_config.h', 'r'): + m = re.match('^#define TUNSAFE_VERSION_STRING "TunSafe (.*)"$', line) + if m: + return m.group(1) + raise Exception('Version not found') + +# + +#os.system(r'""') + +command = sys.argv[1] + +BASE = r'D:\Code\TunSafe' + + +if command == 'build_tap': + Run(r'%s /V4 installer\tap\tap-windows6.nsi' % NSIS_PATH) + SignExe(r'installer\tap\TunSafe-TAP-9.21.2.exe') + sys.exit(0) + +if 1: + RmTree(BASE + r'\Win32\Release') + RmTree(BASE + r'\x64\Release') + Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=x64' % MSBUILD_PATH) + Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=Win32' % MSBUILD_PATH) + +if 1: + CopyFile(BASE + r'\Win32\Release\TunSafe.exe', + BASE + r'\installer\x86\TunSafe.exe') + + SignExe(BASE + r'\installer\x86\TunSafe.exe') + CopyFile(BASE + r'\x64\Release\TunSafe.exe', + BASE + r'\installer\x64\TunSafe.exe') + SignExe(BASE + r'\installer\x64\TunSafe.exe') + +VERSION 
= GetVersion()
+
+Run(r'%s /V4 -DPRODUCT_VERSION=%s installer\tunsafe.nsi ' % (NSIS_PATH, VERSION))
+SignExe(BASE + r'\installer\TunSafe-%s.exe' % VERSION)
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x86.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x86\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close()
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x64.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x64\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close() diff --git a/build_config.h b/build_config.h new file mode 100644 index 0000000..953087c --- /dev/null +++ b/build_config.h @@ -0,0 +1,116 @@
+// File is taken from Chromium
+#ifndef BUILD_BUILD_CONFIG_H_
+#define BUILD_BUILD_CONFIG_H_
+
+#if defined(__APPLE__)
+#include <TargetConditionals.h>
+#endif
+
+// A set of macros to use for platform detection.
+#if defined(__APPLE__)
+#define OS_MACOSX 1
+#if defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#define OS_IOS 1
+#endif // defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#elif defined(ANDROID)
+#define OS_ANDROID 1
+#elif defined(__native_client__)
+#define OS_NACL 1
+#elif defined(__FLASHPLAYER)
+#define OS_FLASHPLAYER 1
+#elif defined(__linux__)
+#define OS_LINUX 1
+#elif defined(_WIN32)
+#define OS_WIN 1
+#elif defined(__FreeBSD__)
+#define OS_FREEBSD 1
+#elif defined(__OpenBSD__)
+#define OS_OPENBSD 1
+#elif defined(__sun)
+#define OS_SOLARIS 1
+#elif defined(EMSCRIPTEN)
+#define OS_EMSCRIPTEN 1
+#else
+#error Please add support for your platform in build_config.h
+#endif
+
+// For access to standard BSD features, use OS_BSD instead of a
+// more specific macro.
+#if defined(OS_FREEBSD) || defined(OS_OPENBSD)
+#define OS_BSD 1
+#endif
+
+// For access to standard POSIXish features, use OS_POSIX instead of a
+// more specific macro.
+#if defined(OS_MACOSX) || defined(OS_LINUX) || defined(OS_FREEBSD) || \
+    defined(OS_OPENBSD) || defined(OS_SOLARIS) || defined(OS_ANDROID) || \
+    defined(OS_NACL)
+#define OS_POSIX 1
+#endif
+
+#if defined(OS_POSIX) && !defined(OS_MACOSX) && !defined(OS_ANDROID) && \
+    !defined(OS_NACL)
+#define USE_X11 1 // Use X for graphics.
+#endif
+
+// Compiler detection.
+#if defined(__GNUC__)
+#define COMPILER_GCC 1
+
+#if defined(__clang__)
+#define COMPILER_CLANG 1
+#endif
+#elif defined(_MSC_VER)
+#define COMPILER_MSVC 1
+#elif defined(__TINYC__)
+#define COMPILER_TCC 1
+#else
+#error Please add support for your compiler in build/build_config.h
+#endif
+
+// Processor architecture detection.
For more info on what's defined, see: +// http://msdn.microsoft.com/en-us/library/b0084kay.aspx +// http://www.agner.org/optimize/calling_conventions.pdf +// or with gcc, run: "echo | gcc -E -dM -" +#if defined(_M_X64) || defined(__x86_64__) +#define ARCH_CPU_X86_FAMILY 1 +#define ARCH_CPU_X86_64 1 +#define ARCH_CPU_64_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#define ARCH_CPU_ALLOW_UNALIGNED 1 +#elif defined(_M_IX86) || defined(__i386__) +#define ARCH_CPU_X86_FAMILY 1 +#define ARCH_CPU_X86 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#define ARCH_CPU_ALLOW_UNALIGNED 1 +#define ARCH_CPU_NEED_64BIT_ALIGN 1 +#elif defined(__ARMEL__) || defined(__arm__) && defined(__ARMCC_VERSION) +#define ARCH_CPU_ARM_FAMILY 1 +#define ARCH_CPU_ARMEL 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(__pnacl__) +#define ARCH_CPU_32_BITS 1 +#elif defined(__MIPSEL__) +#define ARCH_CPU_MIPS_FAMILY 1 +#define ARCH_CPU_MIPSEL 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(EMSCRIPTEN) +#define ARCH_CPU_JS 1 +#define ARCH_CPU_32_BITS 1 +#define ARCH_CPU_LITTLE_ENDIAN 1 +#elif defined(__FLASHPLAYER) +#define ARCH_CPU_FLASHPLAYER 1 +#define ARCH_CPU_32_BITS 1 +#else +#error Please add support for your architecture in build_config.h +#endif + +#if defined(ARCH_CPU_LITTLE_ENDIAN) && defined(ARCH_CPU_BIG_ENDIAN) || !defined(ARCH_CPU_LITTLE_ENDIAN) && !defined(ARCH_CPU_BIG_ENDIAN) +#error Please add support for your endianness in build_config.h +#endif + + +#endif // BUILD_BUILD_CONFIG_H_ diff --git a/build_freebsd.sh b/build_freebsd.sh new file mode 100644 index 0000000..93a6236 --- /dev/null +++ b/build_freebsd.sh @@ -0,0 +1,2 @@ +g++7 -I . -O2 -static -mssse3 -o tunsafe benchmark.cpp tunsafe_cpu.cpp wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp crypto/siphash.cpp crypto/chacha20_x64_gas.s crypto/poly1305_x64_gas.s ipzip2/ipzip2.cpp -lrt + diff --git a/build_linux.sh b/build_linux.sh new file mode 100644 index 0000000..63a15bc --- /dev/null +++ b/build_linux.sh @@ -0,0 +1,9 @@ +#!/bin/sh +clang++-6.0 -c -march=skylake-avx512 crypto/poly1305_x64_gas.s crypto/chacha20_x64_gas.s +clang++-6.0 -I . -O3 -mssse3 -pthread -lrt -o tunsafe util.cpp wireguard_config.cpp wireguard.cpp \ +wireguard_proto.cpp network_bsd_mt.cpp tunsafe_cpu.cpp benchmark.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp \ +crypto/curve25519-donna.cpp crypto/siphash.cpp chacha20_x64_gas.o crypto/aesgcm/aesni_gcm_x64_gas.s \ +crypto/aesgcm/aesni_x64_gas.s crypto/aesgcm/aesgcm.cpp poly1305_x64_gas.o ipzip2/ipzip2.cpp \ +crypto/aesgcm/ghash_x64_gas.s + + diff --git a/build_osx.sh b/build_osx.sh new file mode 100644 index 0000000..b95681b --- /dev/null +++ b/build_osx.sh @@ -0,0 +1,17 @@ +set -e + + +clang++ -c -mavx512f -mavx512vl crypto/poly1305_x64_gas_macosx.s crypto/chacha20_x64_gas_macosx.s + +clang++ -g -O3 -I . 
-std=c++11 -DNDEBUG=1 -fno-exceptions -fno-rtti -ffunction-sections -o tunsafe \ +wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd_mt.cpp benchmark.cpp tunsafe_cpu.cpp \ +crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp \ +crypto/siphash.cpp crypto/aesgcm/aesgcm.cpp ipzip2/ipzip2.cpp \ +crypto/aesgcm/aesni_gcm_x64_gas_macosx.s crypto/aesgcm/aesni_x64_gas_macosx.s crypto/aesgcm/ghash_x64_gas_macosx.s \ +chacha20_x64_gas_macosx.o poly1305_x64_gas_macosx.o + +cp tunsafe tunsafe.unstripped +strip tunsafe +rm -f tunsafe_osx.zip +zip tunsafe_osx.zip tunsafe readme_osx.txt + diff --git a/crypto/.gitignore b/crypto/.gitignore new file mode 100644 index 0000000..fb7243a --- /dev/null +++ b/crypto/.gitignore @@ -0,0 +1 @@ +/old/ \ No newline at end of file diff --git a/crypto/aesgcm/aes.h b/crypto/aesgcm/aes.h new file mode 100644 index 0000000..310b1eb --- /dev/null +++ b/crypto/aesgcm/aes.h @@ -0,0 +1,84 @@ +/** + * Downloaded from + * + * http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-fst-3.0.zip + * + * rijndael-alg-fst.h + * + * @version 3.0 (December 2000) + * + * Optimised ANSI C code for the Rijndael cipher (now AES) + * + * @author Vincent Rijmen + * @author Antoon Bosselaers + * @author Paulo Barreto + * + * This code is hereby placed in the public domain. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS + * OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE + * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, + * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+#ifndef __RIJNDAEL_ALG_FST_H
+#define __RIJNDAEL_ALG_FST_H
+
+#include "tunsafe_types.h"
+
+#define AESGCM_MAXNR 14
+
+struct AesContext {
+  uint32 rk[(AESGCM_MAXNR + 1) * 4];
+  int rounds;
+};
+
+typedef struct { uint64 hi, lo; } aesgcm_u128;
+
+struct AesGcm128StaticContext {
+  void(*gmult)(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+  void(*ghash)(uint64 Xi[2], const aesgcm_u128 Htable[16], const uint8 *inp, size_t len);
+  bool use_aesni_gcm_crypt;
+
+  // Don't move H and Htable because the asm code depends on them
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16 / sizeof(size_t)]; } H;
+  aesgcm_u128 Htable[16];
+  AesContext aes;
+};
+
+struct AesGcm128TempContext {
+  AesGcm128StaticContext *sctx;
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16/sizeof(size_t)]; } EKi,EK0,len, Yi, Xi;
+  unsigned int mres, ares;
+};
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size);
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv,size_t len);
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len);
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx, unsigned char *tag, size_t len);
+
+void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                    const uint8 *ad, const size_t ad_len,
+                    const uint64 nonce, AesGcm128StaticContext *sctx);
+
+void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,
+                            const uint8 *ad, const size_t ad_len,
+                            const uint64 nonce, AesGcm128StaticContext *sctx,
+                            uint8 mac[16]);
+
+// AES-GCM is only implemented for x86-64; with this set to 0 the
+// implementation is compiled out entirely.
+#if defined(ARCH_CPU_X86_64)
+#define WITH_AESGCM 0
+#endif
+
+
+
+#endif /* __RIJNDAEL_ALG_FST_H */ diff --git a/crypto/aesgcm/aesgcm.cpp b/crypto/aesgcm/aesgcm.cpp new file mode 100644 index 0000000..12ac2cd --- /dev/null +++ b/crypto/aesgcm/aesgcm.cpp @@ -0,0 +1,882 @@
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "tunsafe_endian.h"
+#include "tunsafe_cpu.h"
+#include "crypto/aesgcm/aes.h"
+#include <string.h>
+#include <assert.h>
+#include <stdlib.h>
+//#include
+#include "crypto/chacha20poly1305.h"
+#define AESNIGCM_ASM 1
+#define AESGCM_ASM 1
+#define AESNI_GCM 1
+
+// We only implement AES stuff on X86-64
+#if WITH_AESGCM
+
+extern "C" {
+void gcm_init_clmul(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_init_avx(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_avx(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_avx(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_gmult_4bit(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len);
+
+// ivec points to Yi followed by Xi
+// h_and_htable points at h and htable from the static context
+size_t aesni_gcm_encrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+size_t aesni_gcm_decrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+void aesni_ctr32_encrypt_blocks(const void *in, void *out, size_t blocks, const AesContext *key, const uint8 *ivec);
+void aesni_encrypt(const void *inp, void *out, const AesContext *key);
+void aesni_decrypt(const void *inp, void *out, const AesContext *key);
+int aesni_set_encrypt_key(const unsigned char *inp, int bits, AesContext *key);
+int aesni_set_decrypt_key(const unsigned char *inp, int bits, AesContext *key);
+};
+
+
+#define GCM_MUL(ctx,Xi) (*gcm_gmult_p)(ctx->Xi.u,sctx->Htable)
+#define GHASH(ctx,in,len) (*gcm_ghash_p)(ctx->Xi.u,sctx->Htable,in,len)
+#define GHASH_CHUNK (3*1024)
+
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len) {
+  size_t i;
+  unsigned int n;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64 alen = ctx->len.u[0];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+
+  assert(!ctx->len.u[1]);
+// if () return -2;
+  alen += len;
+// if (alen>(uint64(1)<<61) || (sizeof(len)==8 && alen<len)) return -2;
+  ctx->len.u[0] = alen;
+
+  n = ctx->ares;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(aad++);
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->ares = n;
+      return;
+    }
+  }
+
+#ifdef GHASH
+  if ((i = (len&(size_t)-16))) {
+    GHASH(ctx,aad,i);
+    aad += i;
+    len -= i;
+  }
+#else
+  while (len>=16) {
+    for (i=0; i<16; ++i) ctx->Xi.c[i] ^= aad[i];
+    GCM_MUL(ctx,Xi);
+    aad += 16;
+    len -= 16;
+  }
+#endif
+  if (len) {
+    n = (unsigned int)len;
+    for (i=0; i<n; ++i) ctx->Xi.c[i] ^= aad[i];
+  }
+
+  ctx->ares = n;
+}
+
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64 mlen = ctx->len.u[1];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+  mlen += len;
+// if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len)) return -2;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to encrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(out++) = *(in++)^ctx->EKi.c[n];
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt && len >= 0x120) {
+    // |aesni_gcm_encrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
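+    // The returned |bulk| is the prefix the assembly actually consumed; the
+    // remainder (possibly the whole buffer) falls through to the
+    // aesni_ctr32_encrypt_blocks/GHASH loops below.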
+    size_t bulk = aesni_gcm_encrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u);
+    in += bulk;
+    out += bulk;
+    len -= bulk;
+  }
+#endif
+  ctr = ReadBE32(ctx->Yi.c + 12);
+
+#if defined(STRICT_ALIGNMENT)
+  if (((size_t)in | (size_t)out) % sizeof(size_t) != 0) {
+    for (i = 0; i < len; ++i) {
+      if (n == 0) {
+        aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+        ++ctr;
+        WriteBE32(ctx->Yi.c + 12, ctr);
+      }
+      ctx->Xi.c[n] ^= out[i] = in[i] ^ ctx->EKi.c[n];
+      n = (n + 1) % 16;
+      if (n == 0)
+        GCM_MUL(ctx, Xi);
+    }
+    ctx->mres = n;
+    return;
+  }
+#endif
+  while (len>=GHASH_CHUNK) {
+    aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, GHASH_CHUNK);
+    ctr += GHASH_CHUNK / 16;
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    in += GHASH_CHUNK;
+    out += GHASH_CHUNK;
+    len -= GHASH_CHUNK;
+  }
+  if ((i = (len&(size_t)-16))) {
+    aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, i);
+    ctr += (uint32)(i / 16);
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    out += i;
+    in += i;
+    len -= i;
+  }
+  if (len) {
+    aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+    ++ctr;
+    WriteBE32(ctx->Yi.c+12,ctr);
+    while (len--) {
+      ctx->Xi.c[n] ^= out[n] = in[n] ^ ctx->EKi.c[n];
+      ++n;
+    }
+  }
+  ctx->mres = n;
+}
+
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  uint64 mlen = ctx->len.u[1];
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+
+  mlen += len;
+// if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len)) return -2;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to decrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      uint8 c = *(in++);
+      *(out++) = c^ctx->EKi.c[n];
+      ctx->Xi.c[n] ^= c;
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL (ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt) {
+    // |aesni_gcm_decrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
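+    // Unlike the encrypt path there is no |len >= 0x120| pre-check here; a
+    // too-short input presumably just makes the assembly return 0 so that
+    // the loops below process the whole buffer.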
+ size_t bulk = aesni_gcm_decrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u); + in += bulk; + out += bulk; + len -= bulk; + } +#endif + ctr = ReadBE32(ctx->Yi.c + 12); + +#if defined(STRICT_ALIGNMENT) + if (((size_t)in|(size_t)out)%sizeof(size_t) != 0) { + for (i=0;iYi.c, ctx->EKi.c, key); + ++ctr; + WriteBE32(ctx->Yi.c+12,ctr); + } + c = in[i]; + out[i] = c^ctx->EKi.c[n]; + ctx->Xi.c[n] ^= c; + n = (n+1)%16; + if (n==0) + GCM_MUL(ctx,Xi); + } + ctx->mres = n; + return; + } +#endif + while (len >= GHASH_CHUNK) { + GHASH(ctx, in, GHASH_CHUNK); + aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c); + ctr += GHASH_CHUNK / 16; + WriteBE32(ctx->Yi.c + 12, ctr); + in += GHASH_CHUNK; + out += GHASH_CHUNK; + len -= GHASH_CHUNK; + } + if ((i = (len&(size_t)-16))) { + GHASH(ctx, in, i); + aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c); + ctr += (uint32)(i / 16); + WriteBE32(ctx->Yi.c + 12, ctr); + out += i; + in += i; + len -= i; + } + if (len) { + aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes); + ++ctr; + WriteBE32(ctx->Yi.c+12,ctr); + while (len--) { + uint8 c = in[n]; + ctx->Xi.c[n] ^= c; + out[n] = c^ctx->EKi.c[n]; + ++n; + } + } + ctx->mres = n; +} + +void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx,uint8 *tag, size_t len) { + uint64 alen = ctx->len.u[0]<<3; + uint64 clen = ctx->len.u[1]<<3; + AesGcm128StaticContext *sctx = ctx->sctx; + void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult; + + if (ctx->mres || ctx->ares) + GCM_MUL(ctx,Xi); + + alen = ToBE64(alen); + clen = ToBE64(clen); + + ctx->Xi.u[0] ^= alen; + ctx->Xi.u[1] ^= clen; + GCM_MUL(ctx,Xi); + + ctx->Xi.u[0] ^= ctx->EK0.u[0]; + ctx->Xi.u[1] ^= ctx->EK0.u[1]; + + memcpy(tag, ctx->Xi.c,len); +} + +#define REDUCE1BIT(V) do { \ + if (sizeof(size_t)==8) { \ + uint64 T = 0xe100000000000000ull & (0-(V.lo&1)); \ + V.lo = (V.hi<<63)|(V.lo>>1); \ + V.hi = (V.hi>>1 )^T; \ + } else { \ + uint32 T = 0xe1000000U & (0-(uint32)(V.lo&1)); \ + V.lo = (V.hi<<63)|(V.lo>>1); \ + V.hi = (V.hi>>1 )^((uint64)T<<32); \ + } \ +} while(0) + +static void gcm_init_4bit(aesgcm_u128 Htable[16], uint64 H[2]) { + aesgcm_u128 V; + + Htable[0].hi = 0; + Htable[0].lo = 0; + V.hi = H[0]; + V.lo = H[1]; + + Htable[8] = V; + REDUCE1BIT(V); + Htable[4] = V; + REDUCE1BIT(V); + Htable[2] = V; + REDUCE1BIT(V); + Htable[1] = V; + Htable[3].hi = V.hi^Htable[2].hi, Htable[3].lo = V.lo^Htable[2].lo; + V=Htable[4]; + Htable[5].hi = V.hi^Htable[1].hi, Htable[5].lo = V.lo^Htable[1].lo; + Htable[6].hi = V.hi^Htable[2].hi, Htable[6].lo = V.lo^Htable[2].lo; + Htable[7].hi = V.hi^Htable[3].hi, Htable[7].lo = V.lo^Htable[3].lo; + V=Htable[8]; + Htable[9].hi = V.hi^Htable[1].hi, Htable[9].lo = V.lo^Htable[1].lo; + Htable[10].hi = V.hi^Htable[2].hi, Htable[10].lo = V.lo^Htable[2].lo; + Htable[11].hi = V.hi^Htable[3].hi, Htable[11].lo = V.lo^Htable[3].lo; + Htable[12].hi = V.hi^Htable[4].hi, Htable[12].lo = V.lo^Htable[4].lo; + Htable[13].hi = V.hi^Htable[5].hi, Htable[13].lo = V.lo^Htable[5].lo; + Htable[14].hi = V.hi^Htable[6].hi, Htable[14].lo = V.lo^Htable[6].lo; + Htable[15].hi = V.hi^Htable[7].hi, Htable[15].lo = V.lo^Htable[7].lo; +} + + +#if !AESGCM_ASM +#define PACK(s) ((size_t)(s)<<(sizeof(size_t)*8-16)) +static const size_t rem_4bit[16] = { + PACK(0x0000), PACK(0x1C20), PACK(0x3840), PACK(0x2460), + PACK(0x7080), PACK(0x6CA0), PACK(0x48C0), PACK(0x54E0), + PACK(0xE100), PACK(0xFD20), PACK(0xD940), PACK(0xC560), + PACK(0x9180), PACK(0x8DA0), PACK(0xA9C0), PACK(0xB5E0)}; + +void gcm_gmult_4bit(uint64 
Xi[2], const aesgcm_u128 Htable[16]) {
+  aesgcm_u128 Z;
+  int cnt = 15;
+  size_t rem, nlo, nhi;
+  const union { long one; char little; } is_endian = {1};
+
+  nlo = ((const uint8 *)Xi)[15];
+  nhi = nlo>>4;
+  nlo &= 0xf;
+
+  Z.hi = Htable[nlo].hi;
+  Z.lo = Htable[nlo].lo;
+
+  while (1) {
+    rem = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nhi].hi;
+    Z.lo ^= Htable[nhi].lo;
+
+    if (--cnt<0) break;
+
+    nlo = ((const uint8 *)Xi)[cnt];
+    nhi = nlo>>4;
+    nlo &= 0xf;
+
+    rem = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nlo].hi;
+    Z.lo ^= Htable[nlo].lo;
+  }
+  Xi[0] = ToBE64(Z.hi);
+  Xi[1] = ToBE64(Z.lo);
+}
+
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) {
+  aesgcm_u128 Z;
+  int cnt;
+  size_t rem, nlo, nhi;
+
+  do {
+    cnt = 15;
+    nlo = ((const uint8 *)Xi)[15];
+    nlo ^= inp[15];
+    nhi = nlo>>4;
+    nlo &= 0xf;
+
+    Z.hi = Htable[nlo].hi;
+    Z.lo = Htable[nlo].lo;
+
+    while (1) {
+      rem = (size_t)Z.lo&0xf;
+      Z.lo = (Z.hi<<60)|(Z.lo>>4);
+      Z.hi = (Z.hi>>4);
+      if (sizeof(size_t)==8)
+        Z.hi ^= rem_4bit[rem];
+      else
+        Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+      Z.hi ^= Htable[nhi].hi;
+      Z.lo ^= Htable[nhi].lo;
+
+      if (--cnt<0) break;
+
+      nlo = ((const uint8 *)Xi)[cnt];
+      nlo ^= inp[cnt];
+      nhi = nlo>>4;
+      nlo &= 0xf;
+
+      rem = (size_t)Z.lo&0xf;
+      Z.lo = (Z.hi<<60)|(Z.lo>>4);
+      Z.hi = (Z.hi>>4);
+      if (sizeof(size_t)==8)
+        Z.hi ^= rem_4bit[rem];
+      else
+        Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+      Z.hi ^= Htable[nlo].hi;
+      Z.lo ^= Htable[nlo].lo;
+    }
+    Xi[0] = ToBE64(Z.hi);
+    Xi[1] = ToBE64(Z.lo);
+
+  } while (inp+=16, len-=16);
+}
+#endif
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size) {
+  memset(ctx,0,sizeof(*ctx));
+  ctx->use_aesni_gcm_crypt = X86_PCAP_MOVBE;
+  aesni_set_encrypt_key(key, key_size, &ctx->aes);
+  aesni_encrypt(ctx->H.c,ctx->H.c, &ctx->aes);
+  ctx->H.u[0] = ToBE64(ctx->H.u[0]);
+  ctx->H.u[1] = ToBE64(ctx->H.u[1]);
+  if (X86_PCAP_AVX) {
+    gcm_init_avx(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_avx;
+    ctx->ghash = gcm_ghash_avx;
+  } else if (X86_PCAP_PCLMULQDQ) {
+    gcm_init_clmul(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_clmul;
+    ctx->ghash = gcm_ghash_clmul;
+  } else {
+    gcm_init_4bit(ctx->Htable, ctx->H.u);
+    ctx->gmult = gcm_gmult_4bit;
+    ctx->ghash = gcm_ghash_4bit;
+  }
+}
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv, size_t len) {
+  unsigned int ctr;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16]) = sctx->gmult;
+
+  ctx->sctx = sctx;
+  ctx->Yi.u[0] = 0;
+  ctx->Yi.u[1] = 0;
+  ctx->Xi.u[0] = 0;
+  ctx->Xi.u[1] = 0;
+  ctx->len.u[0] = 0; /* AAD length */
+  ctx->len.u[1] = 0; /* message length */
+  ctx->ares = 0;
+  ctx->mres = 0;
+
+  if (len==12) {
+    memcpy(ctx->Yi.c,iv,12);
+    ctx->Yi.c[15]=1;
+    ctr=1;
+  } else {
+    size_t i;
+    uint64 len0 = len;
+
+    while (len>=16) {
+      for (i=0; i<16; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+      iv += 16;
+      len -= 16;
+    }
+    if (len) {
+      for (i=0; i<len; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+    }
+    len0 <<= 3;
+    ctx->Yi.u[1] ^= ToBE64(len0);
+
+    GCM_MUL(ctx,Yi);
+
+    ctr = ToBE32(ctx->Yi.d[3]);
+  }
+
+  aesni_encrypt(ctx->Yi.c, ctx->EK0.c, &sctx->aes);
+  ++ctr;
+  ctx->Yi.d[3] = ToBE32(ctr);
+}
+
+union AesGcmIV {
+  uint32 nonce[3];
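+  // Viewed as bytes this is the 12-byte GCM IV handed to
+  // CRYPTO_gcm128_setiv: the 64-bit little-endian packet counter
+  // followed by four zero bytes (see aesgcm_encrypt below).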
+ uint8 nonceb[12]; +}; + +void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, AesGcm128StaticContext *sctx) { + AesGcm128TempContext ctx; + AesGcmIV iv; + + WriteLE64(iv.nonce, nonce); + iv.nonce[2] = 0; + + CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv)); + CRYPTO_gcm128_aad(&ctx, ad, ad_len); + CRYPTO_gcm128_encrypt_ctr32(&ctx, src, dst, src_len); + CRYPTO_gcm128_finish(&ctx, dst + src_len, 16); +} + +void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, AesGcm128StaticContext *sctx, + uint8 mac[16]) { + AesGcm128TempContext ctx; + AesGcmIV iv; + + WriteLE64(iv.nonce, nonce); + iv.nonce[2] = 0; + + CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv)); + CRYPTO_gcm128_aad(&ctx, ad, ad_len); + CRYPTO_gcm128_decrypt_ctr32(&ctx, src, dst, src_len); + CRYPTO_gcm128_finish(&ctx, mac, 16); +} + +#if 1 + +/* +* GCM test vectors from: +* +* http://csrc.nist.gov/groups/STM/cavp/documents/mac/gcmtestvectors.zip +*/ +#define MAX_TESTS 6 + +static int key_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 1 }; + +static uint8 key[MAX_TESTS][32] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, + { 0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c, + 0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08, + 0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c, + 0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08 }, +}; + +static size_t iv_len[MAX_TESTS] = +{ 12, 12, 12, 12, 8, 60 }; + +static int iv_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 2 }; + +static uint8 iv[MAX_TESTS][64] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00 }, + { 0xca, 0xfe, 0xba, 0xbe, 0xfa, 0xce, 0xdb, 0xad, + 0xde, 0xca, 0xf8, 0x88 }, + { 0x93, 0x13, 0x22, 0x5d, 0xf8, 0x84, 0x06, 0xe5, + 0x55, 0x90, 0x9c, 0x5a, 0xff, 0x52, 0x69, 0xaa, + 0x6a, 0x7a, 0x95, 0x38, 0x53, 0x4f, 0x7d, 0xa1, + 0xe4, 0xc3, 0x03, 0xd2, 0xa3, 0x18, 0xa7, 0x28, + 0xc3, 0xc0, 0xc9, 0x51, 0x56, 0x80, 0x95, 0x39, + 0xfc, 0xf0, 0xe2, 0x42, 0x9a, 0x6b, 0x52, 0x54, + 0x16, 0xae, 0xdb, 0xf5, 0xa0, 0xde, 0x6a, 0x57, + 0xa6, 0x37, 0xb3, 0x9b }, +}; + +static size_t add_len[MAX_TESTS] = +{ 0, 0, 0, 20, 20, 20 }; + +int add_index[MAX_TESTS] = +{ 0, 0, 0, 1, 1, 1 }; + +static uint8 additional[MAX_TESTS][64] = +{ + { 0x00 }, + { 0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef, + 0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef, + 0xab, 0xad, 0xda, 0xd2 }, +}; + +static size_t pt_len[MAX_TESTS] = +{ 0, 16, 64, 60, 60, 60 }; + +static int pt_index[MAX_TESTS] = +{ 0, 0, 1, 1, 1, 1 }; + +static uint8 pt[MAX_TESTS][64] = +{ + { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, + { 0xd9, 0x31, 0x32, 0x25, 0xf8, 0x84, 0x06, 0xe5, + 0xa5, 0x59, 0x09, 0xc5, 0xaf, 0xf5, 0x26, 0x9a, + 0x86, 0xa7, 0xa9, 0x53, 0x15, 0x34, 0xf7, 0xda, + 0x2e, 0x4c, 0x30, 0x3d, 0x8a, 0x31, 0x8a, 0x72, + 0x1c, 0x3c, 0x0c, 0x95, 0x95, 0x68, 0x09, 0x53, + 0x2f, 0xcf, 0x0e, 0x24, 0x49, 0xa6, 0xb5, 0x25, + 0xb1, 0x6a, 0xed, 0xf5, 0xaa, 0x0d, 0xe6, 0x57, + 0xba, 0x63, 0x7b, 0x39, 0x1a, 0xaf, 0xd2, 0x55 }, +}; + +static uint8 ct[MAX_TESTS * 3][64] = +{ + { 0x00 }, + { 0x03, 0x88, 0xda, 0xce, 0x60, 0xb6, 0xa3, 0x92, + 0xf3, 0x28, 0xc2, 0xb9, 0x71, 0xb2, 0xfe, 0x78 }, + { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24, + 0x4b, 0x72, 
0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, + 0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0, + 0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, + 0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c, + 0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, + 0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97, + 0x3d, 0x58, 0xe0, 0x91, 0x47, 0x3f, 0x59, 0x85 }, + { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24, + 0x4b, 0x72, 0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, + 0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0, + 0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, + 0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c, + 0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, + 0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97, + 0x3d, 0x58, 0xe0, 0x91 }, + { 0x61, 0x35, 0x3b, 0x4c, 0x28, 0x06, 0x93, 0x4a, + 0x77, 0x7f, 0xf5, 0x1f, 0xa2, 0x2a, 0x47, 0x55, + 0x69, 0x9b, 0x2a, 0x71, 0x4f, 0xcd, 0xc6, 0xf8, + 0x37, 0x66, 0xe5, 0xf9, 0x7b, 0x6c, 0x74, 0x23, + 0x73, 0x80, 0x69, 0x00, 0xe4, 0x9f, 0x24, 0xb2, + 0x2b, 0x09, 0x75, 0x44, 0xd4, 0x89, 0x6b, 0x42, + 0x49, 0x89, 0xb5, 0xe1, 0xeb, 0xac, 0x0f, 0x07, + 0xc2, 0x3f, 0x45, 0x98 }, + { 0x8c, 0xe2, 0x49, 0x98, 0x62, 0x56, 0x15, 0xb6, + 0x03, 0xa0, 0x33, 0xac, 0xa1, 0x3f, 0xb8, 0x94, + 0xbe, 0x91, 0x12, 0xa5, 0xc3, 0xa2, 0x11, 0xa8, + 0xba, 0x26, 0x2a, 0x3c, 0xca, 0x7e, 0x2c, 0xa7, + 0x01, 0xe4, 0xa9, 0xa4, 0xfb, 0xa4, 0x3c, 0x90, + 0xcc, 0xdc, 0xb2, 0x81, 0xd4, 0x8c, 0x7c, 0x6f, + 0xd6, 0x28, 0x75, 0xd2, 0xac, 0xa4, 0x17, 0x03, + 0x4c, 0x34, 0xae, 0xe5 }, + { 0x00 }, + { 0x98, 0xe7, 0x24, 0x7c, 0x07, 0xf0, 0xfe, 0x41, + 0x1c, 0x26, 0x7e, 0x43, 0x84, 0xb0, 0xf6, 0x00 }, + { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41, + 0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, + 0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84, + 0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, + 0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25, + 0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, + 0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9, + 0xcc, 0xda, 0x27, 0x10, 0xac, 0xad, 0xe2, 0x56 }, + { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41, + 0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, + 0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84, + 0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, + 0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25, + 0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, + 0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9, + 0xcc, 0xda, 0x27, 0x10 }, + { 0x0f, 0x10, 0xf5, 0x99, 0xae, 0x14, 0xa1, 0x54, + 0xed, 0x24, 0xb3, 0x6e, 0x25, 0x32, 0x4d, 0xb8, + 0xc5, 0x66, 0x63, 0x2e, 0xf2, 0xbb, 0xb3, 0x4f, + 0x83, 0x47, 0x28, 0x0f, 0xc4, 0x50, 0x70, 0x57, + 0xfd, 0xdc, 0x29, 0xdf, 0x9a, 0x47, 0x1f, 0x75, + 0xc6, 0x65, 0x41, 0xd4, 0xd4, 0xda, 0xd1, 0xc9, + 0xe9, 0x3a, 0x19, 0xa5, 0x8e, 0x8b, 0x47, 0x3f, + 0xa0, 0xf0, 0x62, 0xf7 }, + { 0xd2, 0x7e, 0x88, 0x68, 0x1c, 0xe3, 0x24, 0x3c, + 0x48, 0x30, 0x16, 0x5a, 0x8f, 0xdc, 0xf9, 0xff, + 0x1d, 0xe9, 0xa1, 0xd8, 0xe6, 0xb4, 0x47, 0xef, + 0x6e, 0xf7, 0xb7, 0x98, 0x28, 0x66, 0x6e, 0x45, + 0x81, 0xe7, 0x90, 0x12, 0xaf, 0x34, 0xdd, 0xd9, + 0xe2, 0xf0, 0x37, 0x58, 0x9b, 0x29, 0x2d, 0xb3, + 0xe6, 0x7c, 0x03, 0x67, 0x45, 0xfa, 0x22, 0xe7, + 0xe9, 0xb7, 0x37, 0x3b }, + { 0x00 }, + { 0xce, 0xa7, 0x40, 0x3d, 0x4d, 0x60, 0x6b, 0x6e, + 0x07, 0x4e, 0xc5, 0xd3, 0xba, 0xf3, 0x9d, 0x18 }, + { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, + 0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d, + 0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, + 0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, + 0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, + 0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 
0x38, + 0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, + 0xbc, 0xc9, 0xf6, 0x62, 0x89, 0x80, 0x15, 0xad }, + { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, + 0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d, + 0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, + 0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, + 0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, + 0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 0x38, + 0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, + 0xbc, 0xc9, 0xf6, 0x62 }, + { 0xc3, 0x76, 0x2d, 0xf1, 0xca, 0x78, 0x7d, 0x32, + 0xae, 0x47, 0xc1, 0x3b, 0xf1, 0x98, 0x44, 0xcb, + 0xaf, 0x1a, 0xe1, 0x4d, 0x0b, 0x97, 0x6a, 0xfa, + 0xc5, 0x2f, 0xf7, 0xd7, 0x9b, 0xba, 0x9d, 0xe0, + 0xfe, 0xb5, 0x82, 0xd3, 0x39, 0x34, 0xa4, 0xf0, + 0x95, 0x4c, 0xc2, 0x36, 0x3b, 0xc7, 0x3f, 0x78, + 0x62, 0xac, 0x43, 0x0e, 0x64, 0xab, 0xe4, 0x99, + 0xf4, 0x7c, 0x9b, 0x1f }, + { 0x5a, 0x8d, 0xef, 0x2f, 0x0c, 0x9e, 0x53, 0xf1, + 0xf7, 0x5d, 0x78, 0x53, 0x65, 0x9e, 0x2a, 0x20, + 0xee, 0xb2, 0xb2, 0x2a, 0xaf, 0xde, 0x64, 0x19, + 0xa0, 0x58, 0xab, 0x4f, 0x6f, 0x74, 0x6b, 0xf4, + 0x0f, 0xc0, 0xc3, 0xb7, 0x80, 0xf2, 0x44, 0x45, + 0x2d, 0xa3, 0xeb, 0xf1, 0xc5, 0xd8, 0x2c, 0xde, + 0xa2, 0x41, 0x89, 0x97, 0x20, 0x0e, 0xf8, 0x2e, + 0x44, 0xae, 0x7e, 0x3f }, +}; + +static uint8 tag[MAX_TESTS * 3][16] = +{ + { 0x58, 0xe2, 0xfc, 0xce, 0xfa, 0x7e, 0x30, 0x61, + 0x36, 0x7f, 0x1d, 0x57, 0xa4, 0xe7, 0x45, 0x5a }, + { 0xab, 0x6e, 0x47, 0xd4, 0x2c, 0xec, 0x13, 0xbd, + 0xf5, 0x3a, 0x67, 0xb2, 0x12, 0x57, 0xbd, 0xdf }, + { 0x4d, 0x5c, 0x2a, 0xf3, 0x27, 0xcd, 0x64, 0xa6, + 0x2c, 0xf3, 0x5a, 0xbd, 0x2b, 0xa6, 0xfa, 0xb4 }, + { 0x5b, 0xc9, 0x4f, 0xbc, 0x32, 0x21, 0xa5, 0xdb, + 0x94, 0xfa, 0xe9, 0x5a, 0xe7, 0x12, 0x1a, 0x47 }, + { 0x36, 0x12, 0xd2, 0xe7, 0x9e, 0x3b, 0x07, 0x85, + 0x56, 0x1b, 0xe1, 0x4a, 0xac, 0xa2, 0xfc, 0xcb }, + { 0x61, 0x9c, 0xc5, 0xae, 0xff, 0xfe, 0x0b, 0xfa, + 0x46, 0x2a, 0xf4, 0x3c, 0x16, 0x99, 0xd0, 0x50 }, + { 0xcd, 0x33, 0xb2, 0x8a, 0xc7, 0x73, 0xf7, 0x4b, + 0xa0, 0x0e, 0xd1, 0xf3, 0x12, 0x57, 0x24, 0x35 }, + { 0x2f, 0xf5, 0x8d, 0x80, 0x03, 0x39, 0x27, 0xab, + 0x8e, 0xf4, 0xd4, 0x58, 0x75, 0x14, 0xf0, 0xfb }, + { 0x99, 0x24, 0xa7, 0xc8, 0x58, 0x73, 0x36, 0xbf, + 0xb1, 0x18, 0x02, 0x4d, 0xb8, 0x67, 0x4a, 0x14 }, + { 0x25, 0x19, 0x49, 0x8e, 0x80, 0xf1, 0x47, 0x8f, + 0x37, 0xba, 0x55, 0xbd, 0x6d, 0x27, 0x61, 0x8c }, + { 0x65, 0xdc, 0xc5, 0x7f, 0xcf, 0x62, 0x3a, 0x24, + 0x09, 0x4f, 0xcc, 0xa4, 0x0d, 0x35, 0x33, 0xf8 }, + { 0xdc, 0xf5, 0x66, 0xff, 0x29, 0x1c, 0x25, 0xbb, + 0xb8, 0x56, 0x8f, 0xc3, 0xd3, 0x76, 0xa6, 0xd9 }, + { 0x53, 0x0f, 0x8a, 0xfb, 0xc7, 0x45, 0x36, 0xb9, + 0xa9, 0x63, 0xb4, 0xf1, 0xc4, 0xcb, 0x73, 0x8b }, + { 0xd0, 0xd1, 0xc8, 0xa7, 0x99, 0x99, 0x6b, 0xf0, + 0x26, 0x5b, 0x98, 0xb5, 0xd4, 0x8a, 0xb9, 0x19 }, + { 0xb0, 0x94, 0xda, 0xc5, 0xd9, 0x34, 0x71, 0xbd, + 0xec, 0x1a, 0x50, 0x22, 0x70, 0xe3, 0xcc, 0x6c }, + { 0x76, 0xfc, 0x6e, 0xce, 0x0f, 0x4e, 0x17, 0x68, + 0xcd, 0xdf, 0x88, 0x53, 0xbb, 0x2d, 0x55, 0x1b }, + { 0x3a, 0x33, 0x7d, 0xbf, 0x46, 0xa7, 0x92, 0xc4, + 0x5e, 0x45, 0x49, 0x13, 0xfe, 0x2e, 0xa8, 0xf2 }, + { 0xa4, 0x4a, 0x82, 0x66, 0xee, 0x1c, 0x8e, 0xb0, + 0xc8, 0xb5, 0xd4, 0xcf, 0x5a, 0xe9, 0xf1, 0x9a }, +}; + +int gcm_self_test() +{ + uint8 buf[64]; + uint8 tag_buf[16]; + int i, j; + + AesGcm128TempContext ctx; + AesGcm128StaticContext sctx; + + + { + AesContext aes; + uint8 key[16] = {43,126,21,22,40,174,210,166,171,247,21,136,9,207,79,60}; + uint8 in[16] = {107,193,190,226,46,64,159,150,233,61,126,17,115,147,23,42}; + uint8 out[16] = 
{58,215,123,180,13,122,54,96,168,158,202,243,36,102,239,151}, t[16]; + aesni_set_encrypt_key(key, 128, &aes); + aesni_encrypt(in, t, &aes); + if (memcmp(t, out,16)) { printf("AES test fail!\n"); return 1; } + aesni_set_decrypt_key(key, 128, &aes); + aesni_decrypt(out, t, &aes); + if (memcmp(t, in,16)) { printf("AES test fail!\n"); return 1; } + } + + uint8 correct[] = { 62,85,184,249,224,220,4,77,201,216,202,172,121,7,25,200, }; + if (0) { + uint8 buf[512 + 16]; + for (size_t i = 0; i < 512; i++) + buf[i] = (uint8)(i >> 4);// 0x11; + uint8 buf2[512 + 16]; + for (size_t i = 0; i < 512; i++) + buf2[i] = buf[i]; + + size_t pp = 0x60; + + CRYPTO_gcm128_init(&sctx, key[0], 128); + + sctx.use_aesni_gcm_crypt = 1; + + aesgcm_decrypt_get_mac(buf, buf, pp, NULL, 0, 1, &sctx, buf + pp); + sctx.use_aesni_gcm_crypt = 0; + aesgcm_decrypt_get_mac(buf2, buf2, pp, NULL, 0, 1, &sctx, buf2 + pp); + //aesgcm_encrypt(buf, buf, 0x120 + 32, NULL, 0, 1, &sctx); + + for (size_t i = 0; i < 16; i++) + printf("%d,", buf[pp + i]); + printf("\n"); + for (size_t i = 0; i < 16; i++) + printf("%d,", buf2[pp + i]); + printf("\n"); + + if (memcmp(buf2 + pp, buf + pp, 16) == 0) + printf("CORRECT!!\n"); + else + printf("******** FAIL ************\n"); +// for(size_t i = 0; i < 16; i++) +// printf("%d,", buf[pp +i]); + printf("\n"); + } + return 0; + + for( j = 0; j < 3; j++ ) { + int key_len = 128 + 64 * j; + for( i = 0; i < MAX_TESTS; i++ ) { + CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len); + CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]); + CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]); + CRYPTO_gcm128_encrypt_ctr32(&ctx, pt[pt_index[i]], buf, pt_len[i]); + CRYPTO_gcm128_finish(&ctx, tag_buf, 16); + if(memcmp( buf, ct[j * 6 + i], pt_len[i] ) != 0 || + memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) { + printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "enc" ); + return( 1 ); + } + + CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len); + CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]); + CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]); + CRYPTO_gcm128_decrypt_ctr32(&ctx, ct[j * 6 + i], buf, pt_len[i]); + CRYPTO_gcm128_finish(&ctx, tag_buf, 16); + if(memcmp( buf, pt[pt_index[i]], pt_len[i] ) != 0 || + memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) { + printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "dec" ); + return( 1 ); + } + } + } + + return( 0 ); +} + +//int main() { +// gcm_self_test(); +//} +#endif + +#endif // #if WITH_AESGCM diff --git a/crypto/aesgcm/aesni-gcm-x86_64.pl b/crypto/aesgcm/aesni-gcm-x86_64.pl new file mode 100644 index 0000000..f1607c7 --- /dev/null +++ b/crypto/aesgcm/aesni-gcm-x86_64.pl @@ -0,0 +1,1146 @@ +#! /usr/bin/env perl +# Copyright 2013-2016 The OpenSSL Project Authors. All Rights Reserved. + +# Ludde note : This is the stitched AES+GCM code. Min size = 0x60 +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# +# AES-NI-CTR+GHASH stitch. 
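+# +# Note on the "Min size = 0x60" remark above: the stitched loop below +# consumes six counter blocks (96 = 0x60 bytes) per iteration, so +# aesni_gcm_decrypt rejects inputs shorter than 0x60 bytes and +# aesni_gcm_encrypt inputs shorter than 3*0x60 bytes (both return 0 +# bytes processed), leaving short buffers to the caller's generic +# CTR+GHASH path.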
+# +# February 2013 +# +# The OpenSSL GCM implementation is organized in such a way that its +# performance is rather close to the sum of its streamed components, +# in this context parallelized AES-NI CTR and modulo-scheduled +# PCLMULQDQ-enabled GHASH. Unfortunately, as no stitch implementation +# was observed to perform significantly better than the sum of the +# components on contemporary CPUs, the effort was deemed impossible to +# justify. This module is based on a combination of Intel submissions, +# [1] and [2], with a MOVBE twist suggested by Ilya Albrekht and Max +# Locktyukhin of Intel Corp., who verified that it reduces shuffle +# pressure with notable relative improvement, achieving 1.0 cycle per +# byte processed with a 128-bit key on a Haswell processor, 0.74 on +# Broadwell and 0.63 on Skylake... [Mentioned results are raw profiled +# measurements for a favourable packet size, one divisible by 96. +# Applications using the EVP interface will observe a few percent +# worse performance.] +# +# Knights Landing processes 1 byte in 1.25 cycles (measured with EVP). +# +# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest +# [2] http://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +# |$avx| in ghash-x86_64.pl must be set to at least 1; otherwise tags will +# be computed incorrectly. +# +# In upstream, this is controlled by shelling out to the compiler to check +# versions, but BoringSSL is intended to be used with pre-generated perlasm +# output, so this isn't useful anyway. +# +# The upstream code uses the condition |$avx>1| even though no AVX2 +# instructions are used, because it assumes MOVBE is supported by the assembler +# if and only if AVX2 is also supported by the assembler; see +# https://marc.info/?l=openssl-dev&m=146567589526984&w=2. +$avx = 2; + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +# See the comment above regarding why the condition is ($avx>1) when there are +# no AVX2 instructions being used. +if ($avx>1) {{{ + +($inp,$out,$len,$key,$ivp,$Xip)=("%rdi","%rsi","%rdx","%rcx","%r8","%r9"); + +($Ii,$T1,$T2,$Hkey, + $Z0,$Z1,$Z2,$Z3,$Xi) = map("%xmm$_",(0..8)); + +($inout0,$inout1,$inout2,$inout3,$inout4,$inout5,$rndkey) = map("%xmm$_",(9..15)); + +($counter,$rounds,$ret,$const,$in0,$end0)=("%ebx","%ebp","%r10","%r11","%r14","%r15"); + +$code=<<___; +.text + +.type _aesni_ctr32_ghash_6x,\@abi-omnipotent +.align 32 +_aesni_ctr32_ghash_6x: +.cfi_startproc + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + sub \$6,$len + vpxor $Z0,$Z0,$Z0 # $Z0 = 0 + vmovdqu 0x00-0x80($key),$rndkey + vpaddb $T2,$T1,$inout1 + vpaddb $T2,$inout1,$inout2 + vpaddb $T2,$inout2,$inout3 + vpaddb $T2,$inout3,$inout4 + vpaddb $T2,$inout4,$inout5 + vpxor $rndkey,$T1,$inout0 + vmovdqu $Z0,16+8(%rsp) # "$Z3" = 0 + jmp .Loop6x + +.align 32 +.Loop6x: + add \$`6<<24`,$counter + jc .Lhandle_ctr32 # discard $inout[1-5]?
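+ # $counter caches the IV's last dword unswapped, so the big-endian + # counter's low byte sits in bits 24-31 here; the add above bumps it + # by 6, and a carry out means one of the six vpaddb-incremented + # counters would wrap its low byte, so .Lhandle_ctr32 must rebuild + # them all with full 32-bit adds.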
+ vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpaddb $T2,$inout5,$T1 # next counter value + vpxor $rndkey,$inout1,$inout1 + vpxor $rndkey,$inout2,$inout2 + +.Lresume_ctr32: + vmovdqu $T1,($ivp) # save next counter value + vpclmulqdq \$0x10,$Hkey,$Z3,$Z1 + vpxor $rndkey,$inout3,$inout3 + vmovups 0x10-0x80($key),$T2 # borrow $T2 for $rndkey + vpclmulqdq \$0x01,$Hkey,$Z3,$Z2 + + # At this point, the current block of 96 (0x60) bytes has already been + # loaded into registers. Concurrently with processing it, we want to + # load the next 96 bytes of input for the next round. Obviously, we can + # only do this if there are at least 96 more bytes of input beyond the + # input we're currently processing, or else we'd read past the end of + # the input buffer. Here, we set |%r12| to 96 if there are at least 96 + # bytes of input beyond the 96 bytes we're already processing, and we + # set |%r12| to 0 otherwise. In the case where we set |%r12| to 96, + # we'll read in the next block so that it is in registers for the next + # loop iteration. In the case where we set |%r12| to 0, we'll re-read + # the current block and then ignore what we re-read. + # + # At this point, |$in0| points to the current (already read into + # registers) block, and |$end0| points to 2*96 bytes before the end of + # the input. Thus, |$in0| > |$end0| means that we do not have the next + # 96-byte block to read in, and |$in0| <= |$end0| means we do. + xor %r12,%r12 + cmp $in0,$end0 + + vaesenc $T2,$inout0,$inout0 + vmovdqu 0x30+8(%rsp),$Ii # I[4] + vpxor $rndkey,$inout4,$inout4 + vpclmulqdq \$0x00,$Hkey,$Z3,$T1 + vaesenc $T2,$inout1,$inout1 + vpxor $rndkey,$inout5,$inout5 + setnc %r12b + vpclmulqdq \$0x11,$Hkey,$Z3,$Z3 + vaesenc $T2,$inout2,$inout2 + vmovdqu 0x10-0x20($Xip),$Hkey # $Hkey^2 + neg %r12 + vaesenc $T2,$inout3,$inout3 + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x00,$Hkey,$Ii,$Z1 + vpxor $Z0,$Xi,$Xi # modulo-scheduled + vaesenc $T2,$inout4,$inout4 + vpxor $Z1,$T1,$Z0 + and \$0x60,%r12 + vmovups 0x20-0x80($key),$rndkey + vpclmulqdq \$0x10,$Hkey,$Ii,$T1 + vaesenc $T2,$inout5,$inout5 + + vpclmulqdq \$0x01,$Hkey,$Ii,$T2 + lea ($in0,%r12),$in0 + vaesenc $rndkey,$inout0,$inout0 + vpxor 16+8(%rsp),$Xi,$Xi # modulo-scheduled [vpxor $Z3,$Xi,$Xi] + vpclmulqdq \$0x11,$Hkey,$Ii,$Hkey + vmovdqu 0x40+8(%rsp),$Ii # I[3] + vaesenc $rndkey,$inout1,$inout1 + movbe 0x58($in0),%r13 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x50($in0),%r12 + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x20+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x28+8(%rsp) + vmovdqu 0x30-0x20($Xip),$Z1 # borrow $Z1 for $Hkey^3 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x30-0x80($key),$rndkey + vpxor $T1,$Z2,$Z2 + vpclmulqdq \$0x00,$Z1,$Ii,$T1 + vaesenc $rndkey,$inout0,$inout0 + vpxor $T2,$Z2,$Z2 + vpclmulqdq \$0x10,$Z1,$Ii,$T2 + vaesenc $rndkey,$inout1,$inout1 + vpxor $Hkey,$Z3,$Z3 + vpclmulqdq \$0x01,$Z1,$Ii,$Hkey + vaesenc $rndkey,$inout2,$inout2 + vpclmulqdq \$0x11,$Z1,$Ii,$Z1 + vmovdqu 0x50+8(%rsp),$Ii # I[2] + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vpxor $T1,$Z0,$Z0 + vmovdqu 0x40-0x20($Xip),$T1 # borrow $T1 for $Hkey^4 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x40-0x80($key),$rndkey + vpxor $T2,$Z2,$Z2 + vpclmulqdq \$0x00,$T1,$Ii,$T2 + vaesenc $rndkey,$inout0,$inout0 + vpxor $Hkey,$Z2,$Z2 + vpclmulqdq \$0x10,$T1,$Ii,$Hkey + vaesenc $rndkey,$inout1,$inout1 + movbe 0x48($in0),%r13 + vpxor $Z1,$Z3,$Z3 + vpclmulqdq \$0x01,$T1,$Ii,$Z1 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x40($in0),%r12 + vpclmulqdq \$0x11,$T1,$Ii,$T1 + 
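# Each hashed block costs four vpclmulqdq: selectors 0x00 and 0x11 + # give the low and high halves of the 256-bit product while 0x10 and + # 0x01 give the two cross terms; lows accumulate in $Z0, highs in $Z3 + # and cross terms in $Z2 until the two-phase .Lpoly reduction at the + # end of the iteration. +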
vmovdqu 0x60+8(%rsp),$Ii # I[1] + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x30+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x38+8(%rsp) + vpxor $T2,$Z0,$Z0 + vmovdqu 0x60-0x20($Xip),$T2 # borrow $T2 for $Hkey^5 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x50-0x80($key),$rndkey + vpxor $Hkey,$Z2,$Z2 + vpclmulqdq \$0x00,$T2,$Ii,$Hkey + vaesenc $rndkey,$inout0,$inout0 + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x10,$T2,$Ii,$Z1 + vaesenc $rndkey,$inout1,$inout1 + movbe 0x38($in0),%r13 + vpxor $T1,$Z3,$Z3 + vpclmulqdq \$0x01,$T2,$Ii,$T1 + vpxor 0x70+8(%rsp),$Xi,$Xi # accumulate I[0] + vaesenc $rndkey,$inout2,$inout2 + movbe 0x30($in0),%r12 + vpclmulqdq \$0x11,$T2,$Ii,$T2 + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x40+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x48+8(%rsp) + vpxor $Hkey,$Z0,$Z0 + vmovdqu 0x70-0x20($Xip),$Hkey # $Hkey^6 + vaesenc $rndkey,$inout5,$inout5 + + vmovups 0x60-0x80($key),$rndkey + vpxor $Z1,$Z2,$Z2 + vpclmulqdq \$0x10,$Hkey,$Xi,$Z1 + vaesenc $rndkey,$inout0,$inout0 + vpxor $T1,$Z2,$Z2 + vpclmulqdq \$0x01,$Hkey,$Xi,$T1 + vaesenc $rndkey,$inout1,$inout1 + movbe 0x28($in0),%r13 + vpxor $T2,$Z3,$Z3 + vpclmulqdq \$0x00,$Hkey,$Xi,$T2 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x20($in0),%r12 + vpclmulqdq \$0x11,$Hkey,$Xi,$Xi + vaesenc $rndkey,$inout3,$inout3 + mov %r13,0x50+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + mov %r12,0x58+8(%rsp) + vpxor $Z1,$Z2,$Z2 + vaesenc $rndkey,$inout5,$inout5 + vpxor $T1,$Z2,$Z2 + + vmovups 0x70-0x80($key),$rndkey + vpslldq \$8,$Z2,$Z1 + vpxor $T2,$Z0,$Z0 + vmovdqu 0x10($const),$Hkey # .Lpoly + + vaesenc $rndkey,$inout0,$inout0 + vpxor $Xi,$Z3,$Z3 + vaesenc $rndkey,$inout1,$inout1 + vpxor $Z1,$Z0,$Z0 + movbe 0x18($in0),%r13 + vaesenc $rndkey,$inout2,$inout2 + movbe 0x10($in0),%r12 + vpalignr \$8,$Z0,$Z0,$Ii # 1st phase + vpclmulqdq \$0x10,$Hkey,$Z0,$Z0 + mov %r13,0x60+8(%rsp) + vaesenc $rndkey,$inout3,$inout3 + mov %r12,0x68+8(%rsp) + vaesenc $rndkey,$inout4,$inout4 + vmovups 0x80-0x80($key),$T1 # borrow $T1 for $rndkey + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vmovups 0x90-0x80($key),$rndkey + vaesenc $T1,$inout1,$inout1 + vpsrldq \$8,$Z2,$Z2 + vaesenc $T1,$inout2,$inout2 + vpxor $Z2,$Z3,$Z3 + vaesenc $T1,$inout3,$inout3 + vpxor $Ii,$Z0,$Z0 + movbe 0x08($in0),%r13 + vaesenc $T1,$inout4,$inout4 + movbe 0x00($in0),%r12 + vaesenc $T1,$inout5,$inout5 + vmovups 0xa0-0x80($key),$T1 + cmp \$11,$rounds + jb .Lenc_tail # 128-bit key + + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vaesenc $T1,$inout1,$inout1 + vaesenc $T1,$inout2,$inout2 + vaesenc $T1,$inout3,$inout3 + vaesenc $T1,$inout4,$inout4 + vmovups 0xb0-0x80($key),$rndkey + vaesenc $T1,$inout5,$inout5 + vmovups 0xc0-0x80($key),$T1 + je .Lenc_tail # 192-bit key + + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + + vaesenc $T1,$inout0,$inout0 + vaesenc $T1,$inout1,$inout1 + vaesenc $T1,$inout2,$inout2 + vaesenc $T1,$inout3,$inout3 + vaesenc $T1,$inout4,$inout4 + vmovups 0xd0-0x80($key),$rndkey + vaesenc $T1,$inout5,$inout5 + vmovups 0xe0-0x80($key),$T1 + jmp .Lenc_tail # 256-bit key + +.align 32 +.Lhandle_ctr32: + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + vpshufb $Ii,$T1,$Z2 # byte-swap counter + 
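# Slow path: once byte-swapped to little-endian, the six counters are + # rebuilt with full 32-bit vpaddd adds (+1 via .Lone_lsb, then chained + # +2 via .Ltwo_lsb) and swapped back, replacing the vpaddb fast path. +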
vmovdqu 0x30($const),$Z1 # borrow $Z1, .Ltwo_lsb + vpaddd 0x40($const),$Z2,$inout1 # .Lone_lsb + vpaddd $Z1,$Z2,$inout2 + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpaddd $Z1,$inout1,$inout3 + vpshufb $Ii,$inout1,$inout1 + vpaddd $Z1,$inout2,$inout4 + vpshufb $Ii,$inout2,$inout2 + vpxor $rndkey,$inout1,$inout1 + vpaddd $Z1,$inout3,$inout5 + vpshufb $Ii,$inout3,$inout3 + vpxor $rndkey,$inout2,$inout2 + vpaddd $Z1,$inout4,$T1 # byte-swapped next counter value + vpshufb $Ii,$inout4,$inout4 + vpshufb $Ii,$inout5,$inout5 + vpshufb $Ii,$T1,$T1 # next counter value + jmp .Lresume_ctr32 + +.align 32 +.Lenc_tail: + vaesenc $rndkey,$inout0,$inout0 + vmovdqu $Z3,16+8(%rsp) # postpone vpxor $Z3,$Xi,$Xi + vpalignr \$8,$Z0,$Z0,$Xi # 2nd phase + vaesenc $rndkey,$inout1,$inout1 + vpclmulqdq \$0x10,$Hkey,$Z0,$Z0 + vpxor 0x00($inp),$T1,$T2 + vaesenc $rndkey,$inout2,$inout2 + vpxor 0x10($inp),$T1,$Ii + vaesenc $rndkey,$inout3,$inout3 + vpxor 0x20($inp),$T1,$Z1 + vaesenc $rndkey,$inout4,$inout4 + vpxor 0x30($inp),$T1,$Z2 + vaesenc $rndkey,$inout5,$inout5 + vpxor 0x40($inp),$T1,$Z3 + vpxor 0x50($inp),$T1,$Hkey + vmovdqu ($ivp),$T1 # load next counter value + + vaesenclast $T2,$inout0,$inout0 + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + vaesenclast $Ii,$inout1,$inout1 + vpaddb $T2,$T1,$Ii + mov %r13,0x70+8(%rsp) + lea 0x60($inp),$inp + vaesenclast $Z1,$inout2,$inout2 + vpaddb $T2,$Ii,$Z1 + mov %r12,0x78+8(%rsp) + lea 0x60($out),$out + vmovdqu 0x00-0x80($key),$rndkey + vaesenclast $Z2,$inout3,$inout3 + vpaddb $T2,$Z1,$Z2 + vaesenclast $Z3, $inout4,$inout4 + vpaddb $T2,$Z2,$Z3 + vaesenclast $Hkey,$inout5,$inout5 + vpaddb $T2,$Z3,$Hkey + + add \$0x60,$ret + sub \$0x6,$len + jc .L6x_done + + vmovups $inout0,-0x60($out) # save output + vpxor $rndkey,$T1,$inout0 + vmovups $inout1,-0x50($out) + vmovdqa $Ii,$inout1 # 0 latency + vmovups $inout2,-0x40($out) + vmovdqa $Z1,$inout2 # 0 latency + vmovups $inout3,-0x30($out) + vmovdqa $Z2,$inout3 # 0 latency + vmovups $inout4,-0x20($out) + vmovdqa $Z3,$inout4 # 0 latency + vmovups $inout5,-0x10($out) + vmovdqa $Hkey,$inout5 # 0 latency + vmovdqu 0x20+8(%rsp),$Z3 # I[5] + jmp .Loop6x + +.L6x_done: + vpxor 16+8(%rsp),$Xi,$Xi # modulo-scheduled + vpxor $Z0,$Xi,$Xi # modulo-scheduled + + ret +.cfi_endproc +.size _aesni_ctr32_ghash_6x,.-_aesni_ctr32_ghash_6x +___ +###################################################################### +# +# size_t aesni_gcm_[en|de]crypt(const void *inp, void *out, size_t len, +# const AES_KEY *key, struct { u128 Yi, Xi; } *yi_xi, +# struct { u128 H,Htbl[9]; } *H_htbl); +$code.=<<___; +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,\@function,6 +.align 32 +aesni_gcm_decrypt: +.cfi_startproc + xor $ret,$ret + + # We call |_aesni_ctr32_ghash_6x|, which requires at least 96 (0x60) + # bytes of input. 
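+ # Shorter inputs are rejected up front: $ret stays 0 and the buffers + # are left untouched, the expectation being that the portable C code + # processes anything below this threshold block by block on its own.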
+ cmp \$0x60,$len # minimal accepted length + jb .Lgcm_dec_abort + + lea (%rsp),%rax # save stack pointer +.cfi_def_cfa_register %rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +___ + +$code .= <<___ if ($win64); + lea -0xa8(%rsp),%rsp + movaps %xmm6,-0xd8(%rax) + movaps %xmm7,-0xc8(%rax) + movaps %xmm8,-0xb8(%rax) + movaps %xmm9,-0xa8(%rax) + movaps %xmm10,-0x98(%rax) + movaps %xmm11,-0x88(%rax) + movaps %xmm12,-0x78(%rax) + movaps %xmm13,-0x68(%rax) + movaps %xmm14,-0x58(%rax) + movaps %xmm15,-0x48(%rax) +.Lgcm_dec_body: +___ + +$code.=<<___; + vzeroupper + + vmovdqu ($ivp),$T1 # input counter value + add \$-128,%rsp + mov 12($ivp),$counter + lea .Lbswap_mask(%rip),$const + lea -0x80($key),$in0 # borrow $in0 + mov \$0xf80,$end0 # borrow $end0 + vmovdqu 0x10($ivp),$Xi # load Xi + and \$-128,%rsp # ensure stack alignment + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + lea 0x80($key),$key # size optimization + lea 0x10+0x20($Xip),$Xip # size optimization + mov 0xf0-0x80($key),$rounds + vpshufb $Ii,$Xi,$Xi + + and $end0,$in0 + and %rsp,$end0 + sub $in0,$end0 + jc .Ldec_no_key_aliasing + cmp \$768,$end0 + jnc .Ldec_no_key_aliasing + sub $end0,%rsp # avoid aliasing with key +.Ldec_no_key_aliasing: + + vmovdqu 0x50($inp),$Z3 # I[5] + lea ($inp),$in0 + vmovdqu 0x40($inp),$Z0 + + # |_aesni_ctr32_ghash_6x| requires |$end0| to point to 2*96 (0xc0) + # bytes before the end of the input. Note, in particular, that this is + # correct even if |$len| is not an even multiple of 96 or 16. XXX: This + # seems to require that |$inp| + |$len| >= 2*96 (0xc0); i.e. |$inp| must + # not be near the very beginning of the address space when |$len| < 2*96 + # (0xc0). 
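+ # Worked example: with |$len| = 0x60 exactly, |$end0| = |$inp| - 0x60, + # so the cmp/setnc in .Loop6x always leaves %r12 = 0 and the loop only + # ever re-reads (and discards) the current block instead of + # prefetching a following one.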
+ lea -0xc0($inp,$len),$end0 + + vmovdqu 0x30($inp),$Z1 + shr \$4,$len + xor $ret,$ret + vmovdqu 0x20($inp),$Z2 + vpshufb $Ii,$Z3,$Z3 # passed to _aesni_ctr32_ghash_6x + vmovdqu 0x10($inp),$T2 + vpshufb $Ii,$Z0,$Z0 + vmovdqu ($inp),$Hkey + vpshufb $Ii,$Z1,$Z1 + vmovdqu $Z0,0x30(%rsp) + vpshufb $Ii,$Z2,$Z2 + vmovdqu $Z1,0x40(%rsp) + vpshufb $Ii,$T2,$T2 + vmovdqu $Z2,0x50(%rsp) + vpshufb $Ii,$Hkey,$Hkey + vmovdqu $T2,0x60(%rsp) + vmovdqu $Hkey,0x70(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups $inout0,-0x60($out) # save output + vmovups $inout1,-0x50($out) + vmovups $inout2,-0x40($out) + vmovups $inout3,-0x30($out) + vmovups $inout4,-0x20($out) + vmovups $inout5,-0x10($out) + + vpshufb ($const),$Xi,$Xi # .Lbswap_mask + vmovdqu $Xi,0x10($ivp) # output Xi + + vzeroupper +___ + +$code.=<<___ if ($win64); + movaps -0xd8(%rax),%xmm6 + movaps -0xc8(%rax),%xmm7 + movaps -0xb8(%rax),%xmm8 + movaps -0xa8(%rax),%xmm9 + movaps -0x98(%rax),%xmm10 + movaps -0x88(%rax),%xmm11 + movaps -0x78(%rax),%xmm12 + movaps -0x68(%rax),%xmm13 + movaps -0x58(%rax),%xmm14 + movaps -0x48(%rax),%xmm15 +___ +$code.=<<___; + mov -48(%rax),%r15 +.cfi_restore %r15 + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp # restore %rsp +.cfi_def_cfa_register %rsp +.Lgcm_dec_abort: + mov $ret,%rax # return value + ret +.cfi_endproc +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +___ + +$code.=<<___; +.type _aesni_ctr32_6x,\@abi-omnipotent +.align 32 +_aesni_ctr32_6x: +.cfi_startproc + vmovdqu 0x00-0x80($key),$Z0 # borrow $Z0 for $rndkey + vmovdqu 0x20($const),$T2 # borrow $T2, .Lone_msb + lea -1($rounds),%r13 + vmovups 0x10-0x80($key),$rndkey + lea 0x20-0x80($key),%r12 + vpxor $Z0,$T1,$inout0 + add \$`6<<24`,$counter + jc .Lhandle_ctr32_2 + vpaddb $T2,$T1,$inout1 + vpaddb $T2,$inout1,$inout2 + vpxor $Z0,$inout1,$inout1 + vpaddb $T2,$inout2,$inout3 + vpxor $Z0,$inout2,$inout2 + vpaddb $T2,$inout3,$inout4 + vpxor $Z0,$inout3,$inout3 + vpaddb $T2,$inout4,$inout5 + vpxor $Z0,$inout4,$inout4 + vpaddb $T2,$inout5,$T1 + vpxor $Z0,$inout5,$inout5 + jmp .Loop_ctr32 + +.align 16 +.Loop_ctr32: + vaesenc $rndkey,$inout0,$inout0 + vaesenc $rndkey,$inout1,$inout1 + vaesenc $rndkey,$inout2,$inout2 + vaesenc $rndkey,$inout3,$inout3 + vaesenc $rndkey,$inout4,$inout4 + vaesenc $rndkey,$inout5,$inout5 + vmovups (%r12),$rndkey + lea 0x10(%r12),%r12 + dec %r13d + jnz .Loop_ctr32 + + vmovdqu (%r12),$Hkey # last round key + vaesenc $rndkey,$inout0,$inout0 + vpxor 0x00($inp),$Hkey,$Z0 + vaesenc $rndkey,$inout1,$inout1 + vpxor 0x10($inp),$Hkey,$Z1 + vaesenc $rndkey,$inout2,$inout2 + vpxor 0x20($inp),$Hkey,$Z2 + vaesenc $rndkey,$inout3,$inout3 + vpxor 0x30($inp),$Hkey,$Xi + vaesenc $rndkey,$inout4,$inout4 + vpxor 0x40($inp),$Hkey,$T2 + vaesenc $rndkey,$inout5,$inout5 + vpxor 0x50($inp),$Hkey,$Hkey + lea 0x60($inp),$inp + + vaesenclast $Z0,$inout0,$inout0 + vaesenclast $Z1,$inout1,$inout1 + vaesenclast $Z2,$inout2,$inout2 + vaesenclast $Xi,$inout3,$inout3 + vaesenclast $T2,$inout4,$inout4 + vaesenclast $Hkey,$inout5,$inout5 + vmovups $inout0,0x00($out) + vmovups $inout1,0x10($out) + vmovups $inout2,0x20($out) + vmovups $inout3,0x30($out) + vmovups $inout4,0x40($out) + vmovups $inout5,0x50($out) + lea 0x60($out),$out + + ret +.align 32 +.Lhandle_ctr32_2: + vpshufb $Ii,$T1,$Z2 # byte-swap counter + vmovdqu 0x30($const),$Z1 # borrow $Z1, .Ltwo_lsb + vpaddd 0x40($const),$Z2,$inout1 # .Lone_lsb + 
vpaddd $Z1,$Z2,$inout2 + vpaddd $Z1,$inout1,$inout3 + vpshufb $Ii,$inout1,$inout1 + vpaddd $Z1,$inout2,$inout4 + vpshufb $Ii,$inout2,$inout2 + vpxor $Z0,$inout1,$inout1 + vpaddd $Z1,$inout3,$inout5 + vpshufb $Ii,$inout3,$inout3 + vpxor $Z0,$inout2,$inout2 + vpaddd $Z1,$inout4,$T1 # byte-swapped next counter value + vpshufb $Ii,$inout4,$inout4 + vpxor $Z0,$inout3,$inout3 + vpshufb $Ii,$inout5,$inout5 + vpxor $Z0,$inout4,$inout4 + vpshufb $Ii,$T1,$T1 # next counter value + vpxor $Z0,$inout5,$inout5 + jmp .Loop_ctr32 +.cfi_endproc +.size _aesni_ctr32_6x,.-_aesni_ctr32_6x + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,\@function,6 +.align 32 +aesni_gcm_encrypt: +.cfi_startproc + xor $ret,$ret + + # We call |_aesni_ctr32_6x| twice, each call consuming 96 bytes of + # input. Then we call |_aesni_ctr32_ghash_6x|, which requires at + # least 96 more bytes of input. + cmp \$0x60*3,$len # minimal accepted length + jb .Lgcm_enc_abort + + lea (%rsp),%rax # save stack pointer +.cfi_def_cfa_register %rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +___ +$code.=<<___ if ($win64); + lea -0xa8(%rsp),%rsp + movaps %xmm6,-0xd8(%rax) + movaps %xmm7,-0xc8(%rax) + movaps %xmm8,-0xb8(%rax) + movaps %xmm9,-0xa8(%rax) + movaps %xmm10,-0x98(%rax) + movaps %xmm11,-0x88(%rax) + movaps %xmm12,-0x78(%rax) + movaps %xmm13,-0x68(%rax) + movaps %xmm14,-0x58(%rax) + movaps %xmm15,-0x48(%rax) +.Lgcm_enc_body: +___ +$code.=<<___; + vzeroupper + + vmovdqu ($ivp),$T1 # input counter value + add \$-128,%rsp + mov 12($ivp),$counter + lea .Lbswap_mask(%rip),$const + lea -0x80($key),$in0 # borrow $in0 + mov \$0xf80,$end0 # borrow $end0 + lea 0x80($key),$key # size optimization + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + and \$-128,%rsp # ensure stack alignment + mov 0xf0-0x80($key),$rounds + + and $end0,$in0 + and %rsp,$end0 + sub $in0,$end0 + jc .Lenc_no_key_aliasing + cmp \$768,$end0 + jnc .Lenc_no_key_aliasing + sub $end0,%rsp # avoid aliasing with key +.Lenc_no_key_aliasing: + + lea ($out),$in0 + + # |_aesni_ctr32_ghash_6x| requires |$end0| to point to 2*96 (0xc0) + # bytes before the end of the input. Note, in particular, that this is + # correct even if |$len| is not an even multiple of 96 or 16. Unlike in + # the decryption case, there's no caveat that |$out| must not be near + # the very beginning of the address space, because we know that + # |$len| >= 3*96 from the check above, and so we know + # |$out| + |$len| >= 2*96 (0xc0). 
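+ # (For encryption GHASH is computed over the ciphertext just written, + # which is why $in0 and $end0 are derived from $out here rather than + # from $inp as in the decrypt path.)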
+ lea -0xc0($out,$len),$end0 + + shr \$4,$len + + call _aesni_ctr32_6x + + vpshufb $Ii,$inout0,$Xi # save bswapped output on stack + vpshufb $Ii,$inout1,$T2 + vmovdqu $Xi,0x70(%rsp) + vpshufb $Ii,$inout2,$Z0 + vmovdqu $T2,0x60(%rsp) + vpshufb $Ii,$inout3,$Z1 + vmovdqu $Z0,0x50(%rsp) + vpshufb $Ii,$inout4,$Z2 + vmovdqu $Z1,0x40(%rsp) + vpshufb $Ii,$inout5,$Z3 # passed to _aesni_ctr32_ghash_6x + vmovdqu $Z2,0x30(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 0x10($ivp),$Xi # load Xi + lea 0x10+0x20($Xip),$Xip # size optimization + sub \$12,$len + mov \$0x60*2,$ret + vpshufb $Ii,$Xi,$Xi + + call _aesni_ctr32_ghash_6x + vmovdqu 0x20(%rsp),$Z3 # I[5] + vmovdqu ($const),$Ii # borrow $Ii for .Lbswap_mask + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpunpckhqdq $Z3,$Z3,$T1 + vmovdqu 0x20-0x20($Xip),$rndkey # borrow $rndkey for $HK + vmovups $inout0,-0x60($out) # save output + vpshufb $Ii,$inout0,$inout0 # but keep bswapped copy + vpxor $Z3,$T1,$T1 + vmovups $inout1,-0x50($out) + vpshufb $Ii,$inout1,$inout1 + vmovups $inout2,-0x40($out) + vpshufb $Ii,$inout2,$inout2 + vmovups $inout3,-0x30($out) + vpshufb $Ii,$inout3,$inout3 + vmovups $inout4,-0x20($out) + vpshufb $Ii,$inout4,$inout4 + vmovups $inout5,-0x10($out) + vpshufb $Ii,$inout5,$inout5 + vmovdqu $inout0,0x10(%rsp) # free $inout0 +___ +{ my ($HK,$T3)=($rndkey,$inout0); + +$code.=<<___; + vmovdqu 0x30(%rsp),$Z2 # I[4] + vmovdqu 0x10-0x20($Xip),$Ii # borrow $Ii for $Hkey^2 + vpunpckhqdq $Z2,$Z2,$T2 + vpclmulqdq \$0x00,$Hkey,$Z3,$Z1 + vpxor $Z2,$T2,$T2 + vpclmulqdq \$0x11,$Hkey,$Z3,$Z3 + vpclmulqdq \$0x00,$HK,$T1,$T1 + + vmovdqu 0x40(%rsp),$T3 # I[3] + vpclmulqdq \$0x00,$Ii,$Z2,$Z0 + vmovdqu 0x30-0x20($Xip),$Hkey # $Hkey^3 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $T3,$T3,$Z1 + vpclmulqdq \$0x11,$Ii,$Z2,$Z2 + vpxor $T3,$Z1,$Z1 + vpxor $Z3,$Z2,$Z2 + vpclmulqdq \$0x10,$HK,$T2,$T2 + vmovdqu 0x50-0x20($Xip),$HK + vpxor $T1,$T2,$T2 + + vmovdqu 0x50(%rsp),$T1 # I[2] + vpclmulqdq \$0x00,$Hkey,$T3,$Z3 + vmovdqu 0x40-0x20($Xip),$Ii # borrow $Ii for $Hkey^4 + vpxor $Z0,$Z3,$Z3 + vpunpckhqdq $T1,$T1,$Z0 + vpclmulqdq \$0x11,$Hkey,$T3,$T3 + vpxor $T1,$Z0,$Z0 + vpxor $Z2,$T3,$T3 + vpclmulqdq \$0x00,$HK,$Z1,$Z1 + vpxor $T2,$Z1,$Z1 + + vmovdqu 0x60(%rsp),$T2 # I[1] + vpclmulqdq \$0x00,$Ii,$T1,$Z2 + vmovdqu 0x60-0x20($Xip),$Hkey # $Hkey^5 + vpxor $Z3,$Z2,$Z2 + vpunpckhqdq $T2,$T2,$Z3 + vpclmulqdq \$0x11,$Ii,$T1,$T1 + vpxor $T2,$Z3,$Z3 + vpxor $T3,$T1,$T1 + vpclmulqdq \$0x10,$HK,$Z0,$Z0 + vmovdqu 0x80-0x20($Xip),$HK + vpxor $Z1,$Z0,$Z0 + + vpxor 0x70(%rsp),$Xi,$Xi # accumulate I[0] + vpclmulqdq \$0x00,$Hkey,$T2,$Z1 + vmovdqu 0x70-0x20($Xip),$Ii # borrow $Ii for $Hkey^6 + vpunpckhqdq $Xi,$Xi,$T3 + vpxor $Z2,$Z1,$Z1 + vpclmulqdq \$0x11,$Hkey,$T2,$T2 + vpxor $Xi,$T3,$T3 + vpxor $T1,$T2,$T2 + vpclmulqdq \$0x00,$HK,$Z3,$Z3 + vpxor $Z0,$Z3,$Z0 + + vpclmulqdq \$0x00,$Ii,$Xi,$Z2 + vmovdqu 0x00-0x20($Xip),$Hkey # $Hkey^1 + vpunpckhqdq $inout5,$inout5,$T1 + vpclmulqdq \$0x11,$Ii,$Xi,$Xi + vpxor $inout5,$T1,$T1 + vpxor $Z1,$Z2,$Z1 + vpclmulqdq \$0x10,$HK,$T3,$T3 + vmovdqu 0x20-0x20($Xip),$HK + vpxor $T2,$Xi,$Z3 + vpxor $Z0,$T3,$Z2 + + vmovdqu 0x10-0x20($Xip),$Ii # borrow $Ii for $Hkey^2 + vpxor $Z1,$Z3,$T3 # aggregated Karatsuba post-processing + vpclmulqdq \$0x00,$Hkey,$inout5,$Z0 + vpxor $T3,$Z2,$Z2 + vpunpckhqdq $inout4,$inout4,$T2 + vpclmulqdq \$0x11,$Hkey,$inout5,$inout5 + vpxor $inout4,$T2,$T2 + vpslldq \$8,$Z2,$T3 + vpclmulqdq \$0x00,$HK,$T1,$T1 + vpxor $T3,$Z1,$Xi + vpsrldq \$8,$Z2,$Z2 + vpxor $Z2,$Z3,$Z3 + + vpclmulqdq \$0x00,$Ii,$inout4,$Z1 + vmovdqu 0x30-0x20($Xip),$Hkey # 
$Hkey^3 + vpxor $Z0,$Z1,$Z1 + vpunpckhqdq $inout3,$inout3,$T3 + vpclmulqdq \$0x11,$Ii,$inout4,$inout4 + vpxor $inout3,$T3,$T3 + vpxor $inout5,$inout4,$inout4 + vpalignr \$8,$Xi,$Xi,$inout5 # 1st phase + vpclmulqdq \$0x10,$HK,$T2,$T2 + vmovdqu 0x50-0x20($Xip),$HK + vpxor $T1,$T2,$T2 + + vpclmulqdq \$0x00,$Hkey,$inout3,$Z0 + vmovdqu 0x40-0x20($Xip),$Ii # borrow $Ii for $Hkey^4 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $inout2,$inout2,$T1 + vpclmulqdq \$0x11,$Hkey,$inout3,$inout3 + vpxor $inout2,$T1,$T1 + vpxor $inout4,$inout3,$inout3 + vxorps 0x10(%rsp),$Z3,$Z3 # accumulate $inout0 + vpclmulqdq \$0x00,$HK,$T3,$T3 + vpxor $T2,$T3,$T3 + + vpclmulqdq \$0x10,0x10($const),$Xi,$Xi + vxorps $inout5,$Xi,$Xi + + vpclmulqdq \$0x00,$Ii,$inout2,$Z1 + vmovdqu 0x60-0x20($Xip),$Hkey # $Hkey^5 + vpxor $Z0,$Z1,$Z1 + vpunpckhqdq $inout1,$inout1,$T2 + vpclmulqdq \$0x11,$Ii,$inout2,$inout2 + vpxor $inout1,$T2,$T2 + vpalignr \$8,$Xi,$Xi,$inout5 # 2nd phase + vpxor $inout3,$inout2,$inout2 + vpclmulqdq \$0x10,$HK,$T1,$T1 + vmovdqu 0x80-0x20($Xip),$HK + vpxor $T3,$T1,$T1 + + vxorps $Z3,$inout5,$inout5 + vpclmulqdq \$0x10,0x10($const),$Xi,$Xi + vxorps $inout5,$Xi,$Xi + + vpclmulqdq \$0x00,$Hkey,$inout1,$Z0 + vmovdqu 0x70-0x20($Xip),$Ii # borrow $Ii for $Hkey^6 + vpxor $Z1,$Z0,$Z0 + vpunpckhqdq $Xi,$Xi,$T3 + vpclmulqdq \$0x11,$Hkey,$inout1,$inout1 + vpxor $Xi,$T3,$T3 + vpxor $inout2,$inout1,$inout1 + vpclmulqdq \$0x00,$HK,$T2,$T2 + vpxor $T1,$T2,$T2 + + vpclmulqdq \$0x00,$Ii,$Xi,$Z1 + vpclmulqdq \$0x11,$Ii,$Xi,$Z3 + vpxor $Z0,$Z1,$Z1 + vpclmulqdq \$0x10,$HK,$T3,$Z2 + vpxor $inout1,$Z3,$Z3 + vpxor $T2,$Z2,$Z2 + + vpxor $Z1,$Z3,$Z0 # aggregated Karatsuba post-processing + vpxor $Z0,$Z2,$Z2 + vpslldq \$8,$Z2,$T1 + vmovdqu 0x10($const),$Hkey # .Lpoly + vpsrldq \$8,$Z2,$Z2 + vpxor $T1,$Z1,$Xi + vpxor $Z2,$Z3,$Z3 + + vpalignr \$8,$Xi,$Xi,$T2 # 1st phase + vpclmulqdq \$0x10,$Hkey,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + vpalignr \$8,$Xi,$Xi,$T2 # 2nd phase + vpclmulqdq \$0x10,$Hkey,$Xi,$Xi + vpxor $Z3,$T2,$T2 + vpxor $T2,$Xi,$Xi +___ +} +$code.=<<___; + vpshufb ($const),$Xi,$Xi # .Lbswap_mask + vmovdqu $Xi,0x10($ivp) # output Xi + + vzeroupper +___ +$code.=<<___ if ($win64); + movaps -0xd8(%rax),%xmm6 + movaps -0xc8(%rax),%xmm7 + movaps -0xb8(%rax),%xmm8 + movaps -0xa8(%rax),%xmm9 + movaps -0x98(%rax),%xmm10 + movaps -0x88(%rax),%xmm11 + movaps -0x78(%rax),%xmm12 + movaps -0x68(%rax),%xmm13 + movaps -0x58(%rax),%xmm14 + movaps -0x48(%rax),%xmm15 +___ +$code.=<<___; + mov -48(%rax),%r15 +.cfi_restore %r15 + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp # restore %rsp +.cfi_def_cfa_register %rsp +.Lgcm_enc_abort: + mov $ret,%rax # return value + ret +.cfi_endproc +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt +___ + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lpoly: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.Lone_msb: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Ltwo_lsb: + .byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.Lone_lsb: + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.asciz "AES-NI GCM module for x86_64, CRYPTOGAMS by " +.align 64 +___ +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___ +.extern __imp_RtlVirtualUnwind +.type gcm_se_handler,\@abi-omnipotent +.align 16 +gcm_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub 
\$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 120($context),%rax # pull context->Rax + + mov -48(%rax),%r15 + mov -40(%rax),%r14 + mov -32(%rax),%r13 + mov -24(%rax),%r12 + mov -16(%rax),%rbp + mov -8(%rax),%rbx + mov %r15,240($context) + mov %r14,232($context) + mov %r13,224($context) + mov %r12,216($context) + mov %rbp,160($context) + mov %rbx,144($context) + + lea -0xd8(%rax),%rsi # %xmm save area + lea 512($context),%rdi # & context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size gcm_se_handler,.-gcm_se_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_aesni_gcm_decrypt + .rva .LSEH_end_aesni_gcm_decrypt + .rva .LSEH_gcm_dec_info + + .rva .LSEH_begin_aesni_gcm_encrypt + .rva .LSEH_end_aesni_gcm_encrypt + .rva .LSEH_gcm_enc_info +.section .xdata +.align 8 +.LSEH_gcm_dec_info: + .byte 9,0,0,0 + .rva gcm_se_handler + .rva .Lgcm_dec_body,.Lgcm_dec_abort +.LSEH_gcm_enc_info: + .byte 9,0,0,0 + .rva gcm_se_handler + .rva .Lgcm_enc_body,.Lgcm_enc_abort +___ +} +}}} else {{{ +$code=<<___; # assembler is too old +.text + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,\@abi-omnipotent +aesni_gcm_encrypt: + xor %eax,%eax + ret +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt + +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,\@abi-omnipotent +aesni_gcm_decrypt: + xor %eax,%eax + ret +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +___ +}}} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/aesni-x86.pl b/crypto/aesgcm/aesni-x86.pl new file mode 100644 index 0000000..cf1a51e --- /dev/null +++ b/crypto/aesgcm/aesni-x86.pl @@ -0,0 +1,2544 @@ +#! /usr/bin/env perl +# Copyright 2009-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + + +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project.
The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements support for the Intel AES-NI extension. In +# the OpenSSL context it is used with the Intel engine, but it can +# also be used as a drop-in replacement for crypto/aes/asm/aes-586.pl +# [see below for details]. +# +# Performance. +# +# To start with, see the corresponding paragraph in aesni-x86_64.pl... +# Instead of filling in a table similar to the one found there, I've +# chosen to summarize *comparison* results for raw ECB, CTR and CBC +# benchmarks. The simplified table below represents 32-bit performance +# relative to 64-bit performance at every given point. Ratios vary for +# different encryption modes, hence the interval values. +# +# 16-byte 64-byte 256-byte 1-KB 8-KB +# 53-67% 67-84% 91-94% 95-98% 97-99.5% +# +# Lower ratios for smaller block sizes are perfectly understandable, +# because function call overhead is higher in 32-bit mode. Performance +# for the largest 8-KB blocks is virtually the same: 32-bit code is +# less than 1% slower for ECB, CBC and CCM, and ~3% slower otherwise. + +# January 2011 +# +# See aesni-x86_64.pl for details. Unlike the x86_64 version, this +# module interleaves at most 6 aes[enc|dec] instructions, because there +# are not enough registers for an 8x interleave [which should be +# optimal for Sandy Bridge]. Actually, the performance results for a 6x +# interleave factor presented in aesni-x86_64.pl (except for CTR) are +# for this module. + +# April 2011 +# +# Add aesni_xts_[en|de]crypt. Westmere spends 1.50 cycles processing +# one byte out of 8KB with a 128-bit key; Sandy Bridge, 1.09. + +# November 2015 +# +# Add aesni_ocb_[en|de]crypt. [Removed in BoringSSL] + +###################################################################### +# Current large-block performance in cycles per byte processed with a +# 128-bit key (less is better).
+# +# CBC en-/decrypt CTR XTS ECB OCB +# Westmere 3.77/1.37 1.37 1.52 1.27 +# * Bridge 5.07/0.98 0.99 1.09 0.91 1.10 +# Haswell 4.44/0.80 0.97 1.03 0.72 0.76 +# Skylake 2.68/0.65 0.65 0.66 0.64 0.66 +# Silvermont 5.77/3.56 3.67 4.03 3.46 4.03 +# Goldmont 3.84/1.39 1.39 1.63 1.31 1.70 +# Bulldozer 5.80/0.98 1.05 1.24 0.93 1.23 + +$PREFIX="aesni"; # if $PREFIX is set to "AES", the script + # generates drop-in replacement for + # crypto/aes/asm/aes-586.pl:-) +$inline=1; # inline _aesni_[en|de]crypt + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../../perlasm"); +require "x86asm.pl"; + +$output = pop; +open OUT,">$output"; +*STDOUT=*OUT; + +&asm_init($ARGV[0]); + +&external_label("OPENSSL_ia32cap_P"); +&static_label("key_const"); + +if ($PREFIX eq "aesni") { $movekey=\&movups; } +else { $movekey=\&movups; } + +$len="eax"; +$rounds="ecx"; +$key="edx"; +$inp="esi"; +$out="edi"; +$rounds_="ebx"; # backup copy for $rounds +$key_="ebp"; # backup copy for $key + +$rndkey0="xmm0"; +$rndkey1="xmm1"; +$inout0="xmm2"; +$inout1="xmm3"; +$inout2="xmm4"; +$inout3="xmm5"; $in1="xmm5"; +$inout4="xmm6"; $in0="xmm6"; +$inout5="xmm7"; $ivec="xmm7"; + +# AESNI extension +sub aeskeygenassist +{ my($dst,$src,$imm)=@_; + if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/) + { &data_byte(0x66,0x0f,0x3a,0xdf,0xc0|($1<<3)|$2,$imm); } +} +sub aescommon +{ my($opcodelet,$dst,$src)=@_; + if ("$dst:$src" =~ /xmm([0-7]):xmm([0-7])/) + { &data_byte(0x66,0x0f,0x38,$opcodelet,0xc0|($1<<3)|$2);} +} +sub aesimc { aescommon(0xdb,@_); } +sub aesenc { aescommon(0xdc,@_); } +sub aesenclast { aescommon(0xdd,@_); } +sub aesdec { aescommon(0xde,@_); } +sub aesdeclast { aescommon(0xdf,@_); } + +# Inline version of internal aesni_[en|de]crypt1 +{ my $sn; +sub aesni_inline_generate1 +{ my ($p,$inout,$ivec)=@_; $inout=$inout0 if (!defined($inout)); + $sn++; + + &$movekey ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($ivec,$rndkey0) if (defined($ivec)); + &lea ($key,&DWP(32,$key)); + &xorps ($inout,$ivec) if (defined($ivec)); + &xorps ($inout,$rndkey0) if (!defined($ivec)); + &set_label("${p}1_loop_$sn"); + eval"&aes${p} ($inout,$rndkey1)"; + &dec ($rounds); + &$movekey ($rndkey1,&QWP(0,$key)); + &lea ($key,&DWP(16,$key)); + &jnz (&label("${p}1_loop_$sn")); + eval"&aes${p}last ($inout,$rndkey1)"; +}} + +sub aesni_generate1 # fully unrolled loop +{ my ($p,$inout)=@_; $inout=$inout0 if (!defined($inout)); + + &function_begin_B("_aesni_${p}rypt1"); + &movups ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(0x10,$key)); + &xorps ($inout,$rndkey0); + &$movekey ($rndkey0,&QWP(0x20,$key)); + &lea ($key,&DWP(0x30,$key)); + &cmp ($rounds,11); + &jb (&label("${p}128")); + &lea ($key,&DWP(0x20,$key)); + &je (&label("${p}192")); + &lea ($key,&DWP(0x20,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(-0x40,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-0x30,$key)); + &set_label("${p}192"); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(-0x20,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-0x10,$key)); + &set_label("${p}128"); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x10,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x20,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x30,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x40,$key)); + eval"&aes${p} 
($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x50,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0x60,$key)); + eval"&aes${p} ($inout,$rndkey0)"; + &$movekey ($rndkey0,&QWP(0x70,$key)); + eval"&aes${p} ($inout,$rndkey1)"; + eval"&aes${p}last ($inout,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt1"); +} + +# void $PREFIX_encrypt (const void *inp,void *out,const AES_KEY *key); +&aesni_generate1("enc") if (!$inline); +&function_begin_B("${PREFIX}_encrypt"); + &mov ("eax",&wparam(0)); + &mov ($key,&wparam(2)); + &movups ($inout0,&QWP(0,"eax")); + &mov ($rounds,&DWP(240,$key)); + &mov ("eax",&wparam(1)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &pxor ($rndkey0,$rndkey0); # clear register bank + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,"eax"),$inout0); + &pxor ($inout0,$inout0); + &ret (); +&function_end_B("${PREFIX}_encrypt"); + +# void $PREFIX_decrypt (const void *inp,void *out,const AES_KEY *key); +&aesni_generate1("dec") if(!$inline); +&function_begin_B("${PREFIX}_decrypt"); + &mov ("eax",&wparam(0)); + &mov ($key,&wparam(2)); + &movups ($inout0,&QWP(0,"eax")); + &mov ($rounds,&DWP(240,$key)); + &mov ("eax",&wparam(1)); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &pxor ($rndkey0,$rndkey0); # clear register bank + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,"eax"),$inout0); + &pxor ($inout0,$inout0); + &ret (); +&function_end_B("${PREFIX}_decrypt"); + +# _aesni_[en|de]cryptN are private interfaces, N denotes interleave +# factor. Why 3x subroutine were originally used in loops? Even though +# aes[enc|dec] latency was originally 6, it could be scheduled only +# every *2nd* cycle. Thus 3x interleave was the one providing optimal +# utilization, i.e. when subroutine's throughput is virtually same as +# of non-interleaved subroutine [for number of input blocks up to 3]. +# This is why it originally made no sense to implement 2x subroutine. +# But times change and it became appropriate to spend extra 192 bytes +# on 2x subroutine on Atom Silvermont account. For processors that +# can schedule aes[enc|dec] every cycle optimal interleave factor +# equals to corresponding instructions latency. 8x is optimal for +# * Bridge, but it's unfeasible to accommodate such implementation +# in XMM registers addressable in 32-bit mode and therefore maximum +# of 6x is used instead... 
+ +sub aesni_generate2 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt2"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &add ($rounds,16); + + &set_label("${p}2_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}2_loop")); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt2"); +} + +sub aesni_generate3 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt3"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &pxor ($inout2,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &add ($rounds,16); + + &set_label("${p}3_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}3_loop")); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt3"); +} + +# 4x interleave is implemented to improve small block performance, +# most notably [and naturally] 4 block by ~30%. One can argue that one +# should have implemented 5x as well, but improvement would be <20%, +# so it's not worth it... 
+sub aesni_generate4 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt4"); + &$movekey ($rndkey0,&QWP(0,$key)); + &$movekey ($rndkey1,&QWP(16,$key)); + &shl ($rounds,4); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); + &pxor ($inout2,$rndkey0); + &pxor ($inout3,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key)); + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + &data_byte (0x0f,0x1f,0x40,0x00); + &add ($rounds,16); + + &set_label("${p}4_loop"); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + eval"&aes${p} ($inout3,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}4_loop")); + + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + eval"&aes${p}last ($inout3,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt4"); +} + +sub aesni_generate6 +{ my $p=shift; + + &function_begin_B("_aesni_${p}rypt6"); + &static_label("_aesni_${p}rypt6_enter"); + &$movekey ($rndkey0,&QWP(0,$key)); + &shl ($rounds,4); + &$movekey ($rndkey1,&QWP(16,$key)); + &xorps ($inout0,$rndkey0); + &pxor ($inout1,$rndkey0); # pxor does better here + &pxor ($inout2,$rndkey0); + eval"&aes${p} ($inout0,$rndkey1)"; + &pxor ($inout3,$rndkey0); + &pxor ($inout4,$rndkey0); + eval"&aes${p} ($inout1,$rndkey1)"; + &lea ($key,&DWP(32,$key,$rounds)); + &neg ($rounds); + eval"&aes${p} ($inout2,$rndkey1)"; + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(0,$key,$rounds)); + &add ($rounds,16); + &jmp (&label("_aesni_${p}rypt6_inner")); + + &set_label("${p}6_loop",16); + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + &set_label("_aesni_${p}rypt6_inner"); + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p} ($inout4,$rndkey1)"; + eval"&aes${p} ($inout5,$rndkey1)"; + &set_label("_aesni_${p}rypt6_enter"); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + eval"&aes${p} ($inout0,$rndkey0)"; + eval"&aes${p} ($inout1,$rndkey0)"; + eval"&aes${p} ($inout2,$rndkey0)"; + eval"&aes${p} ($inout3,$rndkey0)"; + eval"&aes${p} ($inout4,$rndkey0)"; + eval"&aes${p} ($inout5,$rndkey0)"; + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("${p}6_loop")); + + eval"&aes${p} ($inout0,$rndkey1)"; + eval"&aes${p} ($inout1,$rndkey1)"; + eval"&aes${p} ($inout2,$rndkey1)"; + eval"&aes${p} ($inout3,$rndkey1)"; + eval"&aes${p} ($inout4,$rndkey1)"; + eval"&aes${p} ($inout5,$rndkey1)"; + eval"&aes${p}last ($inout0,$rndkey0)"; + eval"&aes${p}last ($inout1,$rndkey0)"; + eval"&aes${p}last ($inout2,$rndkey0)"; + eval"&aes${p}last ($inout3,$rndkey0)"; + eval"&aes${p}last ($inout4,$rndkey0)"; + eval"&aes${p}last ($inout5,$rndkey0)"; + &ret(); + &function_end_B("_aesni_${p}rypt6"); +} +&aesni_generate2("enc") if ($PREFIX eq "aesni"); +&aesni_generate2("dec"); +&aesni_generate3("enc") if ($PREFIX eq "aesni"); +&aesni_generate3("dec"); +&aesni_generate4("enc") if ($PREFIX eq "aesni"); +&aesni_generate4("dec"); +&aesni_generate6("enc") if ($PREFIX eq "aesni"); +&aesni_generate6("dec"); + +if ($PREFIX eq "aesni") { 
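+# The mode drivers below stash $key and $rounds in $key_ and $rounds_ +# around calls to the shared _aesni_[en|de]cryptN helpers, because the +# interleaved subroutines advance $key and count $rounds down as they +# encrypt.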
+###################################################################### +# void aesni_ecb_encrypt (const void *in, void *out, +# size_t length, const AES_KEY *key, +# int enc); +&function_begin("aesni_ecb_encrypt"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &and ($len,-16); + &jz (&label("ecb_ret")); + &mov ($rounds,&DWP(240,$key)); + &test ($rounds_,$rounds_); + &jz (&label("ecb_decrypt")); + + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + &cmp ($len,0x60); + &jb (&label("ecb_enc_tail")); + + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &sub ($len,0x60); + &jmp (&label("ecb_enc_loop6_enter")); + +&set_label("ecb_enc_loop6",16); + &movups (&QWP(0,$out),$inout0); + &movdqu ($inout0,&QWP(0,$inp)); + &movups (&QWP(0x10,$out),$inout1); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movups (&QWP(0x30,$out),$inout3); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movups (&QWP(0x40,$out),$inout4); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); +&set_label("ecb_enc_loop6_enter"); + + &call ("_aesni_encrypt6"); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + &sub ($len,0x60); + &jnc (&label("ecb_enc_loop6")); + + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &add ($len,0x60); + &jz (&label("ecb_ret")); + +&set_label("ecb_enc_tail"); + &movups ($inout0,&QWP(0,$inp)); + &cmp ($len,0x20); + &jb (&label("ecb_enc_one")); + &movups ($inout1,&QWP(0x10,$inp)); + &je (&label("ecb_enc_two")); + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x40); + &jb (&label("ecb_enc_three")); + &movups ($inout3,&QWP(0x30,$inp)); + &je (&label("ecb_enc_four")); + &movups ($inout4,&QWP(0x40,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_encrypt6"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + jmp (&label("ecb_ret")); + +&set_label("ecb_enc_one",16); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &movups (&QWP(0,$out),$inout0); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_two",16); + &call ("_aesni_encrypt2"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_three",16); + &call ("_aesni_encrypt3"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ecb_ret")); + +&set_label("ecb_enc_four",16); + &call ("_aesni_encrypt4"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &jmp (&label("ecb_ret")); +###################################################################### +&set_label("ecb_decrypt",16); + &mov ($key_,$key); # backup $key + &mov 
($rounds_,$rounds); # backup $rounds + &cmp ($len,0x60); + &jb (&label("ecb_dec_tail")); + + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &sub ($len,0x60); + &jmp (&label("ecb_dec_loop6_enter")); + +&set_label("ecb_dec_loop6",16); + &movups (&QWP(0,$out),$inout0); + &movdqu ($inout0,&QWP(0,$inp)); + &movups (&QWP(0x10,$out),$inout1); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movups (&QWP(0x30,$out),$inout3); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movups (&QWP(0x40,$out),$inout4); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &movdqu ($inout5,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); +&set_label("ecb_dec_loop6_enter"); + + &call ("_aesni_decrypt6"); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + &sub ($len,0x60); + &jnc (&label("ecb_dec_loop6")); + + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + &add ($len,0x60); + &jz (&label("ecb_ret")); + +&set_label("ecb_dec_tail"); + &movups ($inout0,&QWP(0,$inp)); + &cmp ($len,0x20); + &jb (&label("ecb_dec_one")); + &movups ($inout1,&QWP(0x10,$inp)); + &je (&label("ecb_dec_two")); + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x40); + &jb (&label("ecb_dec_three")); + &movups ($inout3,&QWP(0x30,$inp)); + &je (&label("ecb_dec_four")); + &movups ($inout4,&QWP(0x40,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_decrypt6"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_one",16); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &movups (&QWP(0,$out),$inout0); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_two",16); + &call ("_aesni_decrypt2"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_three",16); + &call ("_aesni_decrypt3"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ecb_ret")); + +&set_label("ecb_dec_four",16); + &call ("_aesni_decrypt4"); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + +&set_label("ecb_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ecb_encrypt"); + +###################################################################### +# void aesni_ccm64_[en|de]crypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec,char *cmac); +# +# Handles only complete blocks, operates on 64-bit counter and +# does not update *ivec! 
Nor does it finalize CMAC value +# (see engine/eng_aesni.c for details) +# +{ my $cmac=$inout1; +&function_begin("aesni_ccm64_encrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($rounds,&wparam(5)); + &mov ($key_,"esp"); + &sub ("esp",60); + &and ("esp",-16); # align stack + &mov (&DWP(48,"esp"),$key_); + + &movdqu ($ivec,&QWP(0,$rounds_)); # load ivec + &movdqu ($cmac,&QWP(0,$rounds)); # load cmac + &mov ($rounds,&DWP(240,$key)); + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds_,1); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds_); + &mov (&DWP(20,"esp"),$key_); + &mov (&DWP(24,"esp"),$key_); + &mov (&DWP(28,"esp"),$key_); + + &shl ($rounds,4); + &mov ($rounds_,16); + &lea ($key_,&DWP(0,$key)); + &movdqa ($inout3,&QWP(0,"esp")); + &movdqa ($inout0,$ivec); + &lea ($key,&DWP(32,$key,$rounds)); + &sub ($rounds_,$rounds); + &pshufb ($ivec,$inout3); + +&set_label("ccm64_enc_outer"); + &$movekey ($rndkey0,&QWP(0,$key_)); + &mov ($rounds,$rounds_); + &movups ($in0,&QWP(0,$inp)); + + &xorps ($inout0,$rndkey0); + &$movekey ($rndkey1,&QWP(16,$key_)); + &xorps ($rndkey0,$in0); + &xorps ($cmac,$rndkey0); # cmac^=inp + &$movekey ($rndkey0,&QWP(32,$key_)); + +&set_label("ccm64_enc2_loop"); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + &aesenc ($inout0,$rndkey0); + &aesenc ($cmac,$rndkey0); + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("ccm64_enc2_loop")); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &paddq ($ivec,&QWP(16,"esp")); + &dec ($len); + &aesenclast ($inout0,$rndkey0); + &aesenclast ($cmac,$rndkey0); + + &lea ($inp,&DWP(16,$inp)); + &xorps ($in0,$inout0); # inp^=E(ivec) + &movdqa ($inout0,$ivec); + &movups (&QWP(0,$out),$in0); # save output + &pshufb ($inout0,$inout3); + &lea ($out,&DWP(16,$out)); + &jnz (&label("ccm64_enc_outer")); + + &mov ("esp",&DWP(48,"esp")); + &mov ($out,&wparam(5)); + &movups (&QWP(0,$out),$cmac); + + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ccm64_encrypt_blocks"); + +&function_begin("aesni_ccm64_decrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($rounds,&wparam(5)); + &mov ($key_,"esp"); + &sub ("esp",60); + &and ("esp",-16); # align stack + &mov (&DWP(48,"esp"),$key_); + + &movdqu ($ivec,&QWP(0,$rounds_)); # load ivec + &movdqu ($cmac,&QWP(0,$rounds)); # load cmac + &mov ($rounds,&DWP(240,$key)); + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds_,1); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds_); + &mov (&DWP(20,"esp"),$key_); + &mov (&DWP(24,"esp"),$key_); + &mov (&DWP(28,"esp"),$key_); + + &movdqa ($inout3,&QWP(0,"esp")); # bswap mask + &movdqa ($inout0,$ivec); + + &mov ($key_,$key); + &mov ($rounds_,$rounds); + + 
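+ # Editor's note, mode-level illustration (not generated code): CCM is
+ # CTR encryption interleaved with a CBC-MAC over the *plaintext*, so on
+ # decrypt the keystream block E(ctr) must be produced before the MAC can
+ # absorb the recovered plaintext - hence the lone single-block encryption
+ # of the counter right below, before the interleaved two-track loop.  A
+ # sketch with a caller-supplied 16-byte block cipher $E (all names
+ # hypothetical; counter wrap-around ignored):
+ sub demo_ccm64_decrypt {
+     my ($E, $ctr, $cmac, @ct) = @_;         # 16-byte strings throughout
+     my @pt;
+     for my $c (@ct) {
+         my $p = $c ^ $E->($ctr);            # CTR part first...
+         $cmac = $E->($cmac ^ $p);           # ...then CBC-MAC over plaintext
+         my ($hi, $lo) = unpack "Q> Q>", $ctr;
+         $ctr = pack "Q> Q>", $hi, $lo + 1;  # 64-bit big-endian counter
+         push @pt, $p;
+     }
+     return ($cmac, @pt);                    # caller finalizes the CMAC tag
+ }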
&pshufb ($ivec,$inout3); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &shl ($rounds_,4); + &mov ($rounds,16); + &movups ($in0,&QWP(0,$inp)); # load inp + &paddq ($ivec,&QWP(16,"esp")); + &lea ($inp,&QWP(16,$inp)); + &sub ($rounds,$rounds_); + &lea ($key,&DWP(32,$key_,$rounds_)); + &mov ($rounds_,$rounds); + &jmp (&label("ccm64_dec_outer")); + +&set_label("ccm64_dec_outer",16); + &xorps ($in0,$inout0); # inp ^= E(ivec) + &movdqa ($inout0,$ivec); + &movups (&QWP(0,$out),$in0); # save output + &lea ($out,&DWP(16,$out)); + &pshufb ($inout0,$inout3); + + &sub ($len,1); + &jz (&label("ccm64_dec_break")); + + &$movekey ($rndkey0,&QWP(0,$key_)); + &mov ($rounds,$rounds_); + &$movekey ($rndkey1,&QWP(16,$key_)); + &xorps ($in0,$rndkey0); + &xorps ($inout0,$rndkey0); + &xorps ($cmac,$in0); # cmac^=out + &$movekey ($rndkey0,&QWP(32,$key_)); + +&set_label("ccm64_dec2_loop"); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &$movekey ($rndkey1,&QWP(0,$key,$rounds)); + &add ($rounds,32); + &aesenc ($inout0,$rndkey0); + &aesenc ($cmac,$rndkey0); + &$movekey ($rndkey0,&QWP(-16,$key,$rounds)); + &jnz (&label("ccm64_dec2_loop")); + &movups ($in0,&QWP(0,$inp)); # load inp + &paddq ($ivec,&QWP(16,"esp")); + &aesenc ($inout0,$rndkey1); + &aesenc ($cmac,$rndkey1); + &aesenclast ($inout0,$rndkey0); + &aesenclast ($cmac,$rndkey0); + &lea ($inp,&QWP(16,$inp)); + &jmp (&label("ccm64_dec_outer")); + +&set_label("ccm64_dec_break",16); + &mov ($rounds,&DWP(240,$key_)); + &mov ($key,$key_); + if ($inline) + { &aesni_inline_generate1("enc",$cmac,$in0); } + else + { &call ("_aesni_encrypt1",$cmac); } + + &mov ("esp",&DWP(48,"esp")); + &mov ($out,&wparam(5)); + &movups (&QWP(0,$out),$cmac); + + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &pxor ("xmm6","xmm6"); + &pxor ("xmm7","xmm7"); +&function_end("aesni_ccm64_decrypt_blocks"); +} + +###################################################################### +# void aesni_ctr32_encrypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec); +# +# Handles only complete blocks, operates on 32-bit counter and +# does not update *ivec! 
(see crypto/modes/ctr128.c for details) +# +# stack layout: +# 0 pshufb mask +# 16 vector addend: 0,6,6,6 +# 32 counter-less ivec +# 48 1st triplet of counter vector +# 64 2nd triplet of counter vector +# 80 saved %esp + +&function_begin("aesni_ctr32_encrypt_blocks"); + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); + &mov ($rounds_,&wparam(4)); + &mov ($key_,"esp"); + &sub ("esp",88); + &and ("esp",-16); # align stack + &mov (&DWP(80,"esp"),$key_); + + &cmp ($len,1); + &je (&label("ctr32_one_shortcut")); + + &movdqu ($inout5,&QWP(0,$rounds_)); # load ivec + + # compose byte-swap control mask for pshufb on stack + &mov (&DWP(0,"esp"),0x0c0d0e0f); + &mov (&DWP(4,"esp"),0x08090a0b); + &mov (&DWP(8,"esp"),0x04050607); + &mov (&DWP(12,"esp"),0x00010203); + + # compose counter increment vector on stack + &mov ($rounds,6); + &xor ($key_,$key_); + &mov (&DWP(16,"esp"),$rounds); + &mov (&DWP(20,"esp"),$rounds); + &mov (&DWP(24,"esp"),$rounds); + &mov (&DWP(28,"esp"),$key_); + + &pextrd ($rounds_,$inout5,3); # pull 32-bit counter + &pinsrd ($inout5,$key_,3); # wipe 32-bit counter + + &mov ($rounds,&DWP(240,$key)); # key->rounds + + # compose 2 vectors of 3x32-bit counters + &bswap ($rounds_); + &pxor ($rndkey0,$rndkey0); + &pxor ($rndkey1,$rndkey1); + &movdqa ($inout0,&QWP(0,"esp")); # load byte-swap mask + &pinsrd ($rndkey0,$rounds_,0); + &lea ($key_,&DWP(3,$rounds_)); + &pinsrd ($rndkey1,$key_,0); + &inc ($rounds_); + &pinsrd ($rndkey0,$rounds_,1); + &inc ($key_); + &pinsrd ($rndkey1,$key_,1); + &inc ($rounds_); + &pinsrd ($rndkey0,$rounds_,2); + &inc ($key_); + &pinsrd ($rndkey1,$key_,2); + &movdqa (&QWP(48,"esp"),$rndkey0); # save 1st triplet + &pshufb ($rndkey0,$inout0); # byte swap + &movdqu ($inout4,&QWP(0,$key)); # key[0] + &movdqa (&QWP(64,"esp"),$rndkey1); # save 2nd triplet + &pshufb ($rndkey1,$inout0); # byte swap + + &pshufd ($inout0,$rndkey0,3<<6); # place counter to upper dword + &pshufd ($inout1,$rndkey0,2<<6); + &cmp ($len,6); + &jb (&label("ctr32_tail")); + &pxor ($inout5,$inout4); # counter-less ivec^key[0] + &shl ($rounds,4); + &mov ($rounds_,16); + &movdqa (&QWP(32,"esp"),$inout5); # save counter-less ivec^key[0] + &mov ($key_,$key); # backup $key + &sub ($rounds_,$rounds); # backup twisted $rounds + &lea ($key,&DWP(32,$key,$rounds)); + &sub ($len,6); + &jmp (&label("ctr32_loop6")); + +&set_label("ctr32_loop6",16); + # inlining _aesni_encrypt6's prologue gives ~6% improvement... 
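+ # Editor's note, illustration: the two triplet registers each hold three
+ # byte-swapped 32-bit counters in dwords 3..1 and keep dword 0 zero, so
+ # pshufd with selector 3<<6 / 2<<6 / 1<<6 drops counter n / n+1 / n+2
+ # into the top dword and zeroes the low 96 bits; the pxor with the saved
+ # "counter-less ivec^key[0]" then fills those bits and applies round 0
+ # in the same shot.  String-level equivalent for one block (sub name
+ # hypothetical):
+ sub demo_compose_ctr_block {
+     my ($iv_no_ctr, $ctr32) = @_;  # 16-byte IV, last 4 bytes (counter) zero
+     return $iv_no_ctr | ("\0" x 12) . pack("N", $ctr32 & 0xffffffff);
+ }
+ # six such blocks per iteration, counters n..n+5; afterwards the saved
+ # unswapped triplets are advanced by the (6,6,6,0) addend via paddd.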
+ &pshufd ($inout2,$rndkey0,1<<6); + &movdqa ($rndkey0,&QWP(32,"esp")); # pull counter-less ivec + &pshufd ($inout3,$rndkey1,3<<6); + &pxor ($inout0,$rndkey0); # merge counter-less ivec + &pshufd ($inout4,$rndkey1,2<<6); + &pxor ($inout1,$rndkey0); + &pshufd ($inout5,$rndkey1,1<<6); + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout2,$rndkey0); + &pxor ($inout3,$rndkey0); + &aesenc ($inout0,$rndkey1); + &pxor ($inout4,$rndkey0); + &pxor ($inout5,$rndkey0); + &aesenc ($inout1,$rndkey1); + &$movekey ($rndkey0,&QWP(32,$key_)); + &mov ($rounds,$rounds_); + &aesenc ($inout2,$rndkey1); + &aesenc ($inout3,$rndkey1); + &aesenc ($inout4,$rndkey1); + &aesenc ($inout5,$rndkey1); + + &call (&label("_aesni_encrypt6_enter")); + + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout1,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &movdqa ($rndkey0,&QWP(16,"esp")); # load increment + &xorps ($inout2,$rndkey1); + &movdqa ($rndkey1,&QWP(64,"esp")); # load 2nd triplet + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + + &paddd ($rndkey1,$rndkey0); # 2nd triplet increment + &paddd ($rndkey0,&QWP(48,"esp")); # 1st triplet increment + &movdqa ($inout0,&QWP(0,"esp")); # load byte swap mask + + &movups ($inout1,&QWP(0x30,$inp)); + &movups ($inout2,&QWP(0x40,$inp)); + &xorps ($inout3,$inout1); + &movups ($inout1,&QWP(0x50,$inp)); + &lea ($inp,&DWP(0x60,$inp)); + &movdqa (&QWP(48,"esp"),$rndkey0); # save 1st triplet + &pshufb ($rndkey0,$inout0); # byte swap + &xorps ($inout4,$inout2); + &movups (&QWP(0x30,$out),$inout3); + &xorps ($inout5,$inout1); + &movdqa (&QWP(64,"esp"),$rndkey1); # save 2nd triplet + &pshufb ($rndkey1,$inout0); # byte swap + &movups (&QWP(0x40,$out),$inout4); + &pshufd ($inout0,$rndkey0,3<<6); + &movups (&QWP(0x50,$out),$inout5); + &lea ($out,&DWP(0x60,$out)); + + &pshufd ($inout1,$rndkey0,2<<6); + &sub ($len,6); + &jnc (&label("ctr32_loop6")); + + &add ($len,6); + &jz (&label("ctr32_ret")); + &movdqu ($inout5,&QWP(0,$key_)); + &mov ($key,$key_); + &pxor ($inout5,&QWP(32,"esp")); # restore count-less ivec + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + +&set_label("ctr32_tail"); + &por ($inout0,$inout5); + &cmp ($len,2); + &jb (&label("ctr32_one")); + + &pshufd ($inout2,$rndkey0,1<<6); + &por ($inout1,$inout5); + &je (&label("ctr32_two")); + + &pshufd ($inout3,$rndkey1,3<<6); + &por ($inout2,$inout5); + &cmp ($len,4); + &jb (&label("ctr32_three")); + + &pshufd ($inout4,$rndkey1,2<<6); + &por ($inout3,$inout5); + &je (&label("ctr32_four")); + + &por ($inout4,$inout5); + &call ("_aesni_encrypt6"); + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout1,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout2,$rndkey1); + &movups ($rndkey1,&QWP(0x40,$inp)); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout4,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + &movups (&QWP(0x40,$out),$inout4); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_one_shortcut",16); + &movups ($inout0,&QWP(0,$rounds_)); # load ivec + &mov ($rounds,&DWP(240,$key)); + +&set_label("ctr32_one"); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &movups ($in0,&QWP(0,$inp)); + &xorps ($in0,$inout0); + &movups (&QWP(0,$out),$in0); + 
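+ # Editor's note: this one-block tail is plain CTR, out = in XOR E(iv);
+ # with a caller-supplied block cipher $E (hypothetical, as above):
+ sub demo_ctr32_one { my ($E, $iv, $in) = @_; return $in ^ $E->($iv); }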
&jmp (&label("ctr32_ret")); + +&set_label("ctr32_two",16); + &call ("_aesni_encrypt2"); + &movups ($inout3,&QWP(0,$inp)); + &movups ($inout4,&QWP(0x10,$inp)); + &xorps ($inout0,$inout3); + &xorps ($inout1,$inout4); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_three",16); + &call ("_aesni_encrypt3"); + &movups ($inout3,&QWP(0,$inp)); + &movups ($inout4,&QWP(0x10,$inp)); + &xorps ($inout0,$inout3); + &movups ($inout5,&QWP(0x20,$inp)); + &xorps ($inout1,$inout4); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$inout5); + &movups (&QWP(0x10,$out),$inout1); + &movups (&QWP(0x20,$out),$inout2); + &jmp (&label("ctr32_ret")); + +&set_label("ctr32_four",16); + &call ("_aesni_encrypt4"); + &movups ($inout4,&QWP(0,$inp)); + &movups ($inout5,&QWP(0x10,$inp)); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout0,$inout4); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout1,$inout5); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0x20,$out),$inout2); + &movups (&QWP(0x30,$out),$inout3); + +&set_label("ctr32_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(32,"esp"),"xmm0"); # clear stack + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(48,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(64,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &mov ("esp",&DWP(80,"esp")); +&function_end("aesni_ctr32_encrypt_blocks"); + +###################################################################### +# void aesni_xts_[en|de]crypt(const char *inp,char *out,size_t len, +# const AES_KEY *key1, const AES_KEY *key2 +# const unsigned char iv[16]); +# +{ my ($tweak,$twtmp,$twres,$twmask)=($rndkey1,$rndkey0,$inout0,$inout1); + +&function_begin("aesni_xts_encrypt"); + &mov ($key,&wparam(4)); # key2 + &mov ($inp,&wparam(5)); # clear-text tweak + + &mov ($rounds,&DWP(240,$key)); # key2->rounds + &movups ($inout0,&QWP(0,$inp)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); # key1 + + &mov ($key_,"esp"); + &sub ("esp",16*7+8); + &mov ($rounds,&DWP(240,$key)); # key1->rounds + &and ("esp",-16); # align stack + + &mov (&DWP(16*6+0,"esp"),0x87); # compose the magic constant + &mov (&DWP(16*6+4,"esp"),0); + &mov (&DWP(16*6+8,"esp"),1); + &mov (&DWP(16*6+12,"esp"),0); + &mov (&DWP(16*7+0,"esp"),$len); # save original $len + &mov (&DWP(16*7+4,"esp"),$key_); # save original %esp + + &movdqa ($tweak,$inout0); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(6*16,"esp")); # 0x0...010...87 + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + + &and ($len,-16); + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + &sub ($len,16*6); + &jc (&label("xts_enc_short")); + + &shl ($rounds,4); + &mov ($rounds_,16); + &sub ($rounds_,$rounds); + &lea ($key,&DWP(32,$key,$rounds)); + &jmp (&label("xts_enc_loop6")); + +&set_label("xts_enc_loop6",16); + for ($i=0;$i<4;$i++) { + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa (&QWP(16*$i,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + } + &pshufd ($inout5,$twtmp,0x13); + 
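+ # Editor's note, illustration: every tweak update in this loop is a
+ # multiplication by x in GF(2^128).  SSE2 has no 128-bit shift, so paddq
+ # doubles the two 64-bit halves separately, and the pcmpgtd
+ # sign-broadcast, pshufd(0x13) and pand with the (0x87,0,1,0) mask
+ # recreate the two lost carries: bit 63 into the high half, bit 127
+ # folded back into the low byte as the reduction polynomial 0x87.
+ # Whole-width equivalent (Math::BigInt is core Perl; sub name
+ # hypothetical):
+ use Math::BigInt;
+ sub demo_xts_next_tweak {
+     my $t = shift->copy->blsft(1);                # tweak << 1
+     my $carry = !$t->copy->brsft(128)->is_zero;   # former bit 127
+     $t->band(Math::BigInt->new(1)->blsft(128)->bsub(1));  # keep 128 bits
+     $t->bxor(Math::BigInt->new(0x87)) if $carry;  # mod x^128+x^7+x^2+x+1
+     return $t;
+ }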
&movdqa (&QWP(16*$i++,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &$movekey ($rndkey0,&QWP(0,$key_)); + &pand ($inout5,$twmask); # isolate carry and residue + &movups ($inout0,&QWP(0,$inp)); # load input + &pxor ($inout5,$tweak); + + # inline _aesni_encrypt6 prologue and flip xor with tweak and key[0] + &mov ($rounds,$rounds_); # restore $rounds + &movdqu ($inout1,&QWP(16*1,$inp)); + &xorps ($inout0,$rndkey0); # input^=rndkey[0] + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout1,$rndkey0); + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout2,$rndkey0); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout3,$rndkey0); + &movdqu ($rndkey1,&QWP(16*5,$inp)); + &pxor ($inout4,$rndkey0); + &lea ($inp,&DWP(16*6,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqa (&QWP(16*$i,"esp"),$inout5); # save last tweak + &pxor ($inout5,$rndkey1); + + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout1,&QWP(16*1,"esp")); + &pxor ($inout2,&QWP(16*2,"esp")); + &aesenc ($inout0,$rndkey1); + &pxor ($inout3,&QWP(16*3,"esp")); + &pxor ($inout4,&QWP(16*4,"esp")); + &aesenc ($inout1,$rndkey1); + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key_)); + &aesenc ($inout2,$rndkey1); + &aesenc ($inout3,$rndkey1); + &aesenc ($inout4,$rndkey1); + &aesenc ($inout5,$rndkey1); + &call (&label("_aesni_encrypt6_enter")); + + &movdqa ($tweak,&QWP(16*5,"esp")); # last tweak + &pxor ($twtmp,$twtmp); + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &xorps ($inout1,&QWP(16*1,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*2,$out),$inout2); + &xorps ($inout4,&QWP(16*4,"esp")); + &movups (&QWP(16*3,$out),$inout3); + &xorps ($inout5,$tweak); + &movups (&QWP(16*4,$out),$inout4); + &pshufd ($twres,$twtmp,0x13); + &movups (&QWP(16*5,$out),$inout5); + &lea ($out,&DWP(16*6,$out)); + &movdqa ($twmask,&QWP(16*6,"esp")); # 0x0...010...87 + + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + + &sub ($len,16*6); + &jnc (&label("xts_enc_loop6")); + + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + &mov ($key,$key_); # restore $key + &mov ($rounds_,$rounds); + +&set_label("xts_enc_short"); + &add ($len,16*6); + &jz (&label("xts_enc_done6x")); + + &movdqa ($inout3,$tweak); # put aside previous tweak + &cmp ($len,0x20); + &jb (&label("xts_enc_one")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &je (&label("xts_enc_two")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &cmp ($len,0x40); + &jb (&label("xts_enc_three")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout5,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &movdqa 
(&QWP(16*0,"esp"),$inout3); + &movdqa (&QWP(16*1,"esp"),$inout4); + &je (&label("xts_enc_four")); + + &movdqa (&QWP(16*2,"esp"),$inout5); + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*3,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($inout0,1); + &pand ($inout5,$twmask); # isolate carry and residue + &pxor ($inout5,$tweak); + + &movdqu ($inout0,&QWP(16*0,$inp)); # load input + &movdqu ($inout1,&QWP(16*1,$inp)); + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout1,&QWP(16*1,"esp")); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout2,&QWP(16*2,"esp")); + &lea ($inp,&DWP(16*5,$inp)); + &pxor ($inout3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*4,"esp"),$inout5); # save last tweak + &pxor ($inout4,$inout5); + + &call ("_aesni_encrypt6"); + + &movaps ($tweak,&QWP(16*4,"esp")); # last tweak + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout4,$tweak); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &movups (&QWP(16*4,$out),$inout4); + &lea ($out,&DWP(16*5,$out)); + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_one",16); + &movups ($inout0,&QWP(16*0,$inp)); # load input + &lea ($inp,&DWP(16*1,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(16*0,$out),$inout0); # write output + &lea ($out,&DWP(16*1,$out)); + + &movdqa ($tweak,$inout3); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_two",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &lea ($inp,&DWP(16*2,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + + &call ("_aesni_encrypt2"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &lea ($out,&DWP(16*2,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_three",16); + &movaps ($inout5,$tweak); # put aside last tweak + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &lea ($inp,&DWP(16*3,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + + &call ("_aesni_encrypt3"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &lea ($out,&DWP(16*3,$out)); + + &movdqa ($tweak,$inout5); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_four",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &xorps ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movups ($inout3,&QWP(16*3,$inp)); + &lea ($inp,&DWP(16*4,$inp)); + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &xorps ($inout3,$inout4); + + &call ("_aesni_encrypt4"); + + &xorps 
($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,$inout4); + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &lea ($out,&DWP(16*4,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_enc_done")); + +&set_label("xts_enc_done6x",16); # $tweak is pre-calculated + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &and ($len,15); + &jz (&label("xts_enc_ret")); + &movdqa ($inout3,$tweak); + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &jmp (&label("xts_enc_steal")); + +&set_label("xts_enc_done",16); + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &pxor ($twtmp,$twtmp); + &and ($len,15); + &jz (&label("xts_enc_ret")); + + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &pshufd ($inout3,$twtmp,0x13); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($inout3,&QWP(16*6,"esp")); # isolate carry and residue + &pxor ($inout3,$tweak); + +&set_label("xts_enc_steal"); + &movz ($rounds,&BP(0,$inp)); + &movz ($key,&BP(-16,$out)); + &lea ($inp,&DWP(1,$inp)); + &mov (&BP(-16,$out),&LB($rounds)); + &mov (&BP(0,$out),&LB($key)); + &lea ($out,&DWP(1,$out)); + &sub ($len,1); + &jnz (&label("xts_enc_steal")); + + &sub ($out,&DWP(16*7+0,"esp")); # rewind $out + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(-16,$out)); # load input + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(-16,$out),$inout0); # write output + +&set_label("xts_enc_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &movdqa (&QWP(16*0,"esp"),"xmm0"); # clear stack + &pxor ("xmm3","xmm3"); + &movdqa (&QWP(16*1,"esp"),"xmm0"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(16*2,"esp"),"xmm0"); + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(16*3,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(16*4,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &movdqa (&QWP(16*5,"esp"),"xmm0"); + &mov ("esp",&DWP(16*7+4,"esp")); # restore %esp +&function_end("aesni_xts_encrypt"); + +&function_begin("aesni_xts_decrypt"); + &mov ($key,&wparam(4)); # key2 + &mov ($inp,&wparam(5)); # clear-text tweak + + &mov ($rounds,&DWP(240,$key)); # key2->rounds + &movups ($inout0,&QWP(0,$inp)); + if ($inline) + { &aesni_inline_generate1("enc"); } + else + { &call ("_aesni_encrypt1"); } + + &mov ($inp,&wparam(0)); + &mov ($out,&wparam(1)); + &mov ($len,&wparam(2)); + &mov ($key,&wparam(3)); # key1 + + &mov ($key_,"esp"); + &sub ("esp",16*7+8); + &and ("esp",-16); # align stack + + &xor ($rounds_,$rounds_); # if(len%16) len-=16; + &test ($len,15); + &setnz (&LB($rounds_)); + &shl ($rounds_,4); + &sub ($len,$rounds_); + + &mov (&DWP(16*6+0,"esp"),0x87); # compose the magic constant + &mov (&DWP(16*6+4,"esp"),0); + &mov (&DWP(16*6+8,"esp"),1); + &mov (&DWP(16*6+12,"esp"),0); + &mov (&DWP(16*7+0,"esp"),$len); # save original $len + &mov (&DWP(16*7+4,"esp"),$key_); # save original %esp + + &mov ($rounds,&DWP(240,$key)); # key1->rounds + &mov ($key_,$key); # backup $key + &mov ($rounds_,$rounds); # backup $rounds + + &movdqa ($tweak,$inout0); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(6*16,"esp")); # 0x0...010...87 + &pcmpgtd($twtmp,$tweak); # 
broadcast upper bits + + &and ($len,-16); + &sub ($len,16*6); + &jc (&label("xts_dec_short")); + + &shl ($rounds,4); + &mov ($rounds_,16); + &sub ($rounds_,$rounds); + &lea ($key,&DWP(32,$key,$rounds)); + &jmp (&label("xts_dec_loop6")); + +&set_label("xts_dec_loop6",16); + for ($i=0;$i<4;$i++) { + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa (&QWP(16*$i,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + } + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*$i++,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &$movekey ($rndkey0,&QWP(0,$key_)); + &pand ($inout5,$twmask); # isolate carry and residue + &movups ($inout0,&QWP(0,$inp)); # load input + &pxor ($inout5,$tweak); + + # inline _aesni_encrypt6 prologue and flip xor with tweak and key[0] + &mov ($rounds,$rounds_); + &movdqu ($inout1,&QWP(16*1,$inp)); + &xorps ($inout0,$rndkey0); # input^=rndkey[0] + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout1,$rndkey0); + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout2,$rndkey0); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout3,$rndkey0); + &movdqu ($rndkey1,&QWP(16*5,$inp)); + &pxor ($inout4,$rndkey0); + &lea ($inp,&DWP(16*6,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqa (&QWP(16*$i,"esp"),$inout5); # save last tweak + &pxor ($inout5,$rndkey1); + + &$movekey ($rndkey1,&QWP(16,$key_)); + &pxor ($inout1,&QWP(16*1,"esp")); + &pxor ($inout2,&QWP(16*2,"esp")); + &aesdec ($inout0,$rndkey1); + &pxor ($inout3,&QWP(16*3,"esp")); + &pxor ($inout4,&QWP(16*4,"esp")); + &aesdec ($inout1,$rndkey1); + &pxor ($inout5,$rndkey0); + &$movekey ($rndkey0,&QWP(32,$key_)); + &aesdec ($inout2,$rndkey1); + &aesdec ($inout3,$rndkey1); + &aesdec ($inout4,$rndkey1); + &aesdec ($inout5,$rndkey1); + &call (&label("_aesni_decrypt6_enter")); + + &movdqa ($tweak,&QWP(16*5,"esp")); # last tweak + &pxor ($twtmp,$twtmp); + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &pcmpgtd ($twtmp,$tweak); # broadcast upper bits + &xorps ($inout1,&QWP(16*1,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*2,$out),$inout2); + &xorps ($inout4,&QWP(16*4,"esp")); + &movups (&QWP(16*3,$out),$inout3); + &xorps ($inout5,$tweak); + &movups (&QWP(16*4,$out),$inout4); + &pshufd ($twres,$twtmp,0x13); + &movups (&QWP(16*5,$out),$inout5); + &lea ($out,&DWP(16*6,$out)); + &movdqa ($twmask,&QWP(16*6,"esp")); # 0x0...010...87 + + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + + &sub ($len,16*6); + &jnc (&label("xts_dec_loop6")); + + &mov ($rounds,&DWP(240,$key_)); # restore $rounds + &mov ($key,$key_); # restore $key + &mov ($rounds_,$rounds); + +&set_label("xts_dec_short"); + &add ($len,16*6); + &jz (&label("xts_dec_done6x")); + + &movdqa ($inout3,$tweak); # put aside previous tweak + &cmp ($len,0x20); + &jb (&label("xts_dec_one")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &je (&label("xts_dec_two")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + 
&movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &cmp ($len,0x40); + &jb (&label("xts_dec_three")); + + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($inout5,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + &movdqa (&QWP(16*0,"esp"),$inout3); + &movdqa (&QWP(16*1,"esp"),$inout4); + &je (&label("xts_dec_four")); + + &movdqa (&QWP(16*2,"esp"),$inout5); + &pshufd ($inout5,$twtmp,0x13); + &movdqa (&QWP(16*3,"esp"),$tweak); + &paddq ($tweak,$tweak); # &psllq($inout0,1); + &pand ($inout5,$twmask); # isolate carry and residue + &pxor ($inout5,$tweak); + + &movdqu ($inout0,&QWP(16*0,$inp)); # load input + &movdqu ($inout1,&QWP(16*1,$inp)); + &movdqu ($inout2,&QWP(16*2,$inp)); + &pxor ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movdqu ($inout3,&QWP(16*3,$inp)); + &pxor ($inout1,&QWP(16*1,"esp")); + &movdqu ($inout4,&QWP(16*4,$inp)); + &pxor ($inout2,&QWP(16*2,"esp")); + &lea ($inp,&DWP(16*5,$inp)); + &pxor ($inout3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*4,"esp"),$inout5); # save last tweak + &pxor ($inout4,$inout5); + + &call ("_aesni_decrypt6"); + + &movaps ($tweak,&QWP(16*4,"esp")); # last tweak + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,&QWP(16*2,"esp")); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,&QWP(16*3,"esp")); + &movups (&QWP(16*1,$out),$inout1); + &xorps ($inout4,$tweak); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &movups (&QWP(16*4,$out),$inout4); + &lea ($out,&DWP(16*5,$out)); + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_one",16); + &movups ($inout0,&QWP(16*0,$inp)); # load input + &lea ($inp,&DWP(16*1,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(16*0,$out),$inout0); # write output + &lea ($out,&DWP(16*1,$out)); + + &movdqa ($tweak,$inout3); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_two",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &lea ($inp,&DWP(16*2,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + + &call ("_aesni_decrypt2"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &lea ($out,&DWP(16*2,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_three",16); + &movaps ($inout5,$tweak); # put aside last tweak + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &lea ($inp,&DWP(16*3,$inp)); + &xorps ($inout0,$inout3); # input^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + + &call ("_aesni_decrypt3"); + + &xorps ($inout0,$inout3); # output^=tweak + &xorps ($inout1,$inout4); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); 
+ &lea ($out,&DWP(16*3,$out)); + + &movdqa ($tweak,$inout5); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_four",16); + &movaps ($inout4,$tweak); # put aside last tweak + + &movups ($inout0,&QWP(16*0,$inp)); # load input + &movups ($inout1,&QWP(16*1,$inp)); + &movups ($inout2,&QWP(16*2,$inp)); + &xorps ($inout0,&QWP(16*0,"esp")); # input^=tweak + &movups ($inout3,&QWP(16*3,$inp)); + &lea ($inp,&DWP(16*4,$inp)); + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &xorps ($inout3,$inout4); + + &call ("_aesni_decrypt4"); + + &xorps ($inout0,&QWP(16*0,"esp")); # output^=tweak + &xorps ($inout1,&QWP(16*1,"esp")); + &xorps ($inout2,$inout5); + &movups (&QWP(16*0,$out),$inout0); # write output + &xorps ($inout3,$inout4); + &movups (&QWP(16*1,$out),$inout1); + &movups (&QWP(16*2,$out),$inout2); + &movups (&QWP(16*3,$out),$inout3); + &lea ($out,&DWP(16*4,$out)); + + &movdqa ($tweak,$inout4); # last tweak + &jmp (&label("xts_dec_done")); + +&set_label("xts_dec_done6x",16); # $tweak is pre-calculated + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &and ($len,15); + &jz (&label("xts_dec_ret")); + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &jmp (&label("xts_dec_only_one_more")); + +&set_label("xts_dec_done",16); + &mov ($len,&DWP(16*7+0,"esp")); # restore original $len + &pxor ($twtmp,$twtmp); + &and ($len,15); + &jz (&label("xts_dec_ret")); + + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &mov (&DWP(16*7+0,"esp"),$len); # save $len%16 + &pshufd ($twres,$twtmp,0x13); + &pxor ($twtmp,$twtmp); + &movdqa ($twmask,&QWP(16*6,"esp")); + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($twres,$twmask); # isolate carry and residue + &pcmpgtd($twtmp,$tweak); # broadcast upper bits + &pxor ($tweak,$twres); + +&set_label("xts_dec_only_one_more"); + &pshufd ($inout3,$twtmp,0x13); + &movdqa ($inout4,$tweak); # put aside previous tweak + &paddq ($tweak,$tweak); # &psllq($tweak,1); + &pand ($inout3,$twmask); # isolate carry and residue + &pxor ($inout3,$tweak); + + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(0,$inp)); # load input + &xorps ($inout0,$inout3); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout3); # output^=tweak + &movups (&QWP(0,$out),$inout0); # write output + +&set_label("xts_dec_steal"); + &movz ($rounds,&BP(16,$inp)); + &movz ($key,&BP(0,$out)); + &lea ($inp,&DWP(1,$inp)); + &mov (&BP(0,$out),&LB($rounds)); + &mov (&BP(16,$out),&LB($key)); + &lea ($out,&DWP(1,$out)); + &sub ($len,1); + &jnz (&label("xts_dec_steal")); + + &sub ($out,&DWP(16*7+0,"esp")); # rewind $out + &mov ($key,$key_); # restore $key + &mov ($rounds,$rounds_); # restore $rounds + + &movups ($inout0,&QWP(0,$out)); # load input + &xorps ($inout0,$inout4); # input^=tweak + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$inout4); # output^=tweak + &movups (&QWP(0,$out),$inout0); # write output + +&set_label("xts_dec_ret"); + &pxor ("xmm0","xmm0"); # clear register bank + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &movdqa (&QWP(16*0,"esp"),"xmm0"); # clear stack + &pxor ("xmm3","xmm3"); + &movdqa (&QWP(16*1,"esp"),"xmm0"); + &pxor ("xmm4","xmm4"); + &movdqa (&QWP(16*2,"esp"),"xmm0"); + &pxor ("xmm5","xmm5"); + &movdqa (&QWP(16*3,"esp"),"xmm0"); + &pxor ("xmm6","xmm6"); + &movdqa (&QWP(16*4,"esp"),"xmm0"); + &pxor ("xmm7","xmm7"); + &movdqa 
(&QWP(16*5,"esp"),"xmm0"); + &mov ("esp",&DWP(16*7+4,"esp")); # restore %esp +&function_end("aesni_xts_decrypt"); +} +} + +###################################################################### +# void $PREFIX_cbc_encrypt (const void *inp, void *out, +# size_t length, const AES_KEY *key, +# unsigned char *ivp,const int enc); +&function_begin("${PREFIX}_cbc_encrypt"); + &mov ($inp,&wparam(0)); + &mov ($rounds_,"esp"); + &mov ($out,&wparam(1)); + &sub ($rounds_,24); + &mov ($len,&wparam(2)); + &and ($rounds_,-16); + &mov ($key,&wparam(3)); + &mov ($key_,&wparam(4)); + &test ($len,$len); + &jz (&label("cbc_abort")); + + &cmp (&wparam(5),0); + &xchg ($rounds_,"esp"); # alloca + &movups ($ivec,&QWP(0,$key_)); # load IV + &mov ($rounds,&DWP(240,$key)); + &mov ($key_,$key); # backup $key + &mov (&DWP(16,"esp"),$rounds_); # save original %esp + &mov ($rounds_,$rounds); # backup $rounds + &je (&label("cbc_decrypt")); + + &movaps ($inout0,$ivec); + &cmp ($len,16); + &jb (&label("cbc_enc_tail")); + &sub ($len,16); + &jmp (&label("cbc_enc_loop")); + +&set_label("cbc_enc_loop",16); + &movups ($ivec,&QWP(0,$inp)); # input actually + &lea ($inp,&DWP(16,$inp)); + if ($inline) + { &aesni_inline_generate1("enc",$inout0,$ivec); } + else + { &xorps($inout0,$ivec); &call("_aesni_encrypt1"); } + &mov ($rounds,$rounds_); # restore $rounds + &mov ($key,$key_); # restore $key + &movups (&QWP(0,$out),$inout0); # store output + &lea ($out,&DWP(16,$out)); + &sub ($len,16); + &jnc (&label("cbc_enc_loop")); + &add ($len,16); + &jnz (&label("cbc_enc_tail")); + &movaps ($ivec,$inout0); + &pxor ($inout0,$inout0); + &jmp (&label("cbc_ret")); + +&set_label("cbc_enc_tail"); + &mov ("ecx",$len); # zaps $rounds + &data_word(0xA4F3F689); # rep movsb + &mov ("ecx",16); # zero tail + &sub ("ecx",$len); + &xor ("eax","eax"); # zaps $len + &data_word(0xAAF3F689); # rep stosb + &lea ($out,&DWP(-16,$out)); # rewind $out by 1 block + &mov ($rounds,$rounds_); # restore $rounds + &mov ($inp,$out); # $inp and $out are the same + &mov ($key,$key_); # restore $key + &jmp (&label("cbc_enc_loop")); +###################################################################### +&set_label("cbc_decrypt",16); + &cmp ($len,0x50); + &jbe (&label("cbc_dec_tail")); + &movaps (&QWP(0,"esp"),$ivec); # save IV + &sub ($len,0x50); + &jmp (&label("cbc_dec_loop6_enter")); + +&set_label("cbc_dec_loop6",16); + &movaps (&QWP(0,"esp"),$rndkey0); # save IV + &movups (&QWP(0,$out),$inout5); + &lea ($out,&DWP(0x10,$out)); +&set_label("cbc_dec_loop6_enter"); + &movdqu ($inout0,&QWP(0,$inp)); + &movdqu ($inout1,&QWP(0x10,$inp)); + &movdqu ($inout2,&QWP(0x20,$inp)); + &movdqu ($inout3,&QWP(0x30,$inp)); + &movdqu ($inout4,&QWP(0x40,$inp)); + &movdqu ($inout5,&QWP(0x50,$inp)); + + &call ("_aesni_decrypt6"); + + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,&QWP(0,"esp")); # ^=IV + &xorps ($inout1,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout2,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout3,$rndkey1); + &movups ($rndkey1,&QWP(0x40,$inp)); + &xorps ($inout4,$rndkey0); + &movups ($rndkey0,&QWP(0x50,$inp)); # IV + &xorps ($inout5,$rndkey1); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &lea ($inp,&DWP(0x60,$inp)); + &movups (&QWP(0x20,$out),$inout2); + &mov ($rounds,$rounds_); # restore $rounds + &movups (&QWP(0x30,$out),$inout3); + &mov ($key,$key_); # restore $key + &movups (&QWP(0x40,$out),$inout4); + &lea ($out,&DWP(0x50,$out)); + &sub ($len,0x60); + &ja 
(&label("cbc_dec_loop6")); + + &movaps ($inout0,$inout5); + &movaps ($ivec,$rndkey0); + &add ($len,0x50); + &jle (&label("cbc_dec_clear_tail_collected")); + &movups (&QWP(0,$out),$inout0); + &lea ($out,&DWP(0x10,$out)); +&set_label("cbc_dec_tail"); + &movups ($inout0,&QWP(0,$inp)); + &movaps ($in0,$inout0); + &cmp ($len,0x10); + &jbe (&label("cbc_dec_one")); + + &movups ($inout1,&QWP(0x10,$inp)); + &movaps ($in1,$inout1); + &cmp ($len,0x20); + &jbe (&label("cbc_dec_two")); + + &movups ($inout2,&QWP(0x20,$inp)); + &cmp ($len,0x30); + &jbe (&label("cbc_dec_three")); + + &movups ($inout3,&QWP(0x30,$inp)); + &cmp ($len,0x40); + &jbe (&label("cbc_dec_four")); + + &movups ($inout4,&QWP(0x40,$inp)); + &movaps (&QWP(0,"esp"),$ivec); # save IV + &movups ($inout0,&QWP(0,$inp)); + &xorps ($inout5,$inout5); + &call ("_aesni_decrypt6"); + &movups ($rndkey1,&QWP(0,$inp)); + &movups ($rndkey0,&QWP(0x10,$inp)); + &xorps ($inout0,&QWP(0,"esp")); # ^= IV + &xorps ($inout1,$rndkey1); + &movups ($rndkey1,&QWP(0x20,$inp)); + &xorps ($inout2,$rndkey0); + &movups ($rndkey0,&QWP(0x30,$inp)); + &xorps ($inout3,$rndkey1); + &movups ($ivec,&QWP(0x40,$inp)); # IV + &xorps ($inout4,$rndkey0); + &movups (&QWP(0,$out),$inout0); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &movups (&QWP(0x20,$out),$inout2); + &pxor ($inout2,$inout2); + &movups (&QWP(0x30,$out),$inout3); + &pxor ($inout3,$inout3); + &lea ($out,&DWP(0x40,$out)); + &movaps ($inout0,$inout4); + &pxor ($inout4,$inout4); + &sub ($len,0x50); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_one",16); + if ($inline) + { &aesni_inline_generate1("dec"); } + else + { &call ("_aesni_decrypt1"); } + &xorps ($inout0,$ivec); + &movaps ($ivec,$in0); + &sub ($len,0x10); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_two",16); + &call ("_aesni_decrypt2"); + &xorps ($inout0,$ivec); + &xorps ($inout1,$in0); + &movups (&QWP(0,$out),$inout0); + &movaps ($inout0,$inout1); + &pxor ($inout1,$inout1); + &lea ($out,&DWP(0x10,$out)); + &movaps ($ivec,$in1); + &sub ($len,0x20); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_three",16); + &call ("_aesni_decrypt3"); + &xorps ($inout0,$ivec); + &xorps ($inout1,$in0); + &xorps ($inout2,$in1); + &movups (&QWP(0,$out),$inout0); + &movaps ($inout0,$inout2); + &pxor ($inout2,$inout2); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &lea ($out,&DWP(0x20,$out)); + &movups ($ivec,&QWP(0x20,$inp)); + &sub ($len,0x30); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_four",16); + &call ("_aesni_decrypt4"); + &movups ($rndkey1,&QWP(0x10,$inp)); + &movups ($rndkey0,&QWP(0x20,$inp)); + &xorps ($inout0,$ivec); + &movups ($ivec,&QWP(0x30,$inp)); + &xorps ($inout1,$in0); + &movups (&QWP(0,$out),$inout0); + &xorps ($inout2,$rndkey1); + &movups (&QWP(0x10,$out),$inout1); + &pxor ($inout1,$inout1); + &xorps ($inout3,$rndkey0); + &movups (&QWP(0x20,$out),$inout2); + &pxor ($inout2,$inout2); + &lea ($out,&DWP(0x30,$out)); + &movaps ($inout0,$inout3); + &pxor ($inout3,$inout3); + &sub ($len,0x40); + &jmp (&label("cbc_dec_tail_collected")); + +&set_label("cbc_dec_clear_tail_collected",16); + &pxor ($inout1,$inout1); + &pxor ($inout2,$inout2); + &pxor ($inout3,$inout3); + &pxor ($inout4,$inout4); +&set_label("cbc_dec_tail_collected"); + &and ($len,15); + &jnz (&label("cbc_dec_tail_partial")); + &movups (&QWP(0,$out),$inout0); + &pxor ($rndkey0,$rndkey0); + &jmp (&label("cbc_ret")); + +&set_label("cbc_dec_tail_partial",16); + &movaps 
(&QWP(0,"esp"),$inout0); + &pxor ($rndkey0,$rndkey0); + &mov ("ecx",16); + &mov ($inp,"esp"); + &sub ("ecx",$len); + &data_word(0xA4F3F689); # rep movsb + &movdqa (&QWP(0,"esp"),$inout0); + +&set_label("cbc_ret"); + &mov ("esp",&DWP(16,"esp")); # pull original %esp + &mov ($key_,&wparam(4)); + &pxor ($inout0,$inout0); + &pxor ($rndkey1,$rndkey1); + &movups (&QWP(0,$key_),$ivec); # output IV + &pxor ($ivec,$ivec); +&set_label("cbc_abort"); +&function_end("${PREFIX}_cbc_encrypt"); + +###################################################################### +# Mechanical port from aesni-x86_64.pl. +# +# _aesni_set_encrypt_key is private interface, +# input: +# "eax" const unsigned char *userKey +# $rounds int bits +# $key AES_KEY *key +# output: +# "eax" return code +# $round rounds + +&function_begin_B("_aesni_set_encrypt_key"); + &push ("ebp"); + &push ("ebx"); + &test ("eax","eax"); + &jz (&label("bad_pointer")); + &test ($key,$key); + &jz (&label("bad_pointer")); + + &call (&label("pic")); +&set_label("pic"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("key_const")."-".&label("pic"),"ebx")); + + &picmeup("ebp","OPENSSL_ia32cap_P","ebx",&label("key_const")); + &movups ("xmm0",&QWP(0,"eax")); # pull first 128 bits of *userKey + &xorps ("xmm4","xmm4"); # low dword of xmm4 is assumed 0 + &mov ("ebp",&DWP(4,"ebp")); + &lea ($key,&DWP(16,$key)); + &and ("ebp",1<<28|1<<11); # AVX and XOP bits + &cmp ($rounds,256); + &je (&label("14rounds")); + &cmp ($rounds,192); + &je (&label("12rounds")); + &cmp ($rounds,128); + &jne (&label("bad_keybits")); + +&set_label("10rounds",16); + &cmp ("ebp",1<<28); + &je (&label("10rounds_alt")); + + &mov ($rounds,9); + &$movekey (&QWP(-16,$key),"xmm0"); # round 0 + &aeskeygenassist("xmm1","xmm0",0x01); # round 1 + &call (&label("key_128_cold")); + &aeskeygenassist("xmm1","xmm0",0x2); # round 2 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x04); # round 3 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x08); # round 4 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x10); # round 5 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x20); # round 6 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x40); # round 7 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x80); # round 8 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x1b); # round 9 + &call (&label("key_128")); + &aeskeygenassist("xmm1","xmm0",0x36); # round 10 + &call (&label("key_128")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(80,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("key_128",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); +&set_label("key_128_cold"); + &shufps ("xmm4","xmm0",0b00010000); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &xorps ("xmm0","xmm4"); + &shufps ("xmm1","xmm1",0b11111111); # critical path + &xorps ("xmm0","xmm1"); + &ret(); + +&set_label("10rounds_alt",16); + &movdqa ("xmm5",&QWP(0x00,"ebx")); + &mov ($rounds,8); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &movdqa ("xmm2","xmm0"); + &movdqu (&QWP(-16,$key),"xmm0"); + +&set_label("loop_key128"); + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + &pslld ("xmm4",1); + &lea ($key,&DWP(16,$key)); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(-16,$key),"xmm0"); + &movdqa ("xmm2","xmm0"); + + &dec 
($rounds); + &jnz (&label("loop_key128")); + + &movdqa ("xmm4",&QWP(0x30,"ebx")); + + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + &pslld ("xmm4",1); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(0,$key),"xmm0"); + + &movdqa ("xmm2","xmm0"); + &pshufb ("xmm0","xmm5"); + &aesenclast ("xmm0","xmm4"); + + &movdqa ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm3","xmm2"); + &pslldq ("xmm2",4); + &pxor ("xmm2","xmm3"); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(16,$key),"xmm0"); + + &mov ($rounds,9); + &mov (&DWP(96,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("12rounds",16); + &movq ("xmm2",&QWP(16,"eax")); # remaining 1/3 of *userKey + &cmp ("ebp",1<<28); + &je (&label("12rounds_alt")); + + &mov ($rounds,11); + &$movekey (&QWP(-16,$key),"xmm0"); # round 0 + &aeskeygenassist("xmm1","xmm2",0x01); # round 1,2 + &call (&label("key_192a_cold")); + &aeskeygenassist("xmm1","xmm2",0x02); # round 2,3 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x04); # round 4,5 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x08); # round 5,6 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x10); # round 7,8 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x20); # round 8,9 + &call (&label("key_192b")); + &aeskeygenassist("xmm1","xmm2",0x40); # round 10,11 + &call (&label("key_192a")); + &aeskeygenassist("xmm1","xmm2",0x80); # round 11,12 + &call (&label("key_192b")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(48,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("key_192a",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); +&set_label("key_192a_cold",16); + &movaps ("xmm5","xmm2"); +&set_label("key_192b_warm"); + &shufps ("xmm4","xmm0",0b00010000); + &movdqa ("xmm3","xmm2"); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &pslldq ("xmm3",4); + &xorps ("xmm0","xmm4"); + &pshufd ("xmm1","xmm1",0b01010101); # critical path + &pxor ("xmm2","xmm3"); + &pxor ("xmm0","xmm1"); + &pshufd ("xmm3","xmm0",0b11111111); + &pxor ("xmm2","xmm3"); + &ret(); + +&set_label("key_192b",16); + &movaps ("xmm3","xmm0"); + &shufps ("xmm5","xmm0",0b01000100); + &$movekey (&QWP(0,$key),"xmm5"); + &shufps ("xmm3","xmm2",0b01001110); + &$movekey (&QWP(16,$key),"xmm3"); + &lea ($key,&DWP(32,$key)); + &jmp (&label("key_192b_warm")); + +&set_label("12rounds_alt",16); + &movdqa ("xmm5",&QWP(0x10,"ebx")); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &mov ($rounds,8); + &movdqu (&QWP(-16,$key),"xmm0"); + +&set_label("loop_key192"); + &movq (&QWP(0,$key),"xmm2"); + &movdqa ("xmm1","xmm2"); + &pshufb ("xmm2","xmm5"); + &aesenclast ("xmm2","xmm4"); + &pslld ("xmm4",1); + &lea ($key,&DWP(24,$key)); + + &movdqa ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm0","xmm3"); + + &pshufd ("xmm3","xmm0",0xff); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + + &pxor ("xmm0","xmm2"); + &pxor ("xmm2","xmm3"); + &movdqu (&QWP(-16,$key),"xmm0"); + + &dec ($rounds); + &jnz (&label("loop_key192")); + + &mov ($rounds,11); + &mov (&DWP(32,$key),$rounds); + + &jmp (&label("good_key")); + +&set_label("14rounds",16); + &movups ("xmm2",&QWP(16,"eax")); # remaining half of *userKey + &lea 
($key,&DWP(16,$key)); + &cmp ("ebp",1<<28); + &je (&label("14rounds_alt")); + + &mov ($rounds,13); + &$movekey (&QWP(-32,$key),"xmm0"); # round 0 + &$movekey (&QWP(-16,$key),"xmm2"); # round 1 + &aeskeygenassist("xmm1","xmm2",0x01); # round 2 + &call (&label("key_256a_cold")); + &aeskeygenassist("xmm1","xmm0",0x01); # round 3 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x02); # round 4 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x02); # round 5 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x04); # round 6 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x04); # round 7 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x08); # round 8 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x08); # round 9 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x10); # round 10 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x10); # round 11 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x20); # round 12 + &call (&label("key_256a")); + &aeskeygenassist("xmm1","xmm0",0x20); # round 13 + &call (&label("key_256b")); + &aeskeygenassist("xmm1","xmm2",0x40); # round 14 + &call (&label("key_256a")); + &$movekey (&QWP(0,$key),"xmm0"); + &mov (&DWP(16,$key),$rounds); + &xor ("eax","eax"); + + &jmp (&label("good_key")); + +&set_label("key_256a",16); + &$movekey (&QWP(0,$key),"xmm2"); + &lea ($key,&DWP(16,$key)); +&set_label("key_256a_cold"); + &shufps ("xmm4","xmm0",0b00010000); + &xorps ("xmm0","xmm4"); + &shufps ("xmm4","xmm0",0b10001100); + &xorps ("xmm0","xmm4"); + &shufps ("xmm1","xmm1",0b11111111); # critical path + &xorps ("xmm0","xmm1"); + &ret(); + +&set_label("key_256b",16); + &$movekey (&QWP(0,$key),"xmm0"); + &lea ($key,&DWP(16,$key)); + + &shufps ("xmm4","xmm2",0b00010000); + &xorps ("xmm2","xmm4"); + &shufps ("xmm4","xmm2",0b10001100); + &xorps ("xmm2","xmm4"); + &shufps ("xmm1","xmm1",0b10101010); # critical path + &xorps ("xmm2","xmm1"); + &ret(); + +&set_label("14rounds_alt",16); + &movdqa ("xmm5",&QWP(0x00,"ebx")); + &movdqa ("xmm4",&QWP(0x20,"ebx")); + &mov ($rounds,7); + &movdqu (&QWP(-32,$key),"xmm0"); + &movdqa ("xmm1","xmm2"); + &movdqu (&QWP(-16,$key),"xmm2"); + +&set_label("loop_key256"); + &pshufb ("xmm2","xmm5"); + &aesenclast ("xmm2","xmm4"); + + &movdqa ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm3","xmm0"); + &pslldq ("xmm0",4); + &pxor ("xmm0","xmm3"); + &pslld ("xmm4",1); + + &pxor ("xmm0","xmm2"); + &movdqu (&QWP(0,$key),"xmm0"); + + &dec ($rounds); + &jz (&label("done_key256")); + + &pshufd ("xmm2","xmm0",0xff); + &pxor ("xmm3","xmm3"); + &aesenclast ("xmm2","xmm3"); + + &movdqa ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm3","xmm1"); + &pslldq ("xmm1",4); + &pxor ("xmm1","xmm3"); + + &pxor ("xmm2","xmm1"); + &movdqu (&QWP(16,$key),"xmm2"); + &lea ($key,&DWP(32,$key)); + &movdqa ("xmm1","xmm2"); + &jmp (&label("loop_key256")); + +&set_label("done_key256"); + &mov ($rounds,13); + &mov (&DWP(16,$key),$rounds); + +&set_label("good_key"); + &pxor ("xmm0","xmm0"); + &pxor ("xmm1","xmm1"); + &pxor ("xmm2","xmm2"); + &pxor ("xmm3","xmm3"); + &pxor ("xmm4","xmm4"); + &pxor ("xmm5","xmm5"); + &xor ("eax","eax"); + &pop ("ebx"); + &pop ("ebp"); + &ret (); + +&set_label("bad_pointer",4); + &mov ("eax",-1); + &pop ("ebx"); + &pop ("ebp"); + &ret (); +&set_label("bad_keybits",4); + &pxor ("xmm0","xmm0"); + &mov ("eax",-2); + &pop ("ebx"); + 
&pop ("ebp"); + &ret (); +&function_end_B("_aesni_set_encrypt_key"); + +# int $PREFIX_set_encrypt_key (const unsigned char *userKey, int bits, +# AES_KEY *key) +&function_begin_B("${PREFIX}_set_encrypt_key"); + &mov ("eax",&wparam(0)); + &mov ($rounds,&wparam(1)); + &mov ($key,&wparam(2)); + &call ("_aesni_set_encrypt_key"); + &ret (); +&function_end_B("${PREFIX}_set_encrypt_key"); + +# int $PREFIX_set_decrypt_key (const unsigned char *userKey, int bits, +# AES_KEY *key) +&function_begin_B("${PREFIX}_set_decrypt_key"); + &mov ("eax",&wparam(0)); + &mov ($rounds,&wparam(1)); + &mov ($key,&wparam(2)); + &call ("_aesni_set_encrypt_key"); + &mov ($key,&wparam(2)); + &shl ($rounds,4); # rounds-1 after _aesni_set_encrypt_key + &test ("eax","eax"); + &jnz (&label("dec_key_ret")); + &lea ("eax",&DWP(16,$key,$rounds)); # end of key schedule + + &$movekey ("xmm0",&QWP(0,$key)); # just swap + &$movekey ("xmm1",&QWP(0,"eax")); + &$movekey (&QWP(0,"eax"),"xmm0"); + &$movekey (&QWP(0,$key),"xmm1"); + &lea ($key,&DWP(16,$key)); + &lea ("eax",&DWP(-16,"eax")); + +&set_label("dec_key_inverse"); + &$movekey ("xmm0",&QWP(0,$key)); # swap and inverse + &$movekey ("xmm1",&QWP(0,"eax")); + &aesimc ("xmm0","xmm0"); + &aesimc ("xmm1","xmm1"); + &lea ($key,&DWP(16,$key)); + &lea ("eax",&DWP(-16,"eax")); + &$movekey (&QWP(16,"eax"),"xmm0"); + &$movekey (&QWP(-16,$key),"xmm1"); + &cmp ("eax",$key); + &ja (&label("dec_key_inverse")); + + &$movekey ("xmm0",&QWP(0,$key)); # inverse middle + &aesimc ("xmm0","xmm0"); + &$movekey (&QWP(0,$key),"xmm0"); + + &pxor ("xmm0","xmm0"); + &pxor ("xmm1","xmm1"); + &xor ("eax","eax"); # return success +&set_label("dec_key_ret"); + &ret (); +&function_end_B("${PREFIX}_set_decrypt_key"); + +&set_label("key_const",64); +&data_word(0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d); +&data_word(0x04070605,0x04070605,0x04070605,0x04070605); +&data_word(1,1,1,1); +&data_word(0x1b,0x1b,0x1b,0x1b); +&asciz("AES for Intel AES-NI, CRYPTOGAMS by "); + +&asm_finish(); + +close STDOUT; diff --git a/crypto/aesgcm/aesni-x86_64.pl b/crypto/aesgcm/aesni-x86_64.pl new file mode 100644 index 0000000..252c485 --- /dev/null +++ b/crypto/aesgcm/aesni-x86_64.pl @@ -0,0 +1,5136 @@ +#! /usr/bin/env perl +# Copyright 2009-2016 The OpenSSL Project Authors. All Rights Reserved. + +# Ludde note : This is the regular AES code. + +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements support for Intel AES-NI extension. In +# OpenSSL context it's used with Intel engine, but can also be used as +# drop-in replacement for crypto/aes/asm/aes-x86_64.pl [see below for +# details]. +# +# Performance. +# +# Given aes(enc|dec) instructions' latency asymptotic performance for +# non-parallelizable modes such as CBC encrypt is 3.75 cycles per byte +# processed with 128-bit key. And given their throughput asymptotic +# performance for parallelizable modes is 1.25 cycles per byte. 
Being an
+# asymptotic limit, it's not something you commonly achieve in reality,
+# but how close does one get? Below are results collected for
+# different modes and block sizes. Pairs of numbers are for en-/
+# decryption.
+#
+#	16-byte    64-byte    256-byte   1-KB       8-KB
+# ECB	4.25/4.25  1.38/1.38  1.28/1.28  1.26/1.26  1.26/1.26
+# CTR	5.42/5.42  1.92/1.92  1.44/1.44  1.28/1.28  1.26/1.26
+# CBC	4.38/4.43  4.15/1.43  4.07/1.32  4.07/1.29  4.06/1.28
+# CCM	5.66/9.42  4.42/5.41  4.16/4.40  4.09/4.15  4.06/4.07
+# OFB	5.42/5.42  4.64/4.64  4.44/4.44  4.39/4.39  4.38/4.38
+# CFB	5.73/5.85  5.56/5.62  5.48/5.56  5.47/5.55  5.47/5.55
+#
+# ECB, CTR, CBC and CCM results are free from EVP overhead. This means
+# that the otherwise used 'openssl speed -evp aes-128-??? -engine aesni
+# [-decrypt]' will exhibit 10-15% worse results for smaller blocks.
+# The results were collected with a specially crafted speed.c benchmark
+# in order to compare them with results reported in the "Intel Advanced
+# Encryption Standard (AES) New Instruction Set" White Paper Revision
+# 3.0 dated May 2010. All above results are consistently better. This
+# module also provides better performance for block sizes smaller than
+# 128 bytes at points *not* represented in the above table.
+#
+# Looking at the results for the 8-KB buffer.
+#
+# CFB and OFB results are far from the limit, because the implementation
+# uses "generic" CRYPTO_[c|o]fb128_encrypt interfaces relying on
+# single-block aesni_encrypt, which is not the most optimal way to go.
+# The CBC encrypt result is unexpectedly high and there is no documented
+# explanation for it. Seemingly there is a small penalty for feeding
+# the result back to the AES unit the way it's done in CBC mode. There is
+# nothing one can do and the result appears optimal. The CCM result is
+# identical to CBC, because CBC-MAC is essentially CBC encrypt without
+# saving output. CCM's CTR "stays invisible," because it's neatly
+# interleaved with CBC-MAC. This provides ~30% improvement over a
+# "straightforward" CCM implementation with CTR and CBC-MAC performed
+# disjointly. Parallelizable modes practically achieve the theoretical
+# limit.
+#
+# Looking at how results vary with buffer size.
+#
+# Curves are practically saturated at 1-KB buffer size. In most cases
+# "256-byte" performance is >95%, and "64-byte" is ~90%, of the "8-KB" one.
+# The CTR curve doesn't follow this pattern and is the slowest-changing
+# one, with the "256-byte" result being 87% of "8-KB." This is because the
+# overhead in CTR mode is the most computationally intensive. Small-block
+# CCM decrypt is slower than encrypt, because the first CTR and last
+# CBC-MAC iterations can't be interleaved.
+#
+# Results for 192- and 256-bit keys.
+#
+# EVP-free results were observed to scale perfectly with the number of
+# rounds for larger block sizes, i.e. the 192-bit result being 10/12 times
+# lower and the 256-bit one - 10/14. Well, in the CBC encrypt case the
+# differences are a tad smaller, because the above-mentioned penalty biases
+# all results by the same constant value. In a similar way function call
+# overhead affects small-block performance, as well as OFB and CFB
+# results. Differences are not large; the most common coefficients are
+# 10/11.7 and 10/13.4 (as opposed to 10/12.0 and 10/14.0), but one can
+# observe even 10/11.2 and 10/12.4 (CTR, OFB, CFB)...
+
+# January 2011
+#
+# While the Westmere processor features 6-cycle latency for aes[enc|dec]
+# instructions, which can be scheduled every second cycle, Sandy
+# Bridge spends 8 cycles per instruction, but it can schedule them
+# every cycle.
This means that code targeting Westmere would perform
+# suboptimally on Sandy Bridge. Therefore this update.
+#
+# In addition, non-parallelizable CBC encrypt (as well as CCM) is
+# optimized. The relative improvement might appear modest, 8% on Westmere,
+# but in absolute terms it's 3.77 cycles per byte encrypted with a
+# 128-bit key on Westmere, and 5.07 on Sandy Bridge. These numbers
+# should be compared to the asymptotic limits of 3.75 for Westmere and
+# 5.00 for Sandy Bridge. Actually, the fact that they get this close
+# to the asymptotic limits is quite amazing. Indeed, the limit is
+# calculated as latency times number of rounds, 10 for 128-bit key,
+# divided by 16, the number of bytes in a block; in other words
+# it accounts *solely* for aesenc instructions. But there are extra
+# instructions, and numbers so close to the asymptotic limits mean
+# that it's as if it takes as little as *one* additional cycle to
+# execute all of them. How is that possible? It is possible thanks to
+# out-of-order execution logic, which manages to overlap the post-
+# processing of the previous block, things like saving the output, with
+# the actual encryption of the current block, as well as the pre-processing
+# of the current block, things like fetching input and xor-ing it with the
+# 0-round element of the key schedule, with the actual encryption of the
+# previous block. Keep this in mind...
+#
+# For parallelizable modes, such as ECB, CBC decrypt and CTR, higher
+# performance is achieved by interleaving instructions working on
+# independent blocks, in which case the asymptotic limit for such modes
+# can be obtained by dividing the above-mentioned numbers by the AES
+# instructions' interleave factor. Westmere can execute at most 3
+# instructions at a time, meaning that the optimal interleave factor is 3,
+# and that's where the "magic" number of 1.25 comes from. "Optimal
+# interleave factor" means that a further increase of the interleave
+# factor does not improve performance. The formula has proven to reflect
+# reality pretty well on Westmere... Sandy Bridge on the other hand can
+# execute up to 8 AES instructions at a time, so how does varying the
+# interleave factor affect performance? Here is a table for ECB
+# (numbers are cycles per byte processed with a 128-bit key):
+#
+# instruction interleave factor		3x	6x	8x
+# theoretical asymptotic limit		1.67	0.83	0.625
+# measured performance for 8KB block	1.05	0.86	0.84
+#
+# "as if" interleave factor		4.7x	5.8x	6.0x
+#
+# Further data for other parallelizable modes:
+#
+# CBC decrypt				1.16	0.93	0.74
+# CTR					1.14	0.91	0.74
+#
+# Well, given the 3x column it's probably inappropriate to call the limit
+# asymptotic if it can be surpassed, isn't it? What happens there?
+# Rewind to the CBC paragraph for the answer. Yes, out-of-order execution
+# magic is responsible for this. The processor overlaps not only the
+# additional instructions with the AES ones, but even AES instructions
+# processing adjacent triplets of independent blocks. In the 6x case the
+# additional instructions still claim a disproportionally small amount
+# of additional cycles, but in the 8x case the number of instructions must
+# be a tad too high for the out-of-order logic to cope with, and the AES
+# unit remains underutilized... As you can see, 8x interleave is hardly
+# justifiable, so there is no need to feel bad that 32-bit aesni-x86.pl
+# utilizes 6x interleave because of its limited register bank capacity.
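To make the limit arithmetic above concrete: the serial limit is latency times rounds divided by the 16-byte block, and the parallelizable limit additionally divides the latency by the effective interleave factor, itself capped by how many aes[enc|dec] instructions can be in flight (latency divided by issue interval). A minimal standalone C sketch of that arithmetic, using only the latency/issue figures quoted in these comments (an illustration, not code from this module):

#include <stdio.h>

/* serial limit: latency * rounds / 16 bytes per block */
static double serial_cpb(double latency, int rounds) {
    return latency * rounds / 16.0;
}

/* parallel limit: latency is hidden across min(interleave, in-flight)
 * independent blocks, where in-flight = latency / issue_interval */
static double parallel_cpb(double latency, double issue_interval,
                           int rounds, int interleave) {
    double in_flight = latency / issue_interval;
    double eff = interleave < in_flight ? interleave : in_flight;
    return latency / eff * rounds / 16.0;
}

int main(void) {
    /* Westmere: 6-cycle latency, issue every 2nd cycle, AES-128 = 10 rounds */
    printf("Westmere CBC enc %.2f\n", serial_cpb(6, 10));          /* 3.75  */
    printf("Westmere ECB 3x  %.2f\n", parallel_cpb(6, 2, 10, 3));  /* 1.25  */
    /* Sandy Bridge: 8-cycle latency, issue every cycle */
    printf("SNB CBC enc      %.2f\n", serial_cpb(8, 10));          /* 5.00  */
    printf("SNB ECB 3x       %.2f\n", parallel_cpb(8, 1, 10, 3));  /* 1.67  */
    printf("SNB ECB 8x       %.3f\n", parallel_cpb(8, 1, 10, 8));  /* 0.625 */
    return 0;
}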
+#
+# Higher interleave factors do have a negative impact on Westmere
+# performance: while for ECB mode it's negligible, ~1.5%, other
+# parallelizable modes perform ~5% worse, which is outweighed by ~25%
+# improvement on Sandy Bridge. To balance the regression on Westmere,
+# CTR mode was implemented with a 6x aesenc interleave factor.
+
+# April 2011
+#
+# Add aesni_xts_[en|de]crypt. Westmere spends 1.25 cycles processing
+# one byte out of 8KB with 128-bit key, Sandy Bridge - 0.90. Just like
+# in CTR mode the AES instruction interleave factor was chosen to be 6x.
+
+# November 2015
+#
+# Add aesni_ocb_[en|de]crypt. The AES instruction interleave factor was
+# chosen to be 6x.
+
+######################################################################
+# Current large-block performance in cycles per byte processed with
+# 128-bit key (less is better).
+#
+#		CBC en-/decrypt	CTR	XTS	ECB	OCB
+# Westmere	3.77/1.25	1.25	1.25	1.26
+# Sandy Bridge	5.07/0.74	0.75	0.90	0.85	0.98
+# Haswell	4.44/0.63	0.63	0.73	0.63	0.70
+# Skylake	2.62/0.63	0.63	0.63	0.63
+# Silvermont	5.75/3.54	3.56	4.12	3.87(*)	4.11
+# Knights L	2.54/0.77	0.78	0.85	-	1.50
+# Goldmont	3.82/1.26	1.26	1.29	1.29	1.50
+# Bulldozer	5.77/0.70	0.72	0.90	0.70	0.95
+# Ryzen		2.71/0.35	0.35	0.44	0.38	0.49
+#
+# (*)	Atom Silvermont ECB result is suboptimal because of penalties
+#	incurred by operations on %xmm8-15. As ECB is not considered
+#	critical, nothing was done to mitigate the problem.
+
+$PREFIX="aesni";	# if $PREFIX is set to "AES", the script
+			# generates drop-in replacement for
+			# crypto/aes/asm/aes-x86_64.pl:-)
+
+$flavour = shift;
+$output  = shift;
+if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
+
+$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or
+die "can't locate x86_64-xlate.pl";
+
+open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
+*STDOUT=*OUT;
+
+$movkey = $PREFIX eq "aesni" ? "movups" : "movups";
+@_4args=$win64?	("%rcx","%rdx","%r8", "%r9") :	# Win64 order
+		("%rdi","%rsi","%rdx","%rcx");	# Unix order
+
+$code=".text\n";
+#$code.=".extern	OPENSSL_ia32cap_P\n";
+
+$rounds="%eax";	# input to and changed by aesni_[en|de]cryptN !!!
+# this is the natural Unix argument order for public $PREFIX_[ecb|cbc]_encrypt ...
+$inp="%rdi";
+$out="%rsi";
+$len="%rdx";
+$key="%rcx";	# input to and changed by aesni_[en|de]cryptN !!!
+$ivp="%r8";	# cbc, ctr, ...
+
+$rnds_="%r10d";	# backup copy for $rounds
+$key_="%r11";	# backup copy for $key
+
+# %xmm register layout
+$rndkey0="%xmm0";	$rndkey1="%xmm1";
+$inout0="%xmm2";	$inout1="%xmm3";
+$inout2="%xmm4";	$inout3="%xmm5";
+$inout4="%xmm6";	$inout5="%xmm7";
+$inout6="%xmm8";	$inout7="%xmm9";
+
+$in2="%xmm6";		$in1="%xmm7";	# used in CBC decrypt, CTR, ...
+$in0="%xmm8";		$iv="%xmm9";
+
+# Inline version of internal aesni_[en|de]crypt1.
+#
+# Why a folded loop? Because aes[enc|dec] is slow enough to accommodate
+# the cycles which take care of loop variables...
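Before the generator itself, a hedged plain-C reference for what the folded loop expands to at run time (a sketch, not the generated code; "rk" stands for the expanded key schedule and "rounds" for key->rounds, i.e. 10/12/14):

#include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

/* Rough equivalent of the folded loop emitted by aesni_generate1:
 * whitening xor with round key 0, one aesenc per middle round,
 * aesenclast for the final round. */
static __m128i aesni_encrypt1_sketch(__m128i block,
                                     const __m128i *rk, int rounds)
{
    block = _mm_xor_si128(block, rk[0]);
    for (int i = 1; i < rounds; i++)
        block = _mm_aesenc_si128(block, rk[i]);
    return _mm_aesenclast_si128(block, rk[rounds]);
}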
+{ my $sn;
+sub aesni_generate1 {
+my ($p,$key,$rounds,$inout,$ivec)=@_;	$inout=$inout0 if (!defined($inout));
+++$sn;
+$code.=<<___;
+	$movkey	($key),$rndkey0
+	$movkey	16($key),$rndkey1
+___
+$code.=<<___ if (defined($ivec));
+	xorps	$rndkey0,$ivec
+	lea	32($key),$key
+	xorps	$ivec,$inout
+___
+$code.=<<___ if (!defined($ivec));
+	lea	32($key),$key
+	xorps	$rndkey0,$inout
+___
+$code.=<<___;
+.Loop_${p}1_$sn:
+	aes${p}	$rndkey1,$inout
+	dec	$rounds
+	$movkey	($key),$rndkey1
+	lea	16($key),$key
+	jnz	.Loop_${p}1_$sn	# loop body is 16 bytes
+	aes${p}last	$rndkey1,$inout
+___
+}}
+# void $PREFIX_[en|de]crypt (const void *inp,void *out,const AES_KEY *key);
+#
+{ my ($inp,$out,$key) = @_4args;
+
+$code.=<<___;
+.globl	${PREFIX}_encrypt
+.type	${PREFIX}_encrypt,\@abi-omnipotent
+.align	16
+${PREFIX}_encrypt:
+	movups	($inp),$inout0		# load input
+	mov	240($key),$rounds	# key->rounds
+___
+	&aesni_generate1("enc",$key,$rounds);
+$code.=<<___;
+	pxor	$rndkey0,$rndkey0	# clear register bank
+	pxor	$rndkey1,$rndkey1
+	movups	$inout0,($out)		# output
+	pxor	$inout0,$inout0
+	ret
+.size	${PREFIX}_encrypt,.-${PREFIX}_encrypt
+
+.globl	${PREFIX}_decrypt
+.type	${PREFIX}_decrypt,\@abi-omnipotent
+.align	16
+${PREFIX}_decrypt:
+	movups	($inp),$inout0		# load input
+	mov	240($key),$rounds	# key->rounds
+___
+	&aesni_generate1("dec",$key,$rounds);
+$code.=<<___;
+	pxor	$rndkey0,$rndkey0	# clear register bank
+	pxor	$rndkey1,$rndkey1
+	movups	$inout0,($out)		# output
+	pxor	$inout0,$inout0
+	ret
+.size	${PREFIX}_decrypt, .-${PREFIX}_decrypt
+___
+}
+
+# _aesni_[en|de]cryptN are private interfaces, N denotes the interleave
+# factor. Why were 3x subroutines originally used in loops? Even though
+# aes[enc|dec] latency was originally 6, it could be scheduled only
+# every *2nd* cycle. Thus 3x interleave was the one providing optimal
+# utilization, i.e. when the subroutine's throughput is virtually the same
+# as that of a non-interleaved subroutine [for up to 3 input blocks].
+# This is why it originally made no sense to implement a 2x subroutine.
+# But times change and it became appropriate to spend the extra 192 bytes
+# on a 2x subroutine for Atom Silvermont's sake. For processors that
+# can schedule aes[enc|dec] every cycle the optimal interleave factor
+# equals the corresponding instruction's latency. 8x is optimal for
+# Sandy Bridge and "super-optimal" for other Intel CPUs...
+
+sub aesni_generate2 {
+my $dir=shift;
+# As already mentioned it takes in $key and $rounds, which are *not*
+# preserved. $inout[0-1] is cipher/clear text...
+$code.=<<___;
+.type	_aesni_${dir}rypt2,\@abi-omnipotent
+.align	16
+_aesni_${dir}rypt2:
+	$movkey	($key),$rndkey0
+	shl	\$4,$rounds
+	$movkey	16($key),$rndkey1
+	xorps	$rndkey0,$inout0
+	xorps	$rndkey0,$inout1
+	$movkey	32($key),$rndkey0
+	lea	32($key,$rounds),$key
+	neg	%rax			# $rounds
+	add	\$16,%rax
+
+.L${dir}_loop2:
+	aes${dir}	$rndkey1,$inout0
+	aes${dir}	$rndkey1,$inout1
+	$movkey	($key,%rax),$rndkey1
+	add	\$32,%rax
+	aes${dir}	$rndkey0,$inout0
+	aes${dir}	$rndkey0,$inout1
+	$movkey	-16($key,%rax),$rndkey0
+	jnz	.L${dir}_loop2
+
+	aes${dir}	$rndkey1,$inout0
+	aes${dir}	$rndkey1,$inout1
+	aes${dir}last	$rndkey0,$inout0
+	aes${dir}last	$rndkey0,$inout1
+	ret
+.size	_aesni_${dir}rypt2,.-_aesni_${dir}rypt2
+___
+}
+sub aesni_generate3 {
+my $dir=shift;
+# As already mentioned it takes in $key and $rounds, which are *not*
+# preserved. $inout[0-2] is cipher/clear text...
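The 3x body below, and the 4x/6x/8x ones after it, all share the shape of _aesni_encrypt2/_aesni_decrypt2 above; as a hedged C illustration of the interleave idea for N=2 (hypothetical helper, not the generated code; the blocks must be independent, as in ECB, CBC decrypt or CTR):

#include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

/* Two independent aesenc chains: the CPU can overlap their 6-8 cycle
 * latencies instead of stalling between dependent rounds. */
static void aesni_encrypt2_sketch(__m128i *b0, __m128i *b1,
                                  const __m128i *rk, int rounds)
{
    __m128i x0 = _mm_xor_si128(*b0, rk[0]);
    __m128i x1 = _mm_xor_si128(*b1, rk[0]);
    for (int i = 1; i < rounds; i++) {
        x0 = _mm_aesenc_si128(x0, rk[i]);
        x1 = _mm_aesenc_si128(x1, rk[i]);
    }
    *b0 = _mm_aesenclast_si128(x0, rk[rounds]);
    *b1 = _mm_aesenclast_si128(x1, rk[rounds]);
}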
+$code.=<<___; +.type _aesni_${dir}rypt3,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt3: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + xorps $rndkey0,$inout2 + $movkey 32($key),$rndkey0 + lea 32($key,$rounds),$key + neg %rax # $rounds + add \$16,%rax + +.L${dir}_loop3: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop3 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + ret +.size _aesni_${dir}rypt3,.-_aesni_${dir}rypt3 +___ +} +# 4x interleave is implemented to improve small block performance, +# most notably [and naturally] 4 block by ~30%. One can argue that one +# should have implemented 5x as well, but improvement would be <20%, +# so it's not worth it... +sub aesni_generate4 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-3] is cipher/clear text... +$code.=<<___; +.type _aesni_${dir}rypt4,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt4: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + xorps $rndkey0,$inout2 + xorps $rndkey0,$inout3 + $movkey 32($key),$rndkey0 + lea 32($key,$rounds),$key + neg %rax # $rounds + .byte 0x0f,0x1f,0x00 + add \$16,%rax + +.L${dir}_loop4: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop4 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + ret +.size _aesni_${dir}rypt4,.-_aesni_${dir}rypt4 +___ +} +sub aesni_generate6 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-5] is cipher/clear text... 
+$code.=<<___; +.type _aesni_${dir}rypt6,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt6: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + pxor $rndkey0,$inout1 + pxor $rndkey0,$inout2 + aes${dir} $rndkey1,$inout0 + lea 32($key,$rounds),$key + neg %rax # $rounds + aes${dir} $rndkey1,$inout1 + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + aes${dir} $rndkey1,$inout2 + pxor $rndkey0,$inout5 + $movkey ($key,%rax),$rndkey0 + add \$16,%rax + jmp .L${dir}_loop6_enter +.align 16 +.L${dir}_loop6: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 +.L${dir}_loop6_enter: + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + aes${dir} $rndkey0,$inout4 + aes${dir} $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop6 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + aes${dir}last $rndkey0,$inout4 + aes${dir}last $rndkey0,$inout5 + ret +.size _aesni_${dir}rypt6,.-_aesni_${dir}rypt6 +___ +} +sub aesni_generate8 { +my $dir=shift; +# As already mentioned it takes in $key and $rounds, which are *not* +# preserved. $inout[0-7] is cipher/clear text... +$code.=<<___; +.type _aesni_${dir}rypt8,\@abi-omnipotent +.align 16 +_aesni_${dir}rypt8: + $movkey ($key),$rndkey0 + shl \$4,$rounds + $movkey 16($key),$rndkey1 + xorps $rndkey0,$inout0 + xorps $rndkey0,$inout1 + pxor $rndkey0,$inout2 + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + lea 32($key,$rounds),$key + neg %rax # $rounds + aes${dir} $rndkey1,$inout0 + pxor $rndkey0,$inout5 + pxor $rndkey0,$inout6 + aes${dir} $rndkey1,$inout1 + pxor $rndkey0,$inout7 + $movkey ($key,%rax),$rndkey0 + add \$16,%rax + jmp .L${dir}_loop8_inner +.align 16 +.L${dir}_loop8: + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 +.L${dir}_loop8_inner: + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir} $rndkey1,$inout6 + aes${dir} $rndkey1,$inout7 +.L${dir}_loop8_enter: + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aes${dir} $rndkey0,$inout0 + aes${dir} $rndkey0,$inout1 + aes${dir} $rndkey0,$inout2 + aes${dir} $rndkey0,$inout3 + aes${dir} $rndkey0,$inout4 + aes${dir} $rndkey0,$inout5 + aes${dir} $rndkey0,$inout6 + aes${dir} $rndkey0,$inout7 + $movkey -16($key,%rax),$rndkey0 + jnz .L${dir}_loop8 + + aes${dir} $rndkey1,$inout0 + aes${dir} $rndkey1,$inout1 + aes${dir} $rndkey1,$inout2 + aes${dir} $rndkey1,$inout3 + aes${dir} $rndkey1,$inout4 + aes${dir} $rndkey1,$inout5 + aes${dir} $rndkey1,$inout6 + aes${dir} $rndkey1,$inout7 + aes${dir}last $rndkey0,$inout0 + aes${dir}last $rndkey0,$inout1 + aes${dir}last $rndkey0,$inout2 + aes${dir}last $rndkey0,$inout3 + aes${dir}last $rndkey0,$inout4 + aes${dir}last $rndkey0,$inout5 + aes${dir}last $rndkey0,$inout6 + aes${dir}last $rndkey0,$inout7 + ret +.size _aesni_${dir}rypt8,.-_aesni_${dir}rypt8 +___ +} +&aesni_generate2("enc") if ($PREFIX eq "aesni"); +&aesni_generate2("dec"); +&aesni_generate3("enc") if ($PREFIX eq "aesni"); +&aesni_generate3("dec"); +&aesni_generate4("enc") if ($PREFIX eq "aesni"); 
+&aesni_generate4("dec"); +&aesni_generate6("enc") if ($PREFIX eq "aesni"); +&aesni_generate6("dec"); +&aesni_generate8("enc") if ($PREFIX eq "aesni"); +&aesni_generate8("dec"); + +if ($PREFIX eq "aesni") { +if (0) { +######################################################################## +# void aesni_ecb_encrypt (const void *in, void *out, +# size_t length, const AES_KEY *key, +# int enc); +$code.=<<___; +.globl aesni_ecb_encrypt +.type aesni_ecb_encrypt,\@function,5 +.align 16 +aesni_ecb_encrypt: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # offload $inout4..7 + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) +.Lecb_enc_body: +___ +$code.=<<___; + and \$-16,$len # if ($len<16) + jz .Lecb_ret # return + + mov 240($key),$rounds # key->rounds + $movkey ($key),$rndkey0 + mov $key,$key_ # backup $key + mov $rounds,$rnds_ # backup $rounds + test %r8d,%r8d # 5th argument + jz .Lecb_decrypt +#--------------------------- ECB ENCRYPT ------------------------------# + cmp \$0x80,$len # if ($len<8*16) + jb .Lecb_enc_tail # short input + + movdqu ($inp),$inout0 # load 8 input blocks + movdqu 0x10($inp),$inout1 + movdqu 0x20($inp),$inout2 + movdqu 0x30($inp),$inout3 + movdqu 0x40($inp),$inout4 + movdqu 0x50($inp),$inout5 + movdqu 0x60($inp),$inout6 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 + sub \$0x80,$len # $len-=8*16 (can be zero) + jmp .Lecb_enc_loop8_enter +.align 16 +.Lecb_enc_loop8: + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movdqu ($inp),$inout0 # load 8 input blocks + mov $rnds_,$rounds # restore $rounds + movups $inout1,0x10($out) + movdqu 0x10($inp),$inout1 + movups $inout2,0x20($out) + movdqu 0x20($inp),$inout2 + movups $inout3,0x30($out) + movdqu 0x30($inp),$inout3 + movups $inout4,0x40($out) + movdqu 0x40($inp),$inout4 + movups $inout5,0x50($out) + movdqu 0x50($inp),$inout5 + movups $inout6,0x60($out) + movdqu 0x60($inp),$inout6 + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 +.Lecb_enc_loop8_enter: + + call _aesni_encrypt8 + + sub \$0x80,$len + jnc .Lecb_enc_loop8 # loop if $len-=8*16 didn't borrow + + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movups $inout1,0x10($out) + mov $rnds_,$rounds # restore $rounds + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + movups $inout6,0x60($out) + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + add \$0x80,$len # restore real remaining $len + jz .Lecb_ret # done if ($len==0) + +.Lecb_enc_tail: # $len is less than 8*16 + movups ($inp),$inout0 + cmp \$0x20,$len + jb .Lecb_enc_one + movups 0x10($inp),$inout1 + je .Lecb_enc_two + movups 0x20($inp),$inout2 + cmp \$0x40,$len + jb .Lecb_enc_three + movups 0x30($inp),$inout3 + je .Lecb_enc_four + movups 0x40($inp),$inout4 + cmp \$0x60,$len + jb .Lecb_enc_five + movups 0x50($inp),$inout5 + je .Lecb_enc_six + movdqu 0x60($inp),$inout6 + xorps $inout7,$inout7 + call _aesni_encrypt8 + movups $inout0,($out) # store 7 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + movups $inout6,0x60($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_one: +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + movups $inout0,($out) # store one output block + jmp .Lecb_ret +.align 16 +.Lecb_enc_two: + call 
_aesni_encrypt2 + movups $inout0,($out) # store 2 output blocks + movups $inout1,0x10($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_three: + call _aesni_encrypt3 + movups $inout0,($out) # store 3 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_four: + call _aesni_encrypt4 + movups $inout0,($out) # store 4 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_five: + xorps $inout5,$inout5 + call _aesni_encrypt6 + movups $inout0,($out) # store 5 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + jmp .Lecb_ret +.align 16 +.Lecb_enc_six: + call _aesni_encrypt6 + movups $inout0,($out) # store 6 output blocks + movups $inout1,0x10($out) + movups $inout2,0x20($out) + movups $inout3,0x30($out) + movups $inout4,0x40($out) + movups $inout5,0x50($out) + jmp .Lecb_ret +#--------------------------- ECB DECRYPT ------------------------------# +.align 16 +.Lecb_decrypt: + cmp \$0x80,$len # if ($len<8*16) + jb .Lecb_dec_tail # short input + + movdqu ($inp),$inout0 # load 8 input blocks + movdqu 0x10($inp),$inout1 + movdqu 0x20($inp),$inout2 + movdqu 0x30($inp),$inout3 + movdqu 0x40($inp),$inout4 + movdqu 0x50($inp),$inout5 + movdqu 0x60($inp),$inout6 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 + sub \$0x80,$len # $len-=8*16 (can be zero) + jmp .Lecb_dec_loop8_enter +.align 16 +.Lecb_dec_loop8: + movups $inout0,($out) # store 8 output blocks + mov $key_,$key # restore $key + movdqu ($inp),$inout0 # load 8 input blocks + mov $rnds_,$rounds # restore $rounds + movups $inout1,0x10($out) + movdqu 0x10($inp),$inout1 + movups $inout2,0x20($out) + movdqu 0x20($inp),$inout2 + movups $inout3,0x30($out) + movdqu 0x30($inp),$inout3 + movups $inout4,0x40($out) + movdqu 0x40($inp),$inout4 + movups $inout5,0x50($out) + movdqu 0x50($inp),$inout5 + movups $inout6,0x60($out) + movdqu 0x60($inp),$inout6 + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + movdqu 0x70($inp),$inout7 + lea 0x80($inp),$inp # $inp+=8*16 +.Lecb_dec_loop8_enter: + + call _aesni_decrypt8 + + $movkey ($key_),$rndkey0 + sub \$0x80,$len + jnc .Lecb_dec_loop8 # loop if $len-=8*16 didn't borrow + + movups $inout0,($out) # store 8 output blocks + pxor $inout0,$inout0 # clear register bank + mov $key_,$key # restore $key + movups $inout1,0x10($out) + pxor $inout1,$inout1 + mov $rnds_,$rounds # restore $rounds + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + movups $inout6,0x60($out) + pxor $inout6,$inout6 + movups $inout7,0x70($out) + pxor $inout7,$inout7 + lea 0x80($out),$out # $out+=8*16 + add \$0x80,$len # restore real remaining $len + jz .Lecb_ret # done if ($len==0) + +.Lecb_dec_tail: + movups ($inp),$inout0 + cmp \$0x20,$len + jb .Lecb_dec_one + movups 0x10($inp),$inout1 + je .Lecb_dec_two + movups 0x20($inp),$inout2 + cmp \$0x40,$len + jb .Lecb_dec_three + movups 0x30($inp),$inout3 + je .Lecb_dec_four + movups 0x40($inp),$inout4 + cmp \$0x60,$len + jb .Lecb_dec_five + movups 0x50($inp),$inout5 + je .Lecb_dec_six + movups 0x60($inp),$inout6 + $movkey ($key),$rndkey0 + xorps $inout7,$inout7 + call _aesni_decrypt8 + movups $inout0,($out) # store 7 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor 
$inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + movups $inout6,0x60($out) + pxor $inout6,$inout6 + pxor $inout7,$inout7 + jmp .Lecb_ret +.align 16 +.Lecb_dec_one: +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + movups $inout0,($out) # store one output block + pxor $inout0,$inout0 # clear register bank + jmp .Lecb_ret +.align 16 +.Lecb_dec_two: + call _aesni_decrypt2 + movups $inout0,($out) # store 2 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + jmp .Lecb_ret +.align 16 +.Lecb_dec_three: + call _aesni_decrypt3 + movups $inout0,($out) # store 3 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + jmp .Lecb_ret +.align 16 +.Lecb_dec_four: + call _aesni_decrypt4 + movups $inout0,($out) # store 4 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + jmp .Lecb_ret +.align 16 +.Lecb_dec_five: + xorps $inout5,$inout5 + call _aesni_decrypt6 + movups $inout0,($out) # store 5 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + pxor $inout5,$inout5 + jmp .Lecb_ret +.align 16 +.Lecb_dec_six: + call _aesni_decrypt6 + movups $inout0,($out) # store 6 output blocks + pxor $inout0,$inout0 # clear register bank + movups $inout1,0x10($out) + pxor $inout1,$inout1 + movups $inout2,0x20($out) + pxor $inout2,$inout2 + movups $inout3,0x30($out) + pxor $inout3,$inout3 + movups $inout4,0x40($out) + pxor $inout4,$inout4 + movups $inout5,0x50($out) + pxor $inout5,$inout5 + +.Lecb_ret: + xorps $rndkey0,$rndkey0 # %xmm0 + pxor $rndkey1,$rndkey1 +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lecb_enc_ret: +___ +$code.=<<___; + ret +.size aesni_ecb_encrypt,.-aesni_ecb_encrypt +___ +} +{ +###################################################################### +# void aesni_ccm64_[en|de]crypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec,char *cmac); +# +# Handles only complete blocks, operates on 64-bit counter and +# does not update *ivec! 
Nor does it finalize CMAC value +# (see engine/eng_aesni.c for details) +# +if (0) { +my $cmac="%r9"; # 6th argument + +my $increment="%xmm9"; +my $iv="%xmm6"; +my $bswap_mask="%xmm7"; + +$code.=<<___; +.globl aesni_ccm64_encrypt_blocks +.type aesni_ccm64_encrypt_blocks,\@function,6 +.align 16 +aesni_ccm64_encrypt_blocks: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # $iv + movaps %xmm7,0x10(%rsp) # $bswap_mask + movaps %xmm8,0x20(%rsp) # $in0 + movaps %xmm9,0x30(%rsp) # $increment +.Lccm64_enc_body: +___ +$code.=<<___; + mov 240($key),$rounds # key->rounds + movdqu ($ivp),$iv + movdqa .Lincrement64(%rip),$increment + movdqa .Lbswap_mask(%rip),$bswap_mask + + shl \$4,$rounds + mov \$16,$rnds_ + lea 0($key),$key_ + movdqu ($cmac),$inout1 + movdqa $iv,$inout0 + lea 32($key,$rounds),$key # end of key schedule + pshufb $bswap_mask,$iv + sub %rax,%r10 # twisted $rounds + jmp .Lccm64_enc_outer +.align 16 +.Lccm64_enc_outer: + $movkey ($key_),$rndkey0 + mov %r10,%rax + movups ($inp),$in0 # load inp + + xorps $rndkey0,$inout0 # counter + $movkey 16($key_),$rndkey1 + xorps $in0,$rndkey0 + xorps $rndkey0,$inout1 # cmac^=inp + $movkey 32($key_),$rndkey0 + +.Lccm64_enc2_loop: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + $movkey -16($key,%rax),$rndkey0 + jnz .Lccm64_enc2_loop + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + paddq $increment,$iv + dec $len # $len-- ($len is in blocks) + aesenclast $rndkey0,$inout0 + aesenclast $rndkey0,$inout1 + + lea 16($inp),$inp + xorps $inout0,$in0 # inp ^= E(iv) + movdqa $iv,$inout0 + movups $in0,($out) # save output + pshufb $bswap_mask,$inout0 + lea 16($out),$out # $out+=16 + jnz .Lccm64_enc_outer # loop if ($len!=0) + + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + pxor $inout0,$inout0 + movups $inout1,($cmac) # store resulting mac + pxor $inout1,$inout1 + pxor $in0,$in0 + pxor $iv,$iv +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lccm64_enc_ret: +___ +$code.=<<___; + ret +.size aesni_ccm64_encrypt_blocks,.-aesni_ccm64_encrypt_blocks +___ +###################################################################### +$code.=<<___; +.globl aesni_ccm64_decrypt_blocks +.type aesni_ccm64_decrypt_blocks,\@function,6 +.align 16 +aesni_ccm64_decrypt_blocks: +___ +$code.=<<___ if ($win64); + lea -0x58(%rsp),%rsp + movaps %xmm6,(%rsp) # $iv + movaps %xmm7,0x10(%rsp) # $bswap_mask + movaps %xmm8,0x20(%rsp) # $in8 + movaps %xmm9,0x30(%rsp) # $increment +.Lccm64_dec_body: +___ +$code.=<<___; + mov 240($key),$rounds # key->rounds + movups ($ivp),$iv + movdqu ($cmac),$inout1 + movdqa .Lincrement64(%rip),$increment + movdqa .Lbswap_mask(%rip),$bswap_mask + + movaps $iv,$inout0 + mov $rounds,$rnds_ + mov $key,$key_ + pshufb $bswap_mask,$iv +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + shl \$4,$rnds_ + mov \$16,$rounds + movups ($inp),$in0 # load inp + paddq $increment,$iv + lea 16($inp),$inp # $inp+=16 + sub %r10,%rax # twisted $rounds + lea 32($key_,$rnds_),$key # end of key schedule + mov %rax,%r10 + jmp .Lccm64_dec_outer +.align 16 +.Lccm64_dec_outer: + xorps $inout0,$in0 # inp ^= E(iv) + movdqa $iv,$inout0 + movups $in0,($out) # save output + lea 16($out),$out # $out+=16 + 
pshufb $bswap_mask,$inout0 + + sub \$1,$len # $len-- ($len is in blocks) + jz .Lccm64_dec_break # if ($len==0) break + + $movkey ($key_),$rndkey0 + mov %r10,%rax + $movkey 16($key_),$rndkey1 + xorps $rndkey0,$in0 + xorps $rndkey0,$inout0 + xorps $in0,$inout1 # cmac^=out + $movkey 32($key_),$rndkey0 + jmp .Lccm64_dec2_loop +.align 16 +.Lccm64_dec2_loop: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + $movkey -16($key,%rax),$rndkey0 + jnz .Lccm64_dec2_loop + movups ($inp),$in0 # load input + paddq $increment,$iv + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenclast $rndkey0,$inout0 + aesenclast $rndkey0,$inout1 + lea 16($inp),$inp # $inp+=16 + jmp .Lccm64_dec_outer + +.align 16 +.Lccm64_dec_break: + #xorps $in0,$inout1 # cmac^=out + mov 240($key_),$rounds +___ + &aesni_generate1("enc",$key_,$rounds,$inout1,$in0); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + pxor $inout0,$inout0 + movups $inout1,($cmac) # store resulting mac + pxor $inout1,$inout1 + pxor $in0,$in0 + pxor $iv,$iv +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps %xmm0,(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + lea 0x58(%rsp),%rsp +.Lccm64_dec_ret: +___ +$code.=<<___; + ret +.size aesni_ccm64_decrypt_blocks,.-aesni_ccm64_decrypt_blocks +___ +} +###################################################################### +# void aesni_ctr32_encrypt_blocks (const void *in, void *out, +# size_t blocks, const AES_KEY *key, +# const char *ivec); +# +# Handles only complete blocks, operates on 32-bit counter and +# does not update *ivec! (see crypto/modes/ctr128.c for details) +# +# Overhaul based on suggestions from Shay Gueron and Vlad Krasnov, +# http://rt.openssl.org/Ticket/Display.html?id=3021&user=guest&pass=guest. +# Keywords are full unroll and modulo-schedule counter calculations +# with zero-round key xor. 
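The "zero-round key xor" keyword above means the counter blocks are kept on the stack already xor-ed with round key 0, so each encryption chain can start directly with aesenc. A hedged scalar C sketch of that counter preparation (hypothetical helper; the real code below builds the blocks with pinsrd/movbe and touches only the 32-bit big-endian counter word):

#include <stdint.h>
#include <string.h>

static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00) |
           ((x << 8) & 0xff0000) | (x << 24);
}

/* Build 8 counter blocks: ivec with its last (big-endian) 32-bit word
 * incremented per block, each pre-xor-ed with round key 0. */
static void ctr32_prepare_sketch(uint8_t blocks[8][16],
                                 const uint8_t ivec[16],
                                 const uint8_t rk0[16])
{
    uint32_t ctr;
    memcpy(&ctr, ivec + 12, 4);
    ctr = bswap32(ctr);                    /* counter is big-endian */
    for (int i = 0; i < 8; i++) {
        uint32_t c = bswap32(ctr + (uint32_t)i);
        memcpy(blocks[i], ivec, 16);
        memcpy(blocks[i] + 12, &c, 4);
        for (int j = 0; j < 16; j++)       /* zero-round key xor */
            blocks[i][j] ^= rk0[j];
    }
}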
+{ +my ($in0,$in1,$in2,$in3,$in4,$in5)=map("%xmm$_",(10..15)); +my ($key0,$ctr)=("%ebp","${ivp}d"); +my $frame_size = 0x80 + ($win64?160:0); + +$code.=<<___; +.globl aesni_ctr32_encrypt_blocks +.type aesni_ctr32_encrypt_blocks,\@function,5 +.align 16 +aesni_ctr32_encrypt_blocks: +.cfi_startproc + cmp \$1,$len + jne .Lctr32_bulk + + # handle single block without allocating stack frame, + # useful when handling edges + movups ($ivp),$inout0 + movups ($inp),$inout1 + mov 240($key),%edx # key->rounds +___ + &aesni_generate1("enc",$key,"%edx"); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + xorps $inout1,$inout0 + pxor $inout1,$inout1 + movups $inout0,($out) + xorps $inout0,$inout0 + jmp .Lctr32_epilogue + +.align 16 +.Lctr32_bulk: + lea (%rsp),$key_ # use $key_ as frame pointer +.cfi_def_cfa_register $key_ + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8($key_) # offload everything + movaps %xmm7,-0x98($key_) + movaps %xmm8,-0x88($key_) + movaps %xmm9,-0x78($key_) + movaps %xmm10,-0x68($key_) + movaps %xmm11,-0x58($key_) + movaps %xmm12,-0x48($key_) + movaps %xmm13,-0x38($key_) + movaps %xmm14,-0x28($key_) + movaps %xmm15,-0x18($key_) +.Lctr32_body: +___ +$code.=<<___; + + # 8 16-byte words on top of stack are counter values + # xor-ed with zero-round key + + movdqu ($ivp),$inout0 + movdqu ($key),$rndkey0 + mov 12($ivp),$ctr # counter LSB + pxor $rndkey0,$inout0 + mov 12($key),$key0 # 0-round key LSB + movdqa $inout0,0x00(%rsp) # populate counter block + bswap $ctr + movdqa $inout0,$inout1 + movdqa $inout0,$inout2 + movdqa $inout0,$inout3 + movdqa $inout0,0x40(%rsp) + movdqa $inout0,0x50(%rsp) + movdqa $inout0,0x60(%rsp) + mov %rdx,%r10 # about to borrow %rdx + movdqa $inout0,0x70(%rsp) + + lea 1($ctr),%rax + lea 2($ctr),%rdx + bswap %eax + bswap %edx + xor $key0,%eax + xor $key0,%edx + pinsrd \$3,%eax,$inout1 + lea 3($ctr),%rax + movdqa $inout1,0x10(%rsp) + pinsrd \$3,%edx,$inout2 + bswap %eax + mov %r10,%rdx # restore %rdx + lea 4($ctr),%r10 + movdqa $inout2,0x20(%rsp) + xor $key0,%eax + bswap %r10d + pinsrd \$3,%eax,$inout3 + xor $key0,%r10d + movdqa $inout3,0x30(%rsp) + lea 5($ctr),%r9 + mov %r10d,0x40+12(%rsp) + bswap %r9d + lea 6($ctr),%r10 + mov 240($key),$rounds # key->rounds + xor $key0,%r9d + bswap %r10d + mov %r9d,0x50+12(%rsp) + xor $key0,%r10d + lea 7($ctr),%r9 + mov %r10d,0x60+12(%rsp) + bswap %r9d +# leaq OPENSSL_ia32cap_P(%rip),%r10 +# mov 4(%r10),%r10d + xor $key0,%r9d +# and \$`1<<26|1<<22`,%r10d # isolate XSAVE+MOVBE + mov %r9d,0x70+12(%rsp) + + $movkey 0x10($key),$rndkey1 + + movdqa 0x40(%rsp),$inout4 + movdqa 0x50(%rsp),$inout5 + + cmp \$8,$len # $len is in blocks + jb .Lctr32_tail # short input if ($len<8) + + sub \$6,$len # $len is biased by -6 +# cmp \$`1<<22`,%r10d # check for MOVBE without XSAVE +# je .Lctr32_6x # [which denotes Atom Silvermont] + + lea 0x80($key),$key # size optimization + sub \$2,$len # $len is biased by -8 + jmp .Lctr32_loop8 + +#.align 16 +#.Lctr32_6x: +# shl \$4,$rounds +# mov \$48,$rnds_ +# bswap $key0 +# lea 32($key,$rounds),$key # end of key schedule +# sub %rax,%r10 # twisted $rounds +# jmp .Lctr32_loop6 + +.align 16 +.Lctr32_loop6: + add \$6,$ctr # next counter value + $movkey -48($key,$rnds_),$rndkey0 + aesenc $rndkey1,$inout0 + mov $ctr,%eax + xor $key0,%eax + aesenc $rndkey1,$inout1 + movbe %eax,`0x00+12`(%rsp) # store next counter value + lea 1($ctr),%eax + aesenc 
$rndkey1,$inout2 + xor $key0,%eax + movbe %eax,`0x10+12`(%rsp) + aesenc $rndkey1,$inout3 + lea 2($ctr),%eax + xor $key0,%eax + aesenc $rndkey1,$inout4 + movbe %eax,`0x20+12`(%rsp) + lea 3($ctr),%eax + aesenc $rndkey1,$inout5 + $movkey -32($key,$rnds_),$rndkey1 + xor $key0,%eax + + aesenc $rndkey0,$inout0 + movbe %eax,`0x30+12`(%rsp) + lea 4($ctr),%eax + aesenc $rndkey0,$inout1 + xor $key0,%eax + movbe %eax,`0x40+12`(%rsp) + aesenc $rndkey0,$inout2 + lea 5($ctr),%eax + xor $key0,%eax + aesenc $rndkey0,$inout3 + movbe %eax,`0x50+12`(%rsp) + mov %r10,%rax # mov $rnds_,$rounds + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -16($key,$rnds_),$rndkey0 + + call .Lenc_loop6 + + movdqu ($inp),$inout6 # load 6 input blocks + movdqu 0x10($inp),$inout7 + movdqu 0x20($inp),$in0 + movdqu 0x30($inp),$in1 + movdqu 0x40($inp),$in2 + movdqu 0x50($inp),$in3 + lea 0x60($inp),$inp # $inp+=6*16 + $movkey -64($key,$rnds_),$rndkey1 + pxor $inout0,$inout6 # inp^=E(ctr) + movaps 0x00(%rsp),$inout0 # load next counter [xor-ed with 0 round] + pxor $inout1,$inout7 + movaps 0x10(%rsp),$inout1 + pxor $inout2,$in0 + movaps 0x20(%rsp),$inout2 + pxor $inout3,$in1 + movaps 0x30(%rsp),$inout3 + pxor $inout4,$in2 + movaps 0x40(%rsp),$inout4 + pxor $inout5,$in3 + movaps 0x50(%rsp),$inout5 + movdqu $inout6,($out) # store 6 output blocks + movdqu $inout7,0x10($out) + movdqu $in0,0x20($out) + movdqu $in1,0x30($out) + movdqu $in2,0x40($out) + movdqu $in3,0x50($out) + lea 0x60($out),$out # $out+=6*16 + + sub \$6,$len + jnc .Lctr32_loop6 # loop if $len-=6 didn't borrow + + add \$6,$len # restore real remaining $len + jz .Lctr32_done # done if ($len==0) + + lea -48($rnds_),$rounds + lea -80($key,$rnds_),$key # restore $key + neg $rounds + shr \$4,$rounds # restore $rounds + jmp .Lctr32_tail + +.align 32 +.Lctr32_loop8: + add \$8,$ctr # next counter value + movdqa 0x60(%rsp),$inout6 + aesenc $rndkey1,$inout0 + mov $ctr,%r9d + movdqa 0x70(%rsp),$inout7 + aesenc $rndkey1,$inout1 + bswap %r9d + $movkey 0x20-0x80($key),$rndkey0 + aesenc $rndkey1,$inout2 + xor $key0,%r9d + nop + aesenc $rndkey1,$inout3 + mov %r9d,0x00+12(%rsp) # store next counter value + lea 1($ctr),%r9 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0x30-0x80($key),$rndkey1 +___ +for($i=2;$i<8;$i++) { +my $rndkeyx = ($i&1)?$rndkey1:$rndkey0; +$code.=<<___; + bswap %r9d + aesenc $rndkeyx,$inout0 + aesenc $rndkeyx,$inout1 + xor $key0,%r9d + .byte 0x66,0x90 + aesenc $rndkeyx,$inout2 + aesenc $rndkeyx,$inout3 + mov %r9d,`0x10*($i-1)`+12(%rsp) + lea $i($ctr),%r9 + aesenc $rndkeyx,$inout4 + aesenc $rndkeyx,$inout5 + aesenc $rndkeyx,$inout6 + aesenc $rndkeyx,$inout7 + $movkey `0x20+0x10*$i`-0x80($key),$rndkeyx +___ +} +$code.=<<___; + bswap %r9d + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + xor $key0,%r9d + movdqu 0x00($inp),$in0 # start loading input + aesenc $rndkey0,$inout3 + mov %r9d,0x70+12(%rsp) + cmp \$11,$rounds + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xa0-0x80($key),$rndkey0 + + jb .Lctr32_enc_done + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0xb0-0x80($key),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc 
$rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xc0-0x80($key),$rndkey0 + je .Lctr32_enc_done + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + $movkey 0xd0-0x80($key),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + aesenc $rndkey0,$inout6 + aesenc $rndkey0,$inout7 + $movkey 0xe0-0x80($key),$rndkey0 + jmp .Lctr32_enc_done + +.align 16 +.Lctr32_enc_done: + movdqu 0x10($inp),$in1 + pxor $rndkey0,$in0 # input^=round[last] + movdqu 0x20($inp),$in2 + pxor $rndkey0,$in1 + movdqu 0x30($inp),$in3 + pxor $rndkey0,$in2 + movdqu 0x40($inp),$in4 + pxor $rndkey0,$in3 + movdqu 0x50($inp),$in5 + pxor $rndkey0,$in4 + pxor $rndkey0,$in5 + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + aesenc $rndkey1,$inout7 + movdqu 0x60($inp),$rndkey1 # borrow $rndkey1 for inp[6] + lea 0x80($inp),$inp # $inp+=8*16 + + aesenclast $in0,$inout0 # $inN is inp[N]^round[last] + pxor $rndkey0,$rndkey1 # borrowed $rndkey + movdqu 0x70-0x80($inp),$in0 + aesenclast $in1,$inout1 + pxor $rndkey0,$in0 + movdqa 0x00(%rsp),$in1 # load next counter block + aesenclast $in2,$inout2 + aesenclast $in3,$inout3 + movdqa 0x10(%rsp),$in2 + movdqa 0x20(%rsp),$in3 + aesenclast $in4,$inout4 + aesenclast $in5,$inout5 + movdqa 0x30(%rsp),$in4 + movdqa 0x40(%rsp),$in5 + aesenclast $rndkey1,$inout6 + movdqa 0x50(%rsp),$rndkey0 + $movkey 0x10-0x80($key),$rndkey1#real 1st-round key + aesenclast $in0,$inout7 + + movups $inout0,($out) # store 8 output blocks + movdqa $in1,$inout0 + movups $inout1,0x10($out) + movdqa $in2,$inout1 + movups $inout2,0x20($out) + movdqa $in3,$inout2 + movups $inout3,0x30($out) + movdqa $in4,$inout3 + movups $inout4,0x40($out) + movdqa $in5,$inout4 + movups $inout5,0x50($out) + movdqa $rndkey0,$inout5 + movups $inout6,0x60($out) + movups $inout7,0x70($out) + lea 0x80($out),$out # $out+=8*16 + + sub \$8,$len + jnc .Lctr32_loop8 # loop if $len-=8 didn't borrow + + add \$8,$len # restore real remaining $len + jz .Lctr32_done # done if ($len==0) + lea -0x80($key),$key + +.Lctr32_tail: + # note that at this point $inout0..5 are populated with + # counter values xor-ed with 0-round key + lea 16($key),$key + cmp \$4,$len + jb .Lctr32_loop3 + je .Lctr32_loop4 + + # if ($len>4) compute 7 E(counter) + shl \$4,$rounds + movdqa 0x60(%rsp),$inout6 + pxor $inout7,$inout7 + + $movkey 16($key),$rndkey0 + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + lea 32-16($key,$rounds),$key# prepare for .Lenc_loop8_enter + neg %rax + aesenc $rndkey1,$inout2 + add \$16,%rax # prepare for .Lenc_loop8_enter + movups ($inp),$in0 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + movups 0x10($inp),$in1 # pre-load input + movups 0x20($inp),$in2 + aesenc $rndkey1,$inout5 + aesenc $rndkey1,$inout6 + + call .Lenc_loop8_enter + + movdqu 0x30($inp),$in3 + pxor $in0,$inout0 + movdqu 0x40($inp),$in0 + pxor $in1,$inout1 + movdqu $inout0,($out) # store output + pxor $in2,$inout2 + movdqu $inout1,0x10($out) + pxor $in3,$inout3 + movdqu $inout2,0x20($out) + pxor $in0,$inout4 + movdqu $inout3,0x30($out) + movdqu $inout4,0x40($out) + cmp \$6,$len + jb .Lctr32_done # $len was 5, stop store + + movups 0x50($inp),$in1 + xorps 
$in1,$inout5 + movups $inout5,0x50($out) + je .Lctr32_done # $len was 6, stop store + + movups 0x60($inp),$in2 + xorps $in2,$inout6 + movups $inout6,0x60($out) + jmp .Lctr32_done # $len was 7, stop store + +.align 32 +.Lctr32_loop4: + aesenc $rndkey1,$inout0 + lea 16($key),$key + dec $rounds + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey ($key),$rndkey1 + jnz .Lctr32_loop4 + aesenclast $rndkey1,$inout0 + aesenclast $rndkey1,$inout1 + movups ($inp),$in0 # load input + movups 0x10($inp),$in1 + aesenclast $rndkey1,$inout2 + aesenclast $rndkey1,$inout3 + movups 0x20($inp),$in2 + movups 0x30($inp),$in3 + + xorps $in0,$inout0 + movups $inout0,($out) # store output + xorps $in1,$inout1 + movups $inout1,0x10($out) + pxor $in2,$inout2 + movdqu $inout2,0x20($out) + pxor $in3,$inout3 + movdqu $inout3,0x30($out) + jmp .Lctr32_done # $len was 4, stop store + +.align 32 +.Lctr32_loop3: + aesenc $rndkey1,$inout0 + lea 16($key),$key + dec $rounds + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + $movkey ($key),$rndkey1 + jnz .Lctr32_loop3 + aesenclast $rndkey1,$inout0 + aesenclast $rndkey1,$inout1 + aesenclast $rndkey1,$inout2 + + movups ($inp),$in0 # load input + xorps $in0,$inout0 + movups $inout0,($out) # store output + cmp \$2,$len + jb .Lctr32_done # $len was 1, stop store + + movups 0x10($inp),$in1 + xorps $in1,$inout1 + movups $inout1,0x10($out) + je .Lctr32_done # $len was 2, stop store + + movups 0x20($inp),$in2 + xorps $in2,$inout2 + movups $inout2,0x20($out) # $len was 3, stop store + +.Lctr32_done: + xorps %xmm0,%xmm0 # clear register bank + xor $key0,$key0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,0x70(%rsp) + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8($key_),%xmm6 + movaps %xmm0,-0xa8($key_) # clear stack + movaps -0x98($key_),%xmm7 + movaps %xmm0,-0x98($key_) + movaps -0x88($key_),%xmm8 + movaps %xmm0,-0x88($key_) + movaps -0x78($key_),%xmm9 + movaps %xmm0,-0x78($key_) + movaps -0x68($key_),%xmm10 + movaps %xmm0,-0x68($key_) + movaps -0x58($key_),%xmm11 + movaps %xmm0,-0x58($key_) + movaps -0x48($key_),%xmm12 + movaps %xmm0,-0x48($key_) + movaps -0x38($key_),%xmm13 + movaps %xmm0,-0x38($key_) + movaps -0x28($key_),%xmm14 + movaps %xmm0,-0x28($key_) + movaps -0x18($key_),%xmm15 + movaps %xmm0,-0x18($key_) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) + movaps %xmm0,0x70(%rsp) +___ +$code.=<<___; + mov -8($key_),%rbp +.cfi_restore %rbp + lea ($key_),%rsp +.cfi_def_cfa_register %rsp +.Lctr32_epilogue: + ret +.cfi_endproc +.size aesni_ctr32_encrypt_blocks,.-aesni_ctr32_encrypt_blocks +___ +} + +###################################################################### +# void aesni_xts_[en|de]crypt(const char *inp,char *out,size_t len, +# const AES_KEY *key1, const AES_KEY *key2 +# const unsigned char iv[16]); +# +if (0) { +my @tweak=map("%xmm$_",(10..15)); +my ($twmask,$twres,$twtmp)=("%xmm8","%xmm9",@tweak[4]); +my 
($key2,$ivp,$len_)=("%r8","%r9","%r9"); +my $frame_size = 0x70 + ($win64?160:0); +my $key_ = "%rbp"; # override so that we can use %r11 as FP + +$code.=<<___; +.globl aesni_xts_encrypt +.type aesni_xts_encrypt,\@function,6 +.align 16 +aesni_xts_encrypt: +.cfi_startproc + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r11) # offload everything + movaps %xmm7,-0x98(%r11) + movaps %xmm8,-0x88(%r11) + movaps %xmm9,-0x78(%r11) + movaps %xmm10,-0x68(%r11) + movaps %xmm11,-0x58(%r11) + movaps %xmm12,-0x48(%r11) + movaps %xmm13,-0x38(%r11) + movaps %xmm14,-0x28(%r11) + movaps %xmm15,-0x18(%r11) +.Lxts_enc_body: +___ +$code.=<<___; + movups ($ivp),$inout0 # load clear-text tweak + mov 240(%r8),$rounds # key2->rounds + mov 240($key),$rnds_ # key1->rounds +___ + # generate the tweak + &aesni_generate1("enc",$key2,$rounds,$inout0); +$code.=<<___; + $movkey ($key),$rndkey0 # zero round key + mov $key,$key_ # backup $key + mov $rnds_,$rounds # backup $rounds + shl \$4,$rnds_ + mov $len,$len_ # backup $len + and \$-16,$len + + $movkey 16($key,$rnds_),$rndkey1 # last round key + + movdqa .Lxts_magic(%rip),$twmask + movdqa $inout0,@tweak[5] + pshufd \$0x5f,$inout0,$twres + pxor $rndkey0,$rndkey1 +___ + # alternative tweak calculation algorithm is based on suggestions + # by Shay Gueron. psrad doesn't conflict with AES-NI instructions + # and should help in the future... + for ($i=0;$i<4;$i++) { + $code.=<<___; + movdqa $twres,$twtmp + paddd $twres,$twres + movdqa @tweak[5],@tweak[$i] + psrad \$31,$twtmp # broadcast upper bits + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + pxor $rndkey0,@tweak[$i] + pxor $twtmp,@tweak[5] +___ + } +$code.=<<___; + movdqa @tweak[5],@tweak[4] + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + pand $twmask,$twres + pxor $rndkey0,@tweak[4] + pxor $twres,@tweak[5] + movaps $rndkey1,0x60(%rsp) # save round[0]^round[last] + + sub \$16*6,$len + jc .Lxts_enc_short # if $len-=6*16 borrowed + + mov \$16+96,$rounds + lea 32($key_,$rnds_),$key # end of key schedule + sub %r10,%rax # twisted $rounds + $movkey 16($key_),$rndkey1 + mov %rax,%r10 # backup twisted $rounds + lea .Lxts_magic(%rip),%r8 + jmp .Lxts_enc_grandloop + +.align 32 +.Lxts_enc_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqa $rndkey0,$twmask + movdqu `16*1`($inp),$inout1 + pxor @tweak[0],$inout0 # input^=tweak^round[0] + movdqu `16*2`($inp),$inout2 + pxor @tweak[1],$inout1 + aesenc $rndkey1,$inout0 + movdqu `16*3`($inp),$inout3 + pxor @tweak[2],$inout2 + aesenc $rndkey1,$inout1 + movdqu `16*4`($inp),$inout4 + pxor @tweak[3],$inout3 + aesenc $rndkey1,$inout2 + movdqu `16*5`($inp),$inout5 + pxor @tweak[5],$twmask # round[0]^=tweak[5] + movdqa 0x60(%rsp),$twres # load round[0]^round[last] + pxor @tweak[4],$inout4 + aesenc $rndkey1,$inout3 + $movkey 32($key_),$rndkey0 + lea `16*6`($inp),$inp + pxor $twmask,$inout5 + + pxor $twres,@tweak[0] # calculate tweaks^round[last] + aesenc $rndkey1,$inout4 + pxor $twres,@tweak[1] + movdqa @tweak[0],`16*0`(%rsp) # put aside tweaks^round[last] + aesenc $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $twres,@tweak[2] + + aesenc $rndkey0,$inout0 + pxor $twres,@tweak[3] + movdqa @tweak[1],`16*1`(%rsp) + aesenc $rndkey0,$inout1 + pxor $twres,@tweak[4] + movdqa @tweak[2],`16*2`(%rsp) + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + pxor $twres,$twmask + movdqa @tweak[4],`16*4`(%rsp) + 
aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + movdqa $twmask,`16*5`(%rsp) + pshufd \$0x5f,@tweak[5],$twres + jmp .Lxts_enc_loop6 +.align 32 +.Lxts_enc_loop6: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey -64($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -80($key,%rax),$rndkey0 + jnz .Lxts_enc_loop6 + + movdqa (%r8),$twmask # start calculating next tweak + movdqa $twres,$twtmp + paddd $twres,$twres + aesenc $rndkey1,$inout0 + paddq @tweak[5],@tweak[5] + psrad \$31,$twtmp + aesenc $rndkey1,$inout1 + pand $twmask,$twtmp + $movkey ($key_),@tweak[0] # load round[0] + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[0],@tweak[1] # copy round[0] + aesenc $rndkey1,$inout5 + $movkey -64($key),$rndkey1 + + movdqa $twres,$twtmp + aesenc $rndkey0,$inout0 + paddd $twres,$twres + pxor @tweak[5],@tweak[0] + aesenc $rndkey0,$inout1 + psrad \$31,$twtmp + paddq @tweak[5],@tweak[5] + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + pand $twmask,$twtmp + movaps @tweak[1],@tweak[2] + aesenc $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movdqa $twres,$twtmp + aesenc $rndkey0,$inout5 + $movkey -48($key),$rndkey0 + + paddd $twres,$twres + aesenc $rndkey1,$inout0 + pxor @tweak[5],@tweak[1] + psrad \$31,$twtmp + aesenc $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + movdqa @tweak[3],`16*3`(%rsp) + pxor $twtmp,@tweak[5] + aesenc $rndkey1,$inout4 + movaps @tweak[2],@tweak[3] + movdqa $twres,$twtmp + aesenc $rndkey1,$inout5 + $movkey -32($key),$rndkey1 + + paddd $twres,$twres + aesenc $rndkey0,$inout0 + pxor @tweak[5],@tweak[2] + psrad \$31,$twtmp + aesenc $rndkey0,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[3],@tweak[4] + aesenc $rndkey0,$inout5 + + movdqa $twres,$rndkey0 + paddd $twres,$twres + aesenc $rndkey1,$inout0 + pxor @tweak[5],@tweak[3] + psrad \$31,$rndkey0 + aesenc $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$rndkey0 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + pxor $rndkey0,@tweak[5] + $movkey ($key_),$rndkey0 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + + pxor @tweak[5],@tweak[4] + aesenclast `16*0`(%rsp),$inout0 + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + aesenclast `16*1`(%rsp),$inout1 + aesenclast `16*2`(%rsp),$inout2 + pand $twmask,$twres + mov %r10,%rax # restore $rounds + aesenclast `16*3`(%rsp),$inout3 + aesenclast `16*4`(%rsp),$inout4 + aesenclast `16*5`(%rsp),$inout5 + pxor $twres,@tweak[5] + + lea `16*6`($out),$out # $out+=6*16 + movups $inout0,`-16*6`($out) # store 6 output blocks + movups $inout1,`-16*5`($out) + movups $inout2,`-16*4`($out) + movups $inout3,`-16*3`($out) + movups $inout4,`-16*2`($out) + movups $inout5,`-16*1`($out) + sub \$16*6,$len + jnc .Lxts_enc_grandloop # loop if $len-=6*16 didn't borrow + + mov \$16+96,$rounds + sub $rnds_,$rounds + mov $key_,$key # restore $key + shr \$4,$rounds # restore original value + +.Lxts_enc_short: + # at the point @tweak[0..5] are populated with tweak values + mov $rounds,$rnds_ # backup $rounds + pxor 
$rndkey0,@tweak[0] + add \$16*6,$len # restore real remaining $len + jz .Lxts_enc_done # done if ($len==0) + + pxor $rndkey0,@tweak[1] + cmp \$0x20,$len + jb .Lxts_enc_one # $len is 1*16 + pxor $rndkey0,@tweak[2] + je .Lxts_enc_two # $len is 2*16 + + pxor $rndkey0,@tweak[3] + cmp \$0x40,$len + jb .Lxts_enc_three # $len is 3*16 + pxor $rndkey0,@tweak[4] + je .Lxts_enc_four # $len is 4*16 + + movdqu ($inp),$inout0 # $len is 5*16 + movdqu 16*1($inp),$inout1 + movdqu 16*2($inp),$inout2 + pxor @tweak[0],$inout0 + movdqu 16*3($inp),$inout3 + pxor @tweak[1],$inout1 + movdqu 16*4($inp),$inout4 + lea 16*5($inp),$inp # $inp+=5*16 + pxor @tweak[2],$inout2 + pxor @tweak[3],$inout3 + pxor @tweak[4],$inout4 + pxor $inout5,$inout5 + + call _aesni_encrypt6 + + xorps @tweak[0],$inout0 + movdqa @tweak[5],@tweak[0] + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movdqu $inout0,($out) # store 5 output blocks + xorps @tweak[3],$inout3 + movdqu $inout1,16*1($out) + xorps @tweak[4],$inout4 + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + movdqu $inout4,16*4($out) + lea 16*5($out),$out # $out+=5*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_one: + movups ($inp),$inout0 + lea 16*1($inp),$inp # inp+=1*16 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movdqa @tweak[1],@tweak[0] + movups $inout0,($out) # store one output block + lea 16*1($out),$out # $out+=1*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_two: + movups ($inp),$inout0 + movups 16($inp),$inout1 + lea 32($inp),$inp # $inp+=2*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + + call _aesni_encrypt2 + + xorps @tweak[0],$inout0 + movdqa @tweak[2],@tweak[0] + xorps @tweak[1],$inout1 + movups $inout0,($out) # store 2 output blocks + movups $inout1,16*1($out) + lea 16*2($out),$out # $out+=2*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_three: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + lea 16*3($inp),$inp # $inp+=3*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + + call _aesni_encrypt3 + + xorps @tweak[0],$inout0 + movdqa @tweak[3],@tweak[0] + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movups $inout0,($out) # store 3 output blocks + movups $inout1,16*1($out) + movups $inout2,16*2($out) + lea 16*3($out),$out # $out+=3*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_four: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + xorps @tweak[0],$inout0 + movups 16*3($inp),$inout3 + lea 16*4($inp),$inp # $inp+=4*16 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + xorps @tweak[3],$inout3 + + call _aesni_encrypt4 + + pxor @tweak[0],$inout0 + movdqa @tweak[4],@tweak[0] + pxor @tweak[1],$inout1 + pxor @tweak[2],$inout2 + movdqu $inout0,($out) # store 4 output blocks + pxor @tweak[3],$inout3 + movdqu $inout1,16*1($out) + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + lea 16*4($out),$out # $out+=4*16 + jmp .Lxts_enc_done + +.align 16 +.Lxts_enc_done: + and \$15,$len_ # see if $len%16 is 0 + jz .Lxts_enc_ret + mov $len_,$len + +.Lxts_enc_steal: + movzb ($inp),%eax # borrow $rounds ... + movzb -16($out),%ecx # ... 
and $key + lea 1($inp),$inp + mov %al,-16($out) + mov %cl,0($out) + lea 1($out),$out + sub \$1,$len + jnz .Lxts_enc_steal + + sub $len_,$out # rewind $out + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups -16($out),$inout0 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("enc",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movups $inout0,-16($out) + +.Lxts_enc_ret: + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r11),%xmm6 + movaps %xmm0,-0xa8(%r11) # clear stack + movaps -0x98(%r11),%xmm7 + movaps %xmm0,-0x98(%r11) + movaps -0x88(%r11),%xmm8 + movaps %xmm0,-0x88(%r11) + movaps -0x78(%r11),%xmm9 + movaps %xmm0,-0x78(%r11) + movaps -0x68(%r11),%xmm10 + movaps %xmm0,-0x68(%r11) + movaps -0x58(%r11),%xmm11 + movaps %xmm0,-0x58(%r11) + movaps -0x48(%r11),%xmm12 + movaps %xmm0,-0x48(%r11) + movaps -0x38(%r11),%xmm13 + movaps %xmm0,-0x38(%r11) + movaps -0x28(%r11),%xmm14 + movaps %xmm0,-0x28(%r11) + movaps -0x18(%r11),%xmm15 + movaps %xmm0,-0x18(%r11) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lxts_enc_epilogue: + ret +.cfi_endproc +.size aesni_xts_encrypt,.-aesni_xts_encrypt +___ + +$code.=<<___; +.globl aesni_xts_decrypt +.type aesni_xts_decrypt,\@function,6 +.align 16 +aesni_xts_decrypt: +.cfi_startproc + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r11) # offload everything + movaps %xmm7,-0x98(%r11) + movaps %xmm8,-0x88(%r11) + movaps %xmm9,-0x78(%r11) + movaps %xmm10,-0x68(%r11) + movaps %xmm11,-0x58(%r11) + movaps %xmm12,-0x48(%r11) + movaps %xmm13,-0x38(%r11) + movaps %xmm14,-0x28(%r11) + movaps %xmm15,-0x18(%r11) +.Lxts_dec_body: +___ +$code.=<<___; + movups ($ivp),$inout0 # load clear-text tweak + mov 240($key2),$rounds # key2->rounds + mov 240($key),$rnds_ # key1->rounds +___ + # generate the tweak + &aesni_generate1("enc",$key2,$rounds,$inout0); +$code.=<<___; + xor %eax,%eax # if ($len%16) len-=16; + test \$15,$len + setnz %al + shl \$4,%rax + sub %rax,$len + + $movkey ($key),$rndkey0 # zero round key + mov $key,$key_ # backup $key + mov $rnds_,$rounds # backup $rounds + shl \$4,$rnds_ + mov $len,$len_ # backup $len + and \$-16,$len + + $movkey 16($key,$rnds_),$rndkey1 # last round key + + movdqa .Lxts_magic(%rip),$twmask + movdqa $inout0,@tweak[5] + pshufd \$0x5f,$inout0,$twres + pxor $rndkey0,$rndkey1 +___ + for ($i=0;$i<4;$i++) { + $code.=<<___; + movdqa $twres,$twtmp + paddd $twres,$twres + movdqa @tweak[5],@tweak[$i] + psrad \$31,$twtmp # broadcast upper bits + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + pxor $rndkey0,@tweak[$i] + pxor 
$twtmp,@tweak[5] +___ + } +$code.=<<___; + movdqa @tweak[5],@tweak[4] + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + pand $twmask,$twres + pxor $rndkey0,@tweak[4] + pxor $twres,@tweak[5] + movaps $rndkey1,0x60(%rsp) # save round[0]^round[last] + + sub \$16*6,$len + jc .Lxts_dec_short # if $len-=6*16 borrowed + + mov \$16+96,$rounds + lea 32($key_,$rnds_),$key # end of key schedule + sub %r10,%rax # twisted $rounds + $movkey 16($key_),$rndkey1 + mov %rax,%r10 # backup twisted $rounds + lea .Lxts_magic(%rip),%r8 + jmp .Lxts_dec_grandloop + +.align 32 +.Lxts_dec_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqa $rndkey0,$twmask + movdqu `16*1`($inp),$inout1 + pxor @tweak[0],$inout0 # input^=tweak^round[0] + movdqu `16*2`($inp),$inout2 + pxor @tweak[1],$inout1 + aesdec $rndkey1,$inout0 + movdqu `16*3`($inp),$inout3 + pxor @tweak[2],$inout2 + aesdec $rndkey1,$inout1 + movdqu `16*4`($inp),$inout4 + pxor @tweak[3],$inout3 + aesdec $rndkey1,$inout2 + movdqu `16*5`($inp),$inout5 + pxor @tweak[5],$twmask # round[0]^=tweak[5] + movdqa 0x60(%rsp),$twres # load round[0]^round[last] + pxor @tweak[4],$inout4 + aesdec $rndkey1,$inout3 + $movkey 32($key_),$rndkey0 + lea `16*6`($inp),$inp + pxor $twmask,$inout5 + + pxor $twres,@tweak[0] # calculate tweaks^round[last] + aesdec $rndkey1,$inout4 + pxor $twres,@tweak[1] + movdqa @tweak[0],`16*0`(%rsp) # put aside tweaks^last round key + aesdec $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $twres,@tweak[2] + + aesdec $rndkey0,$inout0 + pxor $twres,@tweak[3] + movdqa @tweak[1],`16*1`(%rsp) + aesdec $rndkey0,$inout1 + pxor $twres,@tweak[4] + movdqa @tweak[2],`16*2`(%rsp) + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + pxor $twres,$twmask + movdqa @tweak[4],`16*4`(%rsp) + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + movdqa $twmask,`16*5`(%rsp) + pshufd \$0x5f,@tweak[5],$twres + jmp .Lxts_dec_loop6 +.align 32 +.Lxts_dec_loop6: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey -64($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey -80($key,%rax),$rndkey0 + jnz .Lxts_dec_loop6 + + movdqa (%r8),$twmask # start calculating next tweak + movdqa $twres,$twtmp + paddd $twres,$twres + aesdec $rndkey1,$inout0 + paddq @tweak[5],@tweak[5] + psrad \$31,$twtmp + aesdec $rndkey1,$inout1 + pand $twmask,$twtmp + $movkey ($key_),@tweak[0] # load round[0] + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[0],@tweak[1] # copy round[0] + aesdec $rndkey1,$inout5 + $movkey -64($key),$rndkey1 + + movdqa $twres,$twtmp + aesdec $rndkey0,$inout0 + paddd $twres,$twres + pxor @tweak[5],@tweak[0] + aesdec $rndkey0,$inout1 + psrad \$31,$twtmp + paddq @tweak[5],@tweak[5] + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + pand $twmask,$twtmp + movaps @tweak[1],@tweak[2] + aesdec $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movdqa $twres,$twtmp + aesdec $rndkey0,$inout5 + $movkey -48($key),$rndkey0 + + paddd $twres,$twres + aesdec $rndkey1,$inout0 + pxor @tweak[5],@tweak[1] + psrad \$31,$twtmp + aesdec $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + movdqa @tweak[3],`16*3`(%rsp) + pxor $twtmp,@tweak[5] + aesdec $rndkey1,$inout4 +
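The pshufd/psrad/pand/paddq clusters interleaved with the aesdec rounds here compute the six tweaks for the next iteration while the cipher pipeline is busy. Each update multiplies the running tweak by x in GF(2^128): paddq doubles both 64-bit halves, and a mask built from the replicated sign bits (pshufd \$0x5f, then psrad \$31) ANDed with .Lxts_magic (.long 0x87,0,1,0) supplies both the carry between the halves and the 0x87 reduction. A minimal C sketch of the same update, assuming the tweak is held as a little-endian pair of 64-bit halves (the helper name is ours, not part of this file):

    #include <stdint.h>

    /* Multiply an XTS tweak by x in GF(2^128): shift the 128-bit value left
     * by one bit; if a bit falls off the top, fold it back in as 0x87, the
     * reduction for x^128 + x^7 + x^2 + x + 1 (the .Lxts_magic constant). */
    static void xts_mul_x(uint64_t t[2])        /* t[0] low, t[1] high */
    {
        uint64_t carry = t[1] >> 63;            /* bit 127, about to fall out */
        t[1] = (t[1] << 1) | (t[0] >> 63);      /* 128-bit left shift */
        t[0] = (t[0] << 1) ^ (carry * 0x87);    /* branch-free reduction */
    }

The vector code reaches the same result without a branch: the sign-bit mask selects 0x87 (and the cross-half carry bit) only when the corresponding top bit was set.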
movaps @tweak[2],@tweak[3] + movdqa $twres,$twtmp + aesdec $rndkey1,$inout5 + $movkey -32($key),$rndkey1 + + paddd $twres,$twres + aesdec $rndkey0,$inout0 + pxor @tweak[5],@tweak[2] + psrad \$31,$twtmp + aesdec $rndkey0,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$twtmp + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + pxor $twtmp,@tweak[5] + movaps @tweak[3],@tweak[4] + aesdec $rndkey0,$inout5 + + movdqa $twres,$rndkey0 + paddd $twres,$twres + aesdec $rndkey1,$inout0 + pxor @tweak[5],@tweak[3] + psrad \$31,$rndkey0 + aesdec $rndkey1,$inout1 + paddq @tweak[5],@tweak[5] + pand $twmask,$rndkey0 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0,@tweak[5] + $movkey ($key_),$rndkey0 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + + pxor @tweak[5],@tweak[4] + aesdeclast `16*0`(%rsp),$inout0 + psrad \$31,$twres + paddq @tweak[5],@tweak[5] + aesdeclast `16*1`(%rsp),$inout1 + aesdeclast `16*2`(%rsp),$inout2 + pand $twmask,$twres + mov %r10,%rax # restore $rounds + aesdeclast `16*3`(%rsp),$inout3 + aesdeclast `16*4`(%rsp),$inout4 + aesdeclast `16*5`(%rsp),$inout5 + pxor $twres,@tweak[5] + + lea `16*6`($out),$out # $out+=6*16 + movups $inout0,`-16*6`($out) # store 6 output blocks + movups $inout1,`-16*5`($out) + movups $inout2,`-16*4`($out) + movups $inout3,`-16*3`($out) + movups $inout4,`-16*2`($out) + movups $inout5,`-16*1`($out) + sub \$16*6,$len + jnc .Lxts_dec_grandloop # loop if $len-=6*16 didn't borrow + + mov \$16+96,$rounds + sub $rnds_,$rounds + mov $key_,$key # restore $key + shr \$4,$rounds # restore original value + +.Lxts_dec_short: + # at the point @tweak[0..5] are populated with tweak values + mov $rounds,$rnds_ # backup $rounds + pxor $rndkey0,@tweak[0] + pxor $rndkey0,@tweak[1] + add \$16*6,$len # restore real remaining $len + jz .Lxts_dec_done # done if ($len==0) + + pxor $rndkey0,@tweak[2] + cmp \$0x20,$len + jb .Lxts_dec_one # $len is 1*16 + pxor $rndkey0,@tweak[3] + je .Lxts_dec_two # $len is 2*16 + + pxor $rndkey0,@tweak[4] + cmp \$0x40,$len + jb .Lxts_dec_three # $len is 3*16 + je .Lxts_dec_four # $len is 4*16 + + movdqu ($inp),$inout0 # $len is 5*16 + movdqu 16*1($inp),$inout1 + movdqu 16*2($inp),$inout2 + pxor @tweak[0],$inout0 + movdqu 16*3($inp),$inout3 + pxor @tweak[1],$inout1 + movdqu 16*4($inp),$inout4 + lea 16*5($inp),$inp # $inp+=5*16 + pxor @tweak[2],$inout2 + pxor @tweak[3],$inout3 + pxor @tweak[4],$inout4 + + call _aesni_decrypt6 + + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + movdqu $inout0,($out) # store 5 output blocks + xorps @tweak[3],$inout3 + movdqu $inout1,16*1($out) + xorps @tweak[4],$inout4 + movdqu $inout2,16*2($out) + pxor $twtmp,$twtmp + movdqu $inout3,16*3($out) + pcmpgtd @tweak[5],$twtmp + movdqu $inout4,16*4($out) + lea 16*5($out),$out # $out+=5*16 + pshufd \$0x13,$twtmp,@tweak[1] # $twres + and \$15,$len_ + jz .Lxts_dec_ret + + movdqa @tweak[5],@tweak[0] + paddq @tweak[5],@tweak[5] # psllq 1,$tweak + pand $twmask,@tweak[1] # isolate carry and residue + pxor @tweak[5],@tweak[1] + jmp .Lxts_dec_done2 + +.align 16 +.Lxts_dec_one: + movups ($inp),$inout0 + lea 16*1($inp),$inp # $inp+=1*16 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movdqa @tweak[1],@tweak[0] + movups $inout0,($out) # store one output block + movdqa @tweak[2],@tweak[1] + lea 16*1($out),$out # $out+=1*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_two: + movups ($inp),$inout0 + movups 
16($inp),$inout1 + lea 32($inp),$inp # $inp+=2*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + + call _aesni_decrypt2 + + xorps @tweak[0],$inout0 + movdqa @tweak[2],@tweak[0] + xorps @tweak[1],$inout1 + movdqa @tweak[3],@tweak[1] + movups $inout0,($out) # store 2 output blocks + movups $inout1,16*1($out) + lea 16*2($out),$out # $out+=2*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_three: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + lea 16*3($inp),$inp # $inp+=3*16 + xorps @tweak[0],$inout0 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + + call _aesni_decrypt3 + + xorps @tweak[0],$inout0 + movdqa @tweak[3],@tweak[0] + xorps @tweak[1],$inout1 + movdqa @tweak[4],@tweak[1] + xorps @tweak[2],$inout2 + movups $inout0,($out) # store 3 output blocks + movups $inout1,16*1($out) + movups $inout2,16*2($out) + lea 16*3($out),$out # $out+=3*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_four: + movups ($inp),$inout0 + movups 16*1($inp),$inout1 + movups 16*2($inp),$inout2 + xorps @tweak[0],$inout0 + movups 16*3($inp),$inout3 + lea 16*4($inp),$inp # $inp+=4*16 + xorps @tweak[1],$inout1 + xorps @tweak[2],$inout2 + xorps @tweak[3],$inout3 + + call _aesni_decrypt4 + + pxor @tweak[0],$inout0 + movdqa @tweak[4],@tweak[0] + pxor @tweak[1],$inout1 + movdqa @tweak[5],@tweak[1] + pxor @tweak[2],$inout2 + movdqu $inout0,($out) # store 4 output blocks + pxor @tweak[3],$inout3 + movdqu $inout1,16*1($out) + movdqu $inout2,16*2($out) + movdqu $inout3,16*3($out) + lea 16*4($out),$out # $out+=4*16 + jmp .Lxts_dec_done + +.align 16 +.Lxts_dec_done: + and \$15,$len_ # see if $len%16 is 0 + jz .Lxts_dec_ret +.Lxts_dec_done2: + mov $len_,$len + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups ($inp),$inout0 + xorps @tweak[1],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[1],$inout0 + movups $inout0,($out) + +.Lxts_dec_steal: + movzb 16($inp),%eax # borrow $rounds ... + movzb ($out),%ecx # ... 
and $key + lea 1($inp),$inp + mov %al,($out) + mov %cl,16($out) + lea 1($out),$out + sub \$1,$len + jnz .Lxts_dec_steal + + sub $len_,$out # rewind $out + mov $key_,$key # restore $key + mov $rnds_,$rounds # restore $rounds + + movups ($out),$inout0 + xorps @tweak[0],$inout0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps @tweak[0],$inout0 + movups $inout0,($out) + +.Lxts_dec_ret: + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0x00(%rsp) # clear stack + pxor %xmm8,%xmm8 + movaps %xmm0,0x10(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,0x20(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,0x30(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,0x40(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,0x50(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,0x60(%rsp) + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r11),%xmm6 + movaps %xmm0,-0xa8(%r11) # clear stack + movaps -0x98(%r11),%xmm7 + movaps %xmm0,-0x98(%r11) + movaps -0x88(%r11),%xmm8 + movaps %xmm0,-0x88(%r11) + movaps -0x78(%r11),%xmm9 + movaps %xmm0,-0x78(%r11) + movaps -0x68(%r11),%xmm10 + movaps %xmm0,-0x68(%r11) + movaps -0x58(%r11),%xmm11 + movaps %xmm0,-0x58(%r11) + movaps -0x48(%r11),%xmm12 + movaps %xmm0,-0x48(%r11) + movaps -0x38(%r11),%xmm13 + movaps %xmm0,-0x38(%r11) + movaps -0x28(%r11),%xmm14 + movaps %xmm0,-0x28(%r11) + movaps -0x18(%r11),%xmm15 + movaps %xmm0,-0x18(%r11) + movaps %xmm0,0x00(%rsp) + movaps %xmm0,0x10(%rsp) + movaps %xmm0,0x20(%rsp) + movaps %xmm0,0x30(%rsp) + movaps %xmm0,0x40(%rsp) + movaps %xmm0,0x50(%rsp) + movaps %xmm0,0x60(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lxts_dec_epilogue: + ret +.cfi_endproc +.size aesni_xts_decrypt,.-aesni_xts_decrypt +___ +} + +###################################################################### +# void aesni_ocb_[en|de]crypt(const char *inp, char *out, size_t blocks, +# const AES_KEY *key, unsigned int start_block_num, +# unsigned char offset_i[16], const unsigned char L_[][16], +# unsigned char checksum[16]); +# +if (0) { +my @offset=map("%xmm$_",(10..15)); +my ($checksum,$rndkey0l)=("%xmm8","%xmm9"); +my ($block_num,$offset_p)=("%r8","%r9"); # 5th and 6th arguments +my ($L_p,$checksum_p) = ("%rbx","%rbp"); +my ($i1,$i3,$i5) = ("%r12","%r13","%r14"); +my $seventh_arg = $win64 ? 
56 : 8; +my $blocks = $len; + +$code.=<<___; +.globl aesni_ocb_encrypt +.type aesni_ocb_encrypt,\@function,6 +.align 32 +aesni_ocb_encrypt: +.cfi_startproc + lea (%rsp),%rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 +___ +$code.=<<___ if ($win64); + lea -0xa0(%rsp),%rsp + movaps %xmm6,0x00(%rsp) # offload everything + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) + movaps %xmm10,0x40(%rsp) + movaps %xmm11,0x50(%rsp) + movaps %xmm12,0x60(%rsp) + movaps %xmm13,0x70(%rsp) + movaps %xmm14,0x80(%rsp) + movaps %xmm15,0x90(%rsp) +.Locb_enc_body: +___ +$code.=<<___; + mov $seventh_arg(%rax),$L_p # 7th argument + mov $seventh_arg+8(%rax),$checksum_p# 8th argument + + mov 240($key),$rnds_ + mov $key,$key_ + shl \$4,$rnds_ + $movkey ($key),$rndkey0l # round[0] + $movkey 16($key,$rnds_),$rndkey1 # round[last] + + movdqu ($offset_p),@offset[5] # load last offset_i + pxor $rndkey1,$rndkey0l # round[0] ^ round[last] + pxor $rndkey1,@offset[5] # offset_i ^ round[last] + + mov \$16+32,$rounds + lea 32($key_,$rnds_),$key + $movkey 16($key_),$rndkey1 # round[1] + sub %r10,%rax # twisted $rounds + mov %rax,%r10 # backup twisted $rounds + + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + movdqu ($checksum_p),$checksum # load checksum + + test \$1,$block_num # is first block number odd? + jnz .Locb_enc_odd + + bsf $block_num,$i1 + add \$1,$block_num + shl \$4,$i1 + movdqu ($L_p,$i1),$inout5 # borrow + movdqu ($inp),$inout0 + lea 16($inp),$inp + + call __ocb_encrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,($out) + lea 16($out),$out + sub \$1,$blocks + jz .Locb_enc_done + +.Locb_enc_odd: + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + lea 6($block_num),$block_num + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + shl \$4,$i5 + + sub \$6,$blocks + jc .Locb_enc_short + jmp .Locb_enc_grandloop + +.align 32 +.Locb_enc_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqu `16*1`($inp),$inout1 + movdqu `16*2`($inp),$inout2 + movdqu `16*3`($inp),$inout3 + movdqu `16*4`($inp),$inout4 + movdqu `16*5`($inp),$inout5 + lea `16*6`($inp),$inp + + call __ocb_encrypt6 + + movups $inout0,`16*0`($out) # store output + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + movups $inout4,`16*4`($out) + movups $inout5,`16*5`($out) + lea `16*6`($out),$out + sub \$6,$blocks + jnc .Locb_enc_grandloop + +.Locb_enc_short: + add \$6,$blocks + jz .Locb_enc_done + + movdqu `16*0`($inp),$inout0 + cmp \$2,$blocks + jb .Locb_enc_one + movdqu `16*1`($inp),$inout1 + je .Locb_enc_two + + movdqu `16*2`($inp),$inout2 + cmp \$4,$blocks + jb .Locb_enc_three + movdqu `16*3`($inp),$inout3 + je .Locb_enc_four + + movdqu `16*4`($inp),$inout4 + pxor $inout5,$inout5 + + call __ocb_encrypt6 + + movdqa @offset[4],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + movups $inout4,`16*4`($out) + + jmp .Locb_enc_done + +.align 16 +.Locb_enc_one: + movdqa @offset[0],$inout5 # borrow + + call __ocb_encrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,`16*0`($out) + jmp .Locb_enc_done + +.align 16 +.Locb_enc_two: + pxor $inout2,$inout2 + pxor $inout3,$inout3 + + call __ocb_encrypt4 + + movdqa @offset[1],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + + jmp 
.Locb_enc_done + +.align 16 +.Locb_enc_three: + pxor $inout3,$inout3 + + call __ocb_encrypt4 + + movdqa @offset[2],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + + jmp .Locb_enc_done + +.align 16 +.Locb_enc_four: + call __ocb_encrypt4 + + movdqa @offset[3],@offset[5] + movups $inout0,`16*0`($out) + movups $inout1,`16*1`($out) + movups $inout2,`16*2`($out) + movups $inout3,`16*3`($out) + +.Locb_enc_done: + pxor $rndkey0,@offset[5] # "remove" round[last] + movdqu $checksum,($checksum_p) # store checksum + movdqu @offset[5],($offset_p) # store last offset_i + + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + pxor %xmm8,%xmm8 + pxor %xmm9,%xmm9 + pxor %xmm10,%xmm10 + pxor %xmm11,%xmm11 + pxor %xmm12,%xmm12 + pxor %xmm13,%xmm13 + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 + lea 0x28(%rsp),%rax +.cfi_def_cfa %rax,8 +___ +$code.=<<___ if ($win64); + movaps 0x00(%rsp),%xmm6 + movaps %xmm0,0x00(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm10 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm11 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm12 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm13 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm14 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm15 + movaps %xmm0,0x90(%rsp) + lea 0xa0+0x28(%rsp),%rax +.Locb_enc_pop: +___ +$code.=<<___; + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp +.cfi_def_cfa_register %rsp +.Locb_enc_epilogue: + ret +.cfi_endproc +.size aesni_ocb_encrypt,.-aesni_ocb_encrypt + +.type __ocb_encrypt6,\@abi-omnipotent +.align 32 +__ocb_encrypt6: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + movdqa @offset[0],@offset[4] + pxor @offset[5],@offset[0] + movdqu ($L_p,$i5),@offset[5] + pxor @offset[0],@offset[1] + pxor $inout0,$checksum # accumulate checksum + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor $inout1,$checksum + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor $inout2,$checksum + pxor @offset[2],$inout2 + pxor @offset[3],@offset[4] + pxor $inout3,$checksum + pxor @offset[3],$inout3 + pxor @offset[4],@offset[5] + pxor $inout4,$checksum + pxor @offset[4],$inout4 + pxor $inout5,$checksum + pxor @offset[5],$inout5 + $movkey 32($key_),$rndkey0 + + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + add \$6,$block_num + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + aesenc $rndkey1,$inout4 + pxor $rndkey0l,@offset[3] + pxor $rndkey0l,@offset[4] + aesenc $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,@offset[5] + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + 
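The offset schedule maintained by __ocb_encrypt6 is OCB's Offset_i = Offset_{i-1} ^ L[ntz(i)], where ntz(i) is the number of trailing zero bits of the 1-based block number: bsf computes ntz and shl \$4 scales it into a 16-byte index into the caller-supplied L table ($L_p). Only the three even-numbered positions in each batch of six need a bsf, because ntz of an odd number is zero, so the odd-numbered blocks all take L_0 (kept in @offset[0]). One step of the chain in C, with a hypothetical helper name and the GCC/Clang ctz builtin standing in for bsf:

    #include <stdint.h>

    static void ocb_next_offset(uint8_t offset[16],
                                const uint8_t L[][16],
                                uint64_t i)            /* 1-based block number */
    {
        unsigned ntz = (unsigned)__builtin_ctzll(i);   /* the bsf above */
        for (int b = 0; b < 16; b++)                   /* Offset ^= L[ntz(i)] */
            offset[b] ^= L[ntz][b];
    }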
$movkey 64($key_),$rndkey0 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + jmp .Locb_enc_loop6 + +.align 32 +.Locb_enc_loop6: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + aesenc $rndkey0,$inout4 + aesenc $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop6 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + aesenc $rndkey1,$inout4 + aesenc $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + shl \$4,$i5 + + aesenclast @offset[0],$inout0 + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + mov %r10,%rax # restore twisted rounds + aesenclast @offset[1],$inout1 + aesenclast @offset[2],$inout2 + aesenclast @offset[3],$inout3 + aesenclast @offset[4],$inout4 + aesenclast @offset[5],$inout5 + ret +.size __ocb_encrypt6,.-__ocb_encrypt6 + +.type __ocb_encrypt4,\@abi-omnipotent +.align 32 +__ocb_encrypt4: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + pxor @offset[5],@offset[0] + pxor @offset[0],@offset[1] + pxor $inout0,$checksum # accumulate checksum + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor $inout1,$checksum + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor $inout2,$checksum + pxor @offset[2],$inout2 + pxor $inout3,$checksum + pxor @offset[3],$inout3 + $movkey 32($key_),$rndkey0 + + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + pxor $rndkey0l,@offset[3] + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey 48($key_),$rndkey1 + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + $movkey 64($key_),$rndkey0 + jmp .Locb_enc_loop4 + +.align 32 +.Locb_enc_loop4: + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + aesenc $rndkey0,$inout1 + aesenc $rndkey0,$inout2 + aesenc $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop4 + + aesenc $rndkey1,$inout0 + aesenc $rndkey1,$inout1 + aesenc $rndkey1,$inout2 + aesenc $rndkey1,$inout3 + $movkey 16($key_),$rndkey1 + mov %r10,%rax # restore twisted rounds + + aesenclast @offset[0],$inout0 + aesenclast @offset[1],$inout1 + aesenclast @offset[2],$inout2 + aesenclast @offset[3],$inout3 + ret +.size __ocb_encrypt4,.-__ocb_encrypt4 + +.type __ocb_encrypt1,\@abi-omnipotent +.align 32 +__ocb_encrypt1: + pxor @offset[5],$inout5 # offset_i + pxor $rndkey0l,$inout5 # offset_i ^ round[0] + pxor $inout0,$checksum # accumulate checksum + pxor $inout5,$inout0 # input ^ round[0] ^ offset_i + $movkey 32($key_),$rndkey0 + + aesenc $rndkey1,$inout0 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,$inout5 # offset_i ^ round[last] + + aesenc $rndkey0,$inout0 + $movkey 64($key_),$rndkey0 + jmp .Locb_enc_loop1 + +.align 32 +.Locb_enc_loop1: + aesenc $rndkey1,$inout0 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesenc $rndkey0,$inout0 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_enc_loop1 + + aesenc $rndkey1,$inout0 + $movkey 16($key_),$rndkey1 # 
redundant in tail + mov %r10,%rax # restore twisted rounds + + aesenclast $inout5,$inout0 + ret +.size __ocb_encrypt1,.-__ocb_encrypt1 + +.globl aesni_ocb_decrypt +.type aesni_ocb_decrypt,\@function,6 +.align 32 +aesni_ocb_decrypt: +.cfi_startproc + lea (%rsp),%rax + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 +___ +$code.=<<___ if ($win64); + lea -0xa0(%rsp),%rsp + movaps %xmm6,0x00(%rsp) # offload everything + movaps %xmm7,0x10(%rsp) + movaps %xmm8,0x20(%rsp) + movaps %xmm9,0x30(%rsp) + movaps %xmm10,0x40(%rsp) + movaps %xmm11,0x50(%rsp) + movaps %xmm12,0x60(%rsp) + movaps %xmm13,0x70(%rsp) + movaps %xmm14,0x80(%rsp) + movaps %xmm15,0x90(%rsp) +.Locb_dec_body: +___ +$code.=<<___; + mov $seventh_arg(%rax),$L_p # 7th argument + mov $seventh_arg+8(%rax),$checksum_p# 8th argument + + mov 240($key),$rnds_ + mov $key,$key_ + shl \$4,$rnds_ + $movkey ($key),$rndkey0l # round[0] + $movkey 16($key,$rnds_),$rndkey1 # round[last] + + movdqu ($offset_p),@offset[5] # load last offset_i + pxor $rndkey1,$rndkey0l # round[0] ^ round[last] + pxor $rndkey1,@offset[5] # offset_i ^ round[last] + + mov \$16+32,$rounds + lea 32($key_,$rnds_),$key + $movkey 16($key_),$rndkey1 # round[1] + sub %r10,%rax # twisted $rounds + mov %rax,%r10 # backup twisted $rounds + + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + movdqu ($checksum_p),$checksum # load checksum + + test \$1,$block_num # is first block number odd? + jnz .Locb_dec_odd + + bsf $block_num,$i1 + add \$1,$block_num + shl \$4,$i1 + movdqu ($L_p,$i1),$inout5 # borrow + movdqu ($inp),$inout0 + lea 16($inp),$inp + + call __ocb_decrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,($out) + xorps $inout0,$checksum # accumulate checksum + lea 16($out),$out + sub \$1,$blocks + jz .Locb_dec_done + +.Locb_dec_odd: + lea 1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + lea 6($block_num),$block_num + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + shl \$4,$i5 + + sub \$6,$blocks + jc .Locb_dec_short + jmp .Locb_dec_grandloop + +.align 32 +.Locb_dec_grandloop: + movdqu `16*0`($inp),$inout0 # load input + movdqu `16*1`($inp),$inout1 + movdqu `16*2`($inp),$inout2 + movdqu `16*3`($inp),$inout3 + movdqu `16*4`($inp),$inout4 + movdqu `16*5`($inp),$inout5 + lea `16*6`($inp),$inp + + call __ocb_decrypt6 + + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + pxor $inout3,$checksum + movups $inout4,`16*4`($out) + pxor $inout4,$checksum + movups $inout5,`16*5`($out) + pxor $inout5,$checksum + lea `16*6`($out),$out + sub \$6,$blocks + jnc .Locb_dec_grandloop + +.Locb_dec_short: + add \$6,$blocks + jz .Locb_dec_done + + movdqu `16*0`($inp),$inout0 + cmp \$2,$blocks + jb .Locb_dec_one + movdqu `16*1`($inp),$inout1 + je .Locb_dec_two + + movdqu `16*2`($inp),$inout2 + cmp \$4,$blocks + jb .Locb_dec_three + movdqu `16*3`($inp),$inout3 + je .Locb_dec_four + + movdqu `16*4`($inp),$inout4 + pxor $inout5,$inout5 + + call __ocb_decrypt6 + + movdqa @offset[4],@offset[5] + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + 
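OCB's checksum is simply the XOR of all plaintext blocks. On the encrypt side the __ocb_encrypt* helpers fold each block into $checksum before it is enciphered; on the decrypt side the plaintext only exists after the cipher has run, which is why each store in this path is paired with a pxor or xorps of the freshly decrypted block into $checksum, as the instructions that follow keep doing. The equivalent C, as a hypothetical helper:

    #include <stdint.h>

    static void ocb_checksum_add(uint8_t checksum[16], const uint8_t pt[16])
    {
        for (int b = 0; b < 16; b++)    /* Checksum = Checksum ^ P_i */
            checksum[b] ^= pt[b];
    }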
pxor $inout3,$checksum + movups $inout4,`16*4`($out) + pxor $inout4,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_one: + movdqa @offset[0],$inout5 # borrow + + call __ocb_decrypt1 + + movdqa $inout5,@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + jmp .Locb_dec_done + +.align 16 +.Locb_dec_two: + pxor $inout2,$inout2 + pxor $inout3,$inout3 + + call __ocb_decrypt4 + + movdqa @offset[1],@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + xorps $inout1,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_three: + pxor $inout3,$inout3 + + call __ocb_decrypt4 + + movdqa @offset[2],@offset[5] + movups $inout0,`16*0`($out) # store output + xorps $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + xorps $inout1,$checksum + movups $inout2,`16*2`($out) + xorps $inout2,$checksum + + jmp .Locb_dec_done + +.align 16 +.Locb_dec_four: + call __ocb_decrypt4 + + movdqa @offset[3],@offset[5] + movups $inout0,`16*0`($out) # store output + pxor $inout0,$checksum # accumulate checksum + movups $inout1,`16*1`($out) + pxor $inout1,$checksum + movups $inout2,`16*2`($out) + pxor $inout2,$checksum + movups $inout3,`16*3`($out) + pxor $inout3,$checksum + +.Locb_dec_done: + pxor $rndkey0,@offset[5] # "remove" round[last] + movdqu $checksum,($checksum_p) # store checksum + movdqu @offset[5],($offset_p) # store last offset_i + + xorps %xmm0,%xmm0 # clear register bank + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 +___ +$code.=<<___ if (!$win64); + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + pxor %xmm8,%xmm8 + pxor %xmm9,%xmm9 + pxor %xmm10,%xmm10 + pxor %xmm11,%xmm11 + pxor %xmm12,%xmm12 + pxor %xmm13,%xmm13 + pxor %xmm14,%xmm14 + pxor %xmm15,%xmm15 + lea 0x28(%rsp),%rax +.cfi_def_cfa %rax,8 +___ +$code.=<<___ if ($win64); + movaps 0x00(%rsp),%xmm6 + movaps %xmm0,0x00(%rsp) # clear stack + movaps 0x10(%rsp),%xmm7 + movaps %xmm0,0x10(%rsp) + movaps 0x20(%rsp),%xmm8 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm9 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm10 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm11 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm12 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm13 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm14 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm15 + movaps %xmm0,0x90(%rsp) + lea 0xa0+0x28(%rsp),%rax +.Locb_dec_pop: +___ +$code.=<<___; + mov -40(%rax),%r14 +.cfi_restore %r14 + mov -32(%rax),%r13 +.cfi_restore %r13 + mov -24(%rax),%r12 +.cfi_restore %r12 + mov -16(%rax),%rbp +.cfi_restore %rbp + mov -8(%rax),%rbx +.cfi_restore %rbx + lea (%rax),%rsp +.cfi_def_cfa_register %rsp +.Locb_dec_epilogue: + ret +.cfi_endproc +.size aesni_ocb_decrypt,.-aesni_ocb_decrypt + +.type __ocb_decrypt6,\@abi-omnipotent +.align 32 +__ocb_decrypt6: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + movdqa @offset[0],@offset[4] + pxor @offset[5],@offset[0] + movdqu ($L_p,$i5),@offset[5] + pxor @offset[0],@offset[1] + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor @offset[2],$inout2 + pxor @offset[3],@offset[4] + pxor @offset[3],$inout3 + pxor @offset[4],@offset[5] + pxor @offset[4],$inout4 + pxor @offset[5],$inout5 + $movkey 32($key_),$rndkey0 + + lea 
1($block_num),$i1 # even-numbered blocks + lea 3($block_num),$i3 + lea 5($block_num),$i5 + add \$6,$block_num + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + bsf $i1,$i1 # ntz(block) + bsf $i3,$i3 + bsf $i5,$i5 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + aesdec $rndkey1,$inout4 + pxor $rndkey0l,@offset[3] + pxor $rndkey0l,@offset[4] + aesdec $rndkey1,$inout5 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,@offset[5] + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey 64($key_),$rndkey0 + shl \$4,$i1 # ntz(block) -> table offset + shl \$4,$i3 + jmp .Locb_dec_loop6 + +.align 32 +.Locb_dec_loop6: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + aesdec $rndkey0,$inout4 + aesdec $rndkey0,$inout5 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop6 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + $movkey 16($key_),$rndkey1 + shl \$4,$i5 + + aesdeclast @offset[0],$inout0 + movdqu ($L_p),@offset[0] # L_0 for all odd-numbered blocks + mov %r10,%rax # restore twisted rounds + aesdeclast @offset[1],$inout1 + aesdeclast @offset[2],$inout2 + aesdeclast @offset[3],$inout3 + aesdeclast @offset[4],$inout4 + aesdeclast @offset[5],$inout5 + ret +.size __ocb_decrypt6,.-__ocb_decrypt6 + +.type __ocb_decrypt4,\@abi-omnipotent +.align 32 +__ocb_decrypt4: + pxor $rndkey0l,@offset[5] # offset_i ^ round[0] + movdqu ($L_p,$i1),@offset[1] + movdqa @offset[0],@offset[2] + movdqu ($L_p,$i3),@offset[3] + pxor @offset[5],@offset[0] + pxor @offset[0],@offset[1] + pxor @offset[0],$inout0 # input ^ round[0] ^ offset_i + pxor @offset[1],@offset[2] + pxor @offset[1],$inout1 + pxor @offset[2],@offset[3] + pxor @offset[2],$inout2 + pxor @offset[3],$inout3 + $movkey 32($key_),$rndkey0 + + pxor $rndkey0l,@offset[0] # offset_i ^ round[last] + pxor $rndkey0l,@offset[1] + pxor $rndkey0l,@offset[2] + pxor $rndkey0l,@offset[3] + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey 48($key_),$rndkey1 + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + $movkey 64($key_),$rndkey0 + jmp .Locb_dec_loop4 + +.align 32 +.Locb_dec_loop4: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + aesdec $rndkey0,$inout1 + aesdec $rndkey0,$inout2 + aesdec $rndkey0,$inout3 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop4 + + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + $movkey 16($key_),$rndkey1 + mov %r10,%rax # restore twisted rounds + + aesdeclast @offset[0],$inout0 + aesdeclast @offset[1],$inout1 + aesdeclast @offset[2],$inout2 + aesdeclast @offset[3],$inout3 + ret +.size __ocb_decrypt4,.-__ocb_decrypt4 + +.type __ocb_decrypt1,\@abi-omnipotent +.align 32 +__ocb_decrypt1: + pxor @offset[5],$inout5 # offset_i + pxor $rndkey0l,$inout5 # 
offset_i ^ round[0] + pxor $inout5,$inout0 # input ^ round[0] ^ offset_i + $movkey 32($key_),$rndkey0 + + aesdec $rndkey1,$inout0 + $movkey 48($key_),$rndkey1 + pxor $rndkey0l,$inout5 # offset_i ^ round[last] + + aesdec $rndkey0,$inout0 + $movkey 64($key_),$rndkey0 + jmp .Locb_dec_loop1 + +.align 32 +.Locb_dec_loop1: + aesdec $rndkey1,$inout0 + $movkey ($key,%rax),$rndkey1 + add \$32,%rax + + aesdec $rndkey0,$inout0 + $movkey -16($key,%rax),$rndkey0 + jnz .Locb_dec_loop1 + + aesdec $rndkey1,$inout0 + $movkey 16($key_),$rndkey1 # redundant in tail + mov %r10,%rax # restore twisted rounds + + aesdeclast $inout5,$inout0 + ret +.size __ocb_decrypt1,.-__ocb_decrypt1 +___ +} }} + +######################################################################## +# void $PREFIX_cbc_encrypt (const void *inp, void *out, +# size_t length, const AES_KEY *key, +# unsigned char *ivp,const int enc); +if (0) { +my $frame_size = 0x10 + ($win64?0xa0:0); # used in decrypt +my ($iv,$in0,$in1,$in2,$in3,$in4)=map("%xmm$_",(10..15)); + +$code.=<<___; +.globl ${PREFIX}_cbc_encrypt +.type ${PREFIX}_cbc_encrypt,\@function,6 +.align 16 +${PREFIX}_cbc_encrypt: +.cfi_startproc + test $len,$len # check length + jz .Lcbc_ret + + mov 240($key),$rnds_ # key->rounds + mov $key,$key_ # backup $key + test %r9d,%r9d # 6th argument + jz .Lcbc_decrypt +#--------------------------- CBC ENCRYPT ------------------------------# + movups ($ivp),$inout0 # load iv as initial state + mov $rnds_,$rounds + cmp \$16,$len + jb .Lcbc_enc_tail + sub \$16,$len + jmp .Lcbc_enc_loop +.align 16 +.Lcbc_enc_loop: + movups ($inp),$inout1 # load input + lea 16($inp),$inp + #xorps $inout1,$inout0 +___ + &aesni_generate1("enc",$key,$rounds,$inout0,$inout1); +$code.=<<___; + mov $rnds_,$rounds # restore $rounds + mov $key_,$key # restore $key + movups $inout0,0($out) # store output + lea 16($out),$out + sub \$16,$len + jnc .Lcbc_enc_loop + add \$16,$len + jnz .Lcbc_enc_tail + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + movups $inout0,($ivp) + pxor $inout0,$inout0 + pxor $inout1,$inout1 + jmp .Lcbc_ret + +.Lcbc_enc_tail: + mov $len,%rcx # zaps $key + xchg $inp,$out # $inp is %rsi and $out is %rdi now + .long 0x9066A4F3 # rep movsb + mov \$16,%ecx # zero tail + sub $len,%rcx + xor %eax,%eax + .long 0x9066AAF3 # rep stosb + lea -16(%rdi),%rdi # rewind $out by 1 block + mov $rnds_,$rounds # restore $rounds + mov %rdi,%rsi # $inp and $out are the same + mov $key_,$key # restore $key + xor $len,$len # len=16 + jmp .Lcbc_enc_loop # one more spin +#--------------------------- CBC DECRYPT ------------------------------# +.align 16 +.Lcbc_decrypt: + cmp \$16,$len + jne .Lcbc_decrypt_bulk + + # handle single block without allocating stack frame, + # useful in ciphertext stealing mode + movdqu ($inp),$inout0 # load input + movdqu ($ivp),$inout1 # load iv + movdqa $inout0,$inout2 # future iv +___ + &aesni_generate1("dec",$key,$rnds_); +$code.=<<___; + pxor $rndkey0,$rndkey0 # clear register bank + pxor $rndkey1,$rndkey1 + movdqu $inout2,($ivp) # store iv + xorps $inout1,$inout0 # ^=iv + pxor $inout1,$inout1 + movups $inout0,($out) # store output + pxor $inout0,$inout0 + jmp .Lcbc_ret +.align 16 +.Lcbc_decrypt_bulk: + lea (%rsp),%r11 # frame pointer +.cfi_def_cfa_register %r11 + push %rbp +.cfi_push %rbp + sub \$$frame_size,%rsp + and \$-16,%rsp # Linux kernel stack can be incorrectly seeded +___ +$code.=<<___ if ($win64); + movaps %xmm6,0x10(%rsp) + movaps %xmm7,0x20(%rsp) + movaps %xmm8,0x30(%rsp) + movaps %xmm9,0x40(%rsp) + movaps 
%xmm10,0x50(%rsp) + movaps %xmm11,0x60(%rsp) + movaps %xmm12,0x70(%rsp) + movaps %xmm13,0x80(%rsp) + movaps %xmm14,0x90(%rsp) + movaps %xmm15,0xa0(%rsp) +.Lcbc_decrypt_body: +___ + +my $inp_=$key_="%rbp"; # reassign $key_ + +$code.=<<___; + mov $key,$key_ # [re-]backup $key [after reassignment] + movups ($ivp),$iv + mov $rnds_,$rounds + cmp \$0x50,$len + jbe .Lcbc_dec_tail + + $movkey ($key),$rndkey0 + movdqu 0x00($inp),$inout0 # load input + movdqu 0x10($inp),$inout1 + movdqa $inout0,$in0 + movdqu 0x20($inp),$inout2 + movdqa $inout1,$in1 + movdqu 0x30($inp),$inout3 + movdqa $inout2,$in2 + movdqu 0x40($inp),$inout4 + movdqa $inout3,$in3 + movdqu 0x50($inp),$inout5 + movdqa $inout4,$in4 + leaq OPENSSL_ia32cap_P(%rip),%r9 + mov 4(%r9),%r9d + cmp \$0x70,$len + jbe .Lcbc_dec_six_or_seven + + and \$`1<<26|1<<22`,%r9d # isolate XSAVE+MOVBE + sub \$0x50,$len # $len is biased by -5*16 + cmp \$`1<<22`,%r9d # check for MOVBE without XSAVE + je .Lcbc_dec_loop6_enter # [which denotes Atom Silvermont] + sub \$0x20,$len # $len is biased by -7*16 + lea 0x70($key),$key # size optimization + jmp .Lcbc_dec_loop8_enter +.align 16 +.Lcbc_dec_loop8: + movups $inout7,($out) + lea 0x10($out),$out +.Lcbc_dec_loop8_enter: + movdqu 0x60($inp),$inout6 + pxor $rndkey0,$inout0 + movdqu 0x70($inp),$inout7 + pxor $rndkey0,$inout1 + $movkey 0x10-0x70($key),$rndkey1 + pxor $rndkey0,$inout2 + mov \$-1,$inp_ + cmp \$0x70,$len # is there at least 0x60 bytes ahead? + pxor $rndkey0,$inout3 + pxor $rndkey0,$inout4 + pxor $rndkey0,$inout5 + pxor $rndkey0,$inout6 + + aesdec $rndkey1,$inout0 + pxor $rndkey0,$inout7 + $movkey 0x20-0x70($key),$rndkey0 + aesdec $rndkey1,$inout1 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + aesdec $rndkey1,$inout6 + adc \$0,$inp_ + and \$128,$inp_ + aesdec $rndkey1,$inout7 + add $inp,$inp_ + $movkey 0x30-0x70($key),$rndkey1 +___ +for($i=1;$i<12;$i++) { +my $rndkeyx = ($i&1)?$rndkey0:$rndkey1; +$code.=<<___ if ($i==7); + cmp \$11,$rounds +___ +$code.=<<___; + aesdec $rndkeyx,$inout0 + aesdec $rndkeyx,$inout1 + aesdec $rndkeyx,$inout2 + aesdec $rndkeyx,$inout3 + aesdec $rndkeyx,$inout4 + aesdec $rndkeyx,$inout5 + aesdec $rndkeyx,$inout6 + aesdec $rndkeyx,$inout7 + $movkey `0x30+0x10*$i`-0x70($key),$rndkeyx +___ +$code.=<<___ if ($i<6 || (!($i&1) && $i>7)); + nop +___ +$code.=<<___ if ($i==7); + jb .Lcbc_dec_done +___ +$code.=<<___ if ($i==9); + je .Lcbc_dec_done +___ +$code.=<<___ if ($i==11); + jmp .Lcbc_dec_done +___ +} +$code.=<<___; +.align 16 +.Lcbc_dec_done: + aesdec $rndkey1,$inout0 + aesdec $rndkey1,$inout1 + pxor $rndkey0,$iv + pxor $rndkey0,$in0 + aesdec $rndkey1,$inout2 + aesdec $rndkey1,$inout3 + pxor $rndkey0,$in1 + pxor $rndkey0,$in2 + aesdec $rndkey1,$inout4 + aesdec $rndkey1,$inout5 + pxor $rndkey0,$in3 + pxor $rndkey0,$in4 + aesdec $rndkey1,$inout6 + aesdec $rndkey1,$inout7 + movdqu 0x50($inp),$rndkey1 + + aesdeclast $iv,$inout0 + movdqu 0x60($inp),$iv # borrow $iv + pxor $rndkey0,$rndkey1 + aesdeclast $in0,$inout1 + pxor $rndkey0,$iv + movdqu 0x70($inp),$rndkey0 # next IV + aesdeclast $in1,$inout2 + lea 0x80($inp),$inp + movdqu 0x00($inp_),$in0 + aesdeclast $in2,$inout3 + aesdeclast $in3,$inout4 + movdqu 0x10($inp_),$in1 + movdqu 0x20($inp_),$in2 + aesdeclast $in4,$inout5 + aesdeclast $rndkey1,$inout6 + movdqu 0x30($inp_),$in3 + movdqu 0x40($inp_),$in4 + aesdeclast $iv,$inout7 + movdqa $rndkey0,$iv # return $iv + movdqu 0x50($inp_),$rndkey1 + $movkey -0x70($key),$rndkey0 + + movups $inout0,($out) # store output + movdqa 
$in0,$inout0 + movups $inout1,0x10($out) + movdqa $in1,$inout1 + movups $inout2,0x20($out) + movdqa $in2,$inout2 + movups $inout3,0x30($out) + movdqa $in3,$inout3 + movups $inout4,0x40($out) + movdqa $in4,$inout4 + movups $inout5,0x50($out) + movdqa $rndkey1,$inout5 + movups $inout6,0x60($out) + lea 0x70($out),$out + + sub \$0x80,$len + ja .Lcbc_dec_loop8 + + movaps $inout7,$inout0 + lea -0x70($key),$key + add \$0x70,$len + jle .Lcbc_dec_clear_tail_collected + movups $inout7,($out) + lea 0x10($out),$out + cmp \$0x50,$len + jbe .Lcbc_dec_tail + + movaps $in0,$inout0 +.Lcbc_dec_six_or_seven: + cmp \$0x60,$len + ja .Lcbc_dec_seven + + movaps $inout5,$inout6 + call _aesni_decrypt6 + pxor $iv,$inout0 # ^= IV + movaps $inout6,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + pxor $in4,$inout5 + movdqu $inout4,0x40($out) + pxor $inout4,$inout4 + lea 0x50($out),$out + movdqa $inout5,$inout0 + pxor $inout5,$inout5 + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_seven: + movups 0x60($inp),$inout6 + xorps $inout7,$inout7 + call _aesni_decrypt8 + movups 0x50($inp),$inout7 + pxor $iv,$inout0 # ^= IV + movups 0x60($inp),$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + pxor $in4,$inout5 + movdqu $inout4,0x40($out) + pxor $inout4,$inout4 + pxor $inout7,$inout6 + movdqu $inout5,0x50($out) + pxor $inout5,$inout5 + lea 0x60($out),$out + movdqa $inout6,$inout0 + pxor $inout6,$inout6 + pxor $inout7,$inout7 + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_loop6: + movups $inout5,($out) + lea 0x10($out),$out + movdqu 0x00($inp),$inout0 # load input + movdqu 0x10($inp),$inout1 + movdqa $inout0,$in0 + movdqu 0x20($inp),$inout2 + movdqa $inout1,$in1 + movdqu 0x30($inp),$inout3 + movdqa $inout2,$in2 + movdqu 0x40($inp),$inout4 + movdqa $inout3,$in3 + movdqu 0x50($inp),$inout5 + movdqa $inout4,$in4 +.Lcbc_dec_loop6_enter: + lea 0x60($inp),$inp + movdqa $inout5,$inout6 + + call _aesni_decrypt6 + + pxor $iv,$inout0 # ^= IV + movdqa $inout6,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $in3,$inout4 + mov $key_,$key + movdqu $inout3,0x30($out) + pxor $in4,$inout5 + mov $rnds_,$rounds + movdqu $inout4,0x40($out) + lea 0x50($out),$out + sub \$0x60,$len + ja .Lcbc_dec_loop6 + + movdqa $inout5,$inout0 + add \$0x50,$len + jle .Lcbc_dec_clear_tail_collected + movups $inout5,($out) + lea 0x10($out),$out + +.Lcbc_dec_tail: + movups ($inp),$inout0 + sub \$0x10,$len + jbe .Lcbc_dec_one # $len is 1*16 or less + + movups 0x10($inp),$inout1 + movaps $inout0,$in0 + sub \$0x10,$len + jbe .Lcbc_dec_two # $len is 2*16 or less + + movups 0x20($inp),$inout2 + movaps $inout1,$in1 + sub \$0x10,$len + jbe .Lcbc_dec_three # $len is 3*16 or less + + movups 0x30($inp),$inout3 + movaps $inout2,$in2 + sub \$0x10,$len + jbe .Lcbc_dec_four # $len is 4*16 or less + + movups 0x40($inp),$inout4 # $len is 5*16 or less + movaps $inout3,$in3 + movaps $inout4,$in4 + xorps $inout5,$inout5 + call _aesni_decrypt6 + pxor $iv,$inout0 + movaps $in4,$iv + pxor $in0,$inout1 + movdqu 
$inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + pxor $in3,$inout4 + movdqu $inout3,0x30($out) + pxor $inout3,$inout3 + lea 0x40($out),$out + movdqa $inout4,$inout0 + pxor $inout4,$inout4 + pxor $inout5,$inout5 + sub \$0x10,$len + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_one: + movaps $inout0,$in0 +___ + &aesni_generate1("dec",$key,$rounds); +$code.=<<___; + xorps $iv,$inout0 + movaps $in0,$iv + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_two: + movaps $inout1,$in1 + call _aesni_decrypt2 + pxor $iv,$inout0 + movaps $in1,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + movdqa $inout1,$inout0 + pxor $inout1,$inout1 # clear register bank + lea 0x10($out),$out + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_three: + movaps $inout2,$in2 + call _aesni_decrypt3 + pxor $iv,$inout0 + movaps $in2,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + movdqa $inout2,$inout0 + pxor $inout2,$inout2 + lea 0x20($out),$out + jmp .Lcbc_dec_tail_collected +.align 16 +.Lcbc_dec_four: + movaps $inout3,$in3 + call _aesni_decrypt4 + pxor $iv,$inout0 + movaps $in3,$iv + pxor $in0,$inout1 + movdqu $inout0,($out) + pxor $in1,$inout2 + movdqu $inout1,0x10($out) + pxor $inout1,$inout1 # clear register bank + pxor $in2,$inout3 + movdqu $inout2,0x20($out) + pxor $inout2,$inout2 + movdqa $inout3,$inout0 + pxor $inout3,$inout3 + lea 0x30($out),$out + jmp .Lcbc_dec_tail_collected + +.align 16 +.Lcbc_dec_clear_tail_collected: + pxor $inout1,$inout1 # clear register bank + pxor $inout2,$inout2 + pxor $inout3,$inout3 +___ +$code.=<<___ if (!$win64); + pxor $inout4,$inout4 # %xmm6..9 + pxor $inout5,$inout5 + pxor $inout6,$inout6 + pxor $inout7,$inout7 +___ +$code.=<<___; +.Lcbc_dec_tail_collected: + movups $iv,($ivp) + and \$15,$len + jnz .Lcbc_dec_tail_partial + movups $inout0,($out) + pxor $inout0,$inout0 + jmp .Lcbc_dec_ret +.align 16 +.Lcbc_dec_tail_partial: + movaps $inout0,(%rsp) + pxor $inout0,$inout0 + mov \$16,%rcx + mov $out,%rdi + sub $len,%rcx + lea (%rsp),%rsi + .long 0x9066A4F3 # rep movsb + movdqa $inout0,(%rsp) + +.Lcbc_dec_ret: + xorps $rndkey0,$rndkey0 # %xmm0 + pxor $rndkey1,$rndkey1 +___ +$code.=<<___ if ($win64); + movaps 0x10(%rsp),%xmm6 + movaps %xmm0,0x10(%rsp) # clear stack + movaps 0x20(%rsp),%xmm7 + movaps %xmm0,0x20(%rsp) + movaps 0x30(%rsp),%xmm8 + movaps %xmm0,0x30(%rsp) + movaps 0x40(%rsp),%xmm9 + movaps %xmm0,0x40(%rsp) + movaps 0x50(%rsp),%xmm10 + movaps %xmm0,0x50(%rsp) + movaps 0x60(%rsp),%xmm11 + movaps %xmm0,0x60(%rsp) + movaps 0x70(%rsp),%xmm12 + movaps %xmm0,0x70(%rsp) + movaps 0x80(%rsp),%xmm13 + movaps %xmm0,0x80(%rsp) + movaps 0x90(%rsp),%xmm14 + movaps %xmm0,0x90(%rsp) + movaps 0xa0(%rsp),%xmm15 + movaps %xmm0,0xa0(%rsp) +___ +$code.=<<___; + mov -8(%r11),%rbp +.cfi_restore %rbp + lea (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lcbc_ret: + ret +.cfi_endproc +.size ${PREFIX}_cbc_encrypt,.-${PREFIX}_cbc_encrypt +___ +} +# int ${PREFIX}_set_decrypt_key(const unsigned char *inp, +# int bits, AES_KEY *key) +# +# input: $inp user-supplied key +# $bits $inp length in bits +# $key pointer to key schedule +# output: %eax 0 denoting success, -1 or -2 - failure (see C) +# *$key key schedule +# +{ my ($inp,$bits,$key) = @_4args; + $bits =~ s/%r/%e/; + +$code.=<<___; +.globl ${PREFIX}_set_decrypt_key +.type ${PREFIX}_set_decrypt_key,\@abi-omnipotent +.align 
16 +${PREFIX}_set_decrypt_key: +.cfi_startproc + .byte 0x48,0x83,0xEC,0x08 # sub rsp,8 +.cfi_adjust_cfa_offset 8 + call __aesni_set_encrypt_key + shl \$4,$bits # rounds-1 after __aesni_set_encrypt_key + test %eax,%eax + jnz .Ldec_key_ret + lea 16($key,$bits),$inp # points at the end of key schedule + + $movkey ($key),%xmm0 # just swap + $movkey ($inp),%xmm1 + $movkey %xmm0,($inp) + $movkey %xmm1,($key) + lea 16($key),$key + lea -16($inp),$inp + +.Ldec_key_inverse: + $movkey ($key),%xmm0 # swap and invert + $movkey ($inp),%xmm1 + aesimc %xmm0,%xmm0 + aesimc %xmm1,%xmm1 + lea 16($key),$key + lea -16($inp),$inp + $movkey %xmm0,16($inp) + $movkey %xmm1,-16($key) + cmp $key,$inp + ja .Ldec_key_inverse + + $movkey ($key),%xmm0 # invert middle + aesimc %xmm0,%xmm0 + pxor %xmm1,%xmm1 + $movkey %xmm0,($inp) + pxor %xmm0,%xmm0 +.Ldec_key_ret: + add \$8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_decrypt_key: +.size ${PREFIX}_set_decrypt_key,.-${PREFIX}_set_decrypt_key +___ + +# This is based on a submission from Intel by +# Huang Ying +# Vinodh Gopal +# Kahraman Akdemir +# +# Aggressively optimized with respect to aeskeygenassist's critical path; +# everything is contained in %xmm0-5 to meet the Win64 ABI requirement. +# +# int ${PREFIX}_set_encrypt_key(const unsigned char *inp, +# int bits, AES_KEY * const key); +# +# input: $inp user-supplied key +# $bits $inp length in bits +# $key pointer to key schedule +# output: %eax 0 denoting success, -1 or -2 denoting failure (see C) +# $bits rounds-1 (used in aesni_set_decrypt_key) +# *$key key schedule +# $key pointer to key schedule (used in +# aesni_set_decrypt_key) +# +# The subroutine is frame-less, which means that only volatile registers +# are used. Note that it's declared "abi-omnipotent", which means that +# the number of volatile registers is smaller on Windows.
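Two notes on the key-schedule routines. First, what ${PREFIX}_set_decrypt_key above computes: the AES decryption schedule is the encryption schedule in reverse order, with aesimc (inverse MixColumns) applied to every round key except the first and the last. The assembly swaps pairs in place; an out-of-place sketch with AES-NI intrinsics (helper name ours, build with -maes):

    #include <wmmintrin.h>

    static void make_decrypt_schedule(__m128i dk[], const __m128i ek[], int rounds)
    {
        dk[0]      = ek[rounds];            /* the two end keys are only swapped */
        dk[rounds] = ek[0];
        for (int i = 1; i < rounds; i++)    /* the middle keys also get aesimc */
            dk[i] = _mm_aesimc_si128(ek[rounds - i]);
    }

Second, the ${PREFIX}_set_encrypt_key body that follows avoids aeskeygenassist (the commented-out path) by building SubWord(RotWord(w3)) ^ rcon out of pshufb and aesenclast: the .Lkey_rotate mask broadcasts the rotated last word into all four lanes, and aesenclast against a register holding the round constant performs SubBytes plus the rcon XOR, ShiftRows being a no-op when all columns are equal. Three pslldq/pxor steps then provide the sliding XOR of the previous round key. One 128-bit expansion round as an intrinsics sketch under the same caveats (the real loop also doubles rcon with pslld \$1 each iteration):

    #include <immintrin.h>

    static __m128i aes128_expand_step(__m128i key, __m128i rcon)
    {
        const __m128i rot = _mm_set1_epi32(0x0c0f0e0d);    /* .Lkey_rotate */
        __m128i t = _mm_shuffle_epi8(key, rot);            /* RotWord(w3), broadcast */
        t = _mm_aesenclast_si128(t, rcon);                 /* SubBytes + rcon */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* sliding XOR: after  */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* three rounds, key = */
        key = _mm_xor_si128(key, _mm_slli_si128(key, 4));  /* k^k<<32^k<<64^k<<96 */
        return _mm_xor_si128(key, t);
    }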
+# +$code.=<<___; +.globl ${PREFIX}_set_encrypt_key +.type ${PREFIX}_set_encrypt_key,\@abi-omnipotent +.align 16 +${PREFIX}_set_encrypt_key: +__aesni_set_encrypt_key: +.cfi_startproc + .byte 0x48,0x83,0xEC,0x08 # sub rsp,8 +.cfi_adjust_cfa_offset 8 + mov \$-1,%rax + test $inp,$inp + jz .Lenc_key_ret + test $key,$key + jz .Lenc_key_ret + + movups ($inp),%xmm0 # pull first 128 bits of *userKey + xorps %xmm4,%xmm4 # low dword of xmm4 is assumed 0 +# leaq OPENSSL_ia32cap_P(%rip),%r10 +# movl 4(%r10),%r10d +# and \$`1<<28|1<<11`,%r10d # AVX and XOP bits + lea 16($key),%rax # %rax is used as modifiable copy of $key + cmp \$256,$bits + je .L14rounds + cmp \$192,$bits + je .L12rounds + cmp \$128,$bits + jne .Lbad_keybits + +.L10rounds: + mov \$9,$bits # 10 rounds for 128-bit key +# cmp \$`1<<28`,%r10d # AVX, bit no XOP +# je .L10rounds_alt +# jmp .L10rounds_alt +# $movkey %xmm0,($key) # round 0 +# aeskeygenassist \$0x1,%xmm0,%xmm1 # round 1 +# call .Lkey_expansion_128_cold +# aeskeygenassist \$0x2,%xmm0,%xmm1 # round 2 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x4,%xmm0,%xmm1 # round 3 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x8,%xmm0,%xmm1 # round 4 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x10,%xmm0,%xmm1 # round 5 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x20,%xmm0,%xmm1 # round 6 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x40,%xmm0,%xmm1 # round 7 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x80,%xmm0,%xmm1 # round 8 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x1b,%xmm0,%xmm1 # round 9 +# call .Lkey_expansion_128 +# aeskeygenassist \$0x36,%xmm0,%xmm1 # round 10 +# call .Lkey_expansion_128 +# $movkey %xmm0,(%rax) +# mov $bits,80(%rax) # 240(%rdx) +# xor %eax,%eax +# jmp .Lenc_key_ret + +#.align 16 +#.L10rounds_alt: + movdqa .Lkey_rotate(%rip),%xmm5 + mov \$8,%r10d + movdqa .Lkey_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,($key) + jmp .Loop_key128 + +.align 16 +.Loop_key128: + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + pslld \$1,%xmm4 + lea 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + dec %r10d + jnz .Loop_key128 + + movdqa .Lkey_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + pslld \$1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 + aesenclast %xmm4,%xmm0 + + movdqa %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm2,%xmm3 + pslldq \$4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + mov $bits,96(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L12rounds: + movq 16($inp),%xmm2 # remaining 1/3 of *userKey + mov \$11,$bits # 12 rounds for 192 +# cmp \$`1<<28`,%r10d # AVX, but no XOP +# je .L12rounds_alt + +# $movkey %xmm0,($key) # round 0 +# aeskeygenassist \$0x1,%xmm2,%xmm1 # round 1,2 +# call .Lkey_expansion_192a_cold +# aeskeygenassist \$0x2,%xmm2,%xmm1 # round 2,3 +# call .Lkey_expansion_192b +# aeskeygenassist \$0x4,%xmm2,%xmm1 # round 4,5 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x8,%xmm2,%xmm1 # round 5,6 +# call .Lkey_expansion_192b +# aeskeygenassist \$0x10,%xmm2,%xmm1 # round 7,8 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x20,%xmm2,%xmm1 # round 8,9 +# call 
.Lkey_expansion_192b +# aeskeygenassist \$0x40,%xmm2,%xmm1 # round 10,11 +# call .Lkey_expansion_192a +# aeskeygenassist \$0x80,%xmm2,%xmm1 # round 11,12 +# call .Lkey_expansion_192b +# $movkey %xmm0,(%rax) +# mov $bits,48(%rax) # 240(%rdx) +# xor %rax, %rax +# jmp .Lenc_key_ret + +#.align 16 +#.L12rounds_alt: + movdqa .Lkey_rotate192(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + mov \$8,%r10d + movdqu %xmm0,($key) + jmp .Loop_key192 + +.align 16 +.Loop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 + aesenclast %xmm4,%xmm2 + pslld \$1, %xmm4 + lea 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd \$0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + dec %r10d + jnz .Loop_key192 + + mov $bits,32(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L14rounds: + movups 16($inp),%xmm2 # remaining half of *userKey + mov \$13,$bits # 14 rounds for 256 + lea 16(%rax),%rax +# cmp \$`1<<28`,%r10d # AVX, but no XOP +# je .L14rounds_alt +# +# $movkey %xmm0,($key) # round 0 +# $movkey %xmm2,16($key) # round 1 +# aeskeygenassist \$0x1,%xmm2,%xmm1 # round 2 +# call .Lkey_expansion_256a_cold +# aeskeygenassist \$0x1,%xmm0,%xmm1 # round 3 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x2,%xmm2,%xmm1 # round 4 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x2,%xmm0,%xmm1 # round 5 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x4,%xmm2,%xmm1 # round 6 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x4,%xmm0,%xmm1 # round 7 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x8,%xmm2,%xmm1 # round 8 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x8,%xmm0,%xmm1 # round 9 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x10,%xmm2,%xmm1 # round 10 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x10,%xmm0,%xmm1 # round 11 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x20,%xmm2,%xmm1 # round 12 +# call .Lkey_expansion_256a +# aeskeygenassist \$0x20,%xmm0,%xmm1 # round 13 +# call .Lkey_expansion_256b +# aeskeygenassist \$0x40,%xmm2,%xmm1 # round 14 +# call .Lkey_expansion_256a +# $movkey %xmm0,(%rax) +# mov $bits,16(%rax) # 240(%rdx) +# xor %rax,%rax +# jmp .Lenc_key_ret + +#.align 16 +#.L14rounds_alt: + movdqa .Lkey_rotate(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + mov \$7,%r10d + movdqu %xmm0,0($key) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16($key) + jmp .Loop_key256 + +.align 16 +.Loop_key256: + pshufb %xmm5,%xmm2 + aesenclast %xmm4,%xmm2 + + movdqa %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm0,%xmm3 + pslldq \$4,%xmm0 + pxor %xmm3,%xmm0 + pslld \$1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + dec %r10d + jz .Ldone_key256 + + pshufd \$0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 + aesenclast %xmm3,%xmm2 + + movdqa %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm1,%xmm3 + pslldq \$4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + lea 32(%rax),%rax + movdqa %xmm2,%xmm1 + + jmp .Loop_key256 + +.Ldone_key256: + mov $bits,16(%rax) # 240($key) + xor %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.Lbad_keybits: + mov \$-2,%rax +.Lenc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + add \$8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_encrypt_key: + +#.align 16 
+#.Lkey_expansion_128: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_128_cold: +# shufps \$0b00010000,%xmm0,%xmm4 +# xorps %xmm4, %xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# xorps %xmm4, %xmm0 +# shufps \$0b11111111,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm0 +# ret + +#.align 16 +#.Lkey_expansion_192a: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_192a_cold: +# movaps %xmm2, %xmm5 +#.Lkey_expansion_192b_warm: +# shufps \$0b00010000,%xmm0,%xmm4 +# movdqa %xmm2,%xmm3 +# xorps %xmm4,%xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# pslldq \$4,%xmm3 +# xorps %xmm4,%xmm0 +# pshufd \$0b01010101,%xmm1,%xmm1 # critical path +# pxor %xmm3,%xmm2 +# pxor %xmm1,%xmm0 +# pshufd \$0b11111111,%xmm0,%xmm3 +# pxor %xmm3,%xmm2 +# ret +# +#.align 16 +#.Lkey_expansion_192b: +# movaps %xmm0,%xmm3 +# shufps \$0b01000100,%xmm0,%xmm5 +# $movkey %xmm5,(%rax) +# shufps \$0b01001110,%xmm2,%xmm3 +# $movkey %xmm3,16(%rax) +# lea 32(%rax),%rax +# jmp .Lkey_expansion_192b_warm +# +#.align 16 +#.Lkey_expansion_256a: +# $movkey %xmm2,(%rax) +# lea 16(%rax),%rax +#.Lkey_expansion_256a_cold: +# shufps \$0b00010000,%xmm0,%xmm4 +# xorps %xmm4,%xmm0 +# shufps \$0b10001100,%xmm0,%xmm4 +# xorps %xmm4,%xmm0 +# shufps \$0b11111111,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm0 +# ret +# +#.align 16 +#.Lkey_expansion_256b: +# $movkey %xmm0,(%rax) +# lea 16(%rax),%rax +# +# shufps \$0b00010000,%xmm2,%xmm4 +# xorps %xmm4,%xmm2 +# shufps \$0b10001100,%xmm2,%xmm4 +# xorps %xmm4,%xmm2 +# shufps \$0b10101010,%xmm1,%xmm1 # critical path +# xorps %xmm1,%xmm2 +# ret +.size ${PREFIX}_set_encrypt_key,.-${PREFIX}_set_encrypt_key +.size __aesni_set_encrypt_key,.-__aesni_set_encrypt_key +___ +} + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lincrement32: + .long 6,6,6,0 +.Lincrement64: + .long 1,0,0,0 +.Lxts_magic: + .long 0x87,0,1,0 +.Lincrement1: + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Lkey_rotate: + .long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +.Lkey_rotate192: + .long 0x04070605,0x04070605,0x04070605,0x04070605 +.Lkey_rcon1: + .long 1,1,1,1 +.Lkey_rcon1b: + .long 0x1b,0x1b,0x1b,0x1b + +.align 64 +___ + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +___ +$code.=<<___ if ($PREFIX eq "aesni" && 0); +.type ecb_ccm64_se_handler,\@abi-omnipotent +.align 16 +ecb_ccm64_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + lea 0(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$8,%ecx # 4*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + lea 0x58(%rax),%rax # adjust stack pointer + + jmp .Lcommon_seh_tail +.size ecb_ccm64_se_handler,.-ecb_ccm64_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni"); +.type ctr_xts_se_handler,\@abi-omnipotent +.align 16 +ctr_xts_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + 
 push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 208($context),%rax # pull context->R11 + + lea -0xa8(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + + mov -8(%rax),%rbp # restore saved %rbp + mov %rbp,160($context) # restore context->Rbp + jmp .Lcommon_seh_tail +.size ctr_xts_se_handler,.-ctr_xts_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni" && 0); +.type ocb_se_handler,\@abi-omnipotent +.align 16 +ocb_se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 8(%r11),%r10d # HandlerData[2] + lea (%rsi,%r10),%r10 + cmp %r10,%rbx # context->Rip>=pop label + jae .Locb_no_xmm + + mov 152($context),%rax # pull context->Rsp + + lea (%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + lea 0xa0+0x28(%rax),%rax + +.Locb_no_xmm: + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + + jmp .Lcommon_seh_tail +.size ocb_se_handler,.-ocb_se_handler +___ +$code.=<<___; +.type cbc_se_handler,\@abi-omnipotent +.align 16 +cbc_se_handler: +___ +$code.=<<___ if (0); + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 152($context),%rax # pull context->Rsp + mov 248($context),%rbx # pull context->Rip + + lea .Lcbc_decrypt_bulk(%rip),%r10 + cmp %r10,%rbx # context->Rip<"prologue" label + jb .Lcommon_seh_tail + + mov 120($context),%rax # pull context->Rax + + lea .Lcbc_decrypt_body(%rip),%r10 + cmp %r10,%rbx # context->Rip<cbc_decrypt_body + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + lea .Lcbc_ret(%rip),%r10 + cmp %r10,%rbx # context->Rip>="epilogue" label + jae .Lcommon_seh_tail + + lea 16(%rax),%rsi # %xmm save area + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx # 10*sizeof(%xmm0)/sizeof(%rax) + .long 0xa548f3fc # cld; rep movsq + + mov 208($context),%rax # pull context->R11 + + mov -8(%rax),%rbp # restore saved %rbp + mov %rbp,160($context) # restore context->Rbp + +___ +$code.=<<___; +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; 
rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size cbc_se_handler,.-cbc_se_handler +___ +$code.=<<___ if ($PREFIX eq "aesni"); +.section .pdata +.align 4 +# .rva .LSEH_begin_aesni_ecb_encrypt +# .rva .LSEH_end_aesni_ecb_encrypt +# .rva .LSEH_info_ecb + +# .rva .LSEH_begin_aesni_ccm64_encrypt_blocks +# .rva .LSEH_end_aesni_ccm64_encrypt_blocks +# .rva .LSEH_info_ccm64_enc + +# .rva .LSEH_begin_aesni_ccm64_decrypt_blocks +# .rva .LSEH_end_aesni_ccm64_decrypt_blocks +# .rva .LSEH_info_ccm64_dec + + .rva .LSEH_begin_aesni_ctr32_encrypt_blocks + .rva .LSEH_end_aesni_ctr32_encrypt_blocks + .rva .LSEH_info_ctr32 + +# .rva .LSEH_begin_aesni_xts_encrypt +# .rva .LSEH_end_aesni_xts_encrypt +# .rva .LSEH_info_xts_enc + +# .rva .LSEH_begin_aesni_xts_decrypt +# .rva .LSEH_end_aesni_xts_decrypt +# .rva .LSEH_info_xts_dec + +# .rva .LSEH_begin_aesni_ocb_encrypt +# .rva .LSEH_end_aesni_ocb_encrypt +# .rva .LSEH_info_ocb_enc + +# .rva .LSEH_begin_aesni_ocb_decrypt +# .rva .LSEH_end_aesni_ocb_decrypt +# .rva .LSEH_info_ocb_dec +___ +$code.=<<___; +# .rva .LSEH_begin_${PREFIX}_cbc_encrypt +# .rva .LSEH_end_${PREFIX}_cbc_encrypt +# .rva .LSEH_info_cbc + + .rva ${PREFIX}_set_decrypt_key + .rva .LSEH_end_set_decrypt_key + .rva .LSEH_info_key + + .rva ${PREFIX}_set_encrypt_key + .rva .LSEH_end_set_encrypt_key + .rva .LSEH_info_key +.section .xdata +.align 8 +___ +$code.=<<___ if ($PREFIX eq "aesni"); +#.LSEH_info_ecb: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lecb_enc_body,.Lecb_enc_ret # HandlerData[] +#.LSEH_info_ccm64_enc: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lccm64_enc_body,.Lccm64_enc_ret # HandlerData[] +#.LSEH_info_ccm64_dec: +# .byte 9,0,0,0 +# .rva ecb_ccm64_se_handler +# .rva .Lccm64_dec_body,.Lccm64_dec_ret # HandlerData[] +.LSEH_info_ctr32: + .byte 9,0,0,0 + .rva ctr_xts_se_handler + .rva .Lctr32_body,.Lctr32_epilogue # HandlerData[] +#.LSEH_info_xts_enc: +# .byte 9,0,0,0 +# .rva ctr_xts_se_handler +# .rva .Lxts_enc_body,.Lxts_enc_epilogue # HandlerData[] +#.LSEH_info_xts_dec: +# .byte 9,0,0,0 +# .rva ctr_xts_se_handler +# .rva .Lxts_dec_body,.Lxts_dec_epilogue # HandlerData[] +#.LSEH_info_ocb_enc: +# .byte 9,0,0,0 +# .rva ocb_se_handler +# .rva .Locb_enc_body,.Locb_enc_epilogue # HandlerData[] +# .rva .Locb_enc_pop +# .long 0 +#.LSEH_info_ocb_dec: +# .byte 9,0,0,0 +# .rva ocb_se_handler +# .rva .Locb_dec_body,.Locb_dec_epilogue # HandlerData[] +# .rva .Locb_dec_pop +# .long 0 +___ +$code.=<<___; +#.LSEH_info_cbc: +# .byte 9,0,0,0 +# .rva cbc_se_handler +.LSEH_info_key: + .byte 0x01,0x04,0x01,0x00 + .byte 0x04,0x02,0x00,0x00 # sub rsp,8 +___ +} + +sub rex { + local *opcode=shift; + my ($dst,$src)=@_; + my $rex=0; + + $rex|=0x04 if($dst>=8); + $rex|=0x01 if($src>=8); + push @opcode,$rex|0x40 if($rex); +} + +sub aesni { + my $line=shift; + my @opcode=(0x66); + + if ($line=~/(aeskeygenassist)\s+\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + rex(\@opcode,$4,$3); + push 
@opcode,0x0f,0x3a,0xdf; + push @opcode,0xc0|($3&7)|(($4&7)<<3); # ModR/M + my $c=$2; + push @opcode,$c=~/^0/?oct($c):$c; + return ".byte\t".join(',',@opcode); + } + elsif ($line=~/(aes[a-z]+)\s+%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my %opcodelet = ( + "aesimc" => 0xdb, + "aesenc" => 0xdc, "aesenclast" => 0xdd, + "aesdec" => 0xde, "aesdeclast" => 0xdf + ); + return undef if (!defined($opcodelet{$1})); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x38,$opcodelet{$1}; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + return ".byte\t".join(',',@opcode); + } + elsif ($line=~/(aes[a-z]+)\s+([0x1-9a-fA-F]*)\(%rsp\),\s*%xmm([0-9]+)/) { + my %opcodelet = ( + "aesenc" => 0xdc, "aesenclast" => 0xdd, + "aesdec" => 0xde, "aesdeclast" => 0xdf + ); + return undef if (!defined($opcodelet{$1})); + my $off = $2; + push @opcode,0x44 if ($3>=8); + push @opcode,0x0f,0x38,$opcodelet{$1}; + push @opcode,0x44|(($3&7)<<3),0x24; # ModR/M + push @opcode,($off=~/^0/?oct($off):$off)&0xff; + return ".byte\t".join(',',@opcode); + } + return $line; +} + +sub movbe { + ".byte 0x0f,0x38,0xf1,0x44,0x24,".shift; +} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; +$code =~ s/\b(aes.*%xmm[0-9]+).*$/aesni($1)/gem; +#$code =~ s/\bmovbe\s+%eax/bswap %eax; mov %eax/gm; # debugging artefact +$code =~ s/\bmovbe\s+%eax,\s*([0-9]+)\(%rsp\)/movbe($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/aesni_gcm_x64_gas.s b/crypto/aesgcm/aesni_gcm_x64_gas.s new file mode 100644 index 0000000..993e81b --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_gas.s @@ -0,0 +1,831 @@ +.text + +.type _aesni_ctr32_ghash_6x,@function +.align 32 +_aesni_ctr32_ghash_6x: +.cfi_startproc + vmovdqu 32(%r11),%xmm2 + subq $6,%rdx + vpxor %xmm4,%xmm4,%xmm4 + vmovdqu 0-128(%rcx),%xmm15 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpaddb %xmm2,%xmm11,%xmm12 + vpaddb %xmm2,%xmm12,%xmm13 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm15,%xmm1,%xmm9 + vmovdqu %xmm4,16+8(%rsp) + jmp .Loop6x + +.align 32 +.Loop6x: + addl $100663296,%ebx + jc .Lhandle_ctr32 + vmovdqu 0-32(%r9),%xmm3 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm15,%xmm10,%xmm10 + vpxor %xmm15,%xmm11,%xmm11 + +.Lresume_ctr32: + vmovdqu %xmm1,(%r8) + vpclmulqdq $0x10,%xmm3,%xmm7,%xmm5 + vpxor %xmm15,%xmm12,%xmm12 + vmovups 16-128(%rcx),%xmm2 + vpclmulqdq $0x01,%xmm3,%xmm7,%xmm6 + + + + + + + + + + + + + + + + + + xorq %r12,%r12 + cmpq %r14,%r15 + + vaesenc %xmm2,%xmm9,%xmm9 + vmovdqu 48+8(%rsp),%xmm0 + vpxor %xmm15,%xmm13,%xmm13 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm1 + vaesenc %xmm2,%xmm10,%xmm10 + vpxor %xmm15,%xmm14,%xmm14 + setnc %r12b + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vaesenc %xmm2,%xmm11,%xmm11 + vmovdqu 16-32(%r9),%xmm3 + negq %r12 + vaesenc %xmm2,%xmm12,%xmm12 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm3,%xmm0,%xmm5 + vpxor %xmm4,%xmm8,%xmm8 + vaesenc %xmm2,%xmm13,%xmm13 + vpxor %xmm5,%xmm1,%xmm4 + andq $0x60,%r12 + vmovups 32-128(%rcx),%xmm15 + vpclmulqdq $0x10,%xmm3,%xmm0,%xmm1 + vaesenc %xmm2,%xmm14,%xmm14 + + vpclmulqdq $0x01,%xmm3,%xmm0,%xmm2 + leaq (%r14,%r12,1),%r14 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x11,%xmm3,%xmm0,%xmm3 + vmovdqu 64+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 88(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 80(%r14),%r12 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,32+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,40+8(%rsp) + vmovdqu 48-32(%r9),%xmm5 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 48-128(%rcx),%xmm15 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm5,%xmm0,%xmm1 + vaesenc 
%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm5,%xmm0,%xmm2 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm3,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm5,%xmm0,%xmm3 + vaesenc %xmm15,%xmm11,%xmm11 + vpclmulqdq $0x11,%xmm5,%xmm0,%xmm5 + vmovdqu 80+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor %xmm1,%xmm4,%xmm4 + vmovdqu 64-32(%r9),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 64-128(%rcx),%xmm15 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm1,%xmm0,%xmm2 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm1,%xmm0,%xmm3 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 72(%r14),%r13 + vpxor %xmm5,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm1,%xmm0,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 64(%r14),%r12 + vpclmulqdq $0x11,%xmm1,%xmm0,%xmm1 + vmovdqu 96+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,48+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,56+8(%rsp) + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 96-32(%r9),%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 80-128(%rcx),%xmm15 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm2,%xmm0,%xmm5 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 56(%r14),%r13 + vpxor %xmm1,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm2,%xmm0,%xmm1 + vpxor 112+8(%rsp),%xmm8,%xmm8 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 48(%r14),%r12 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm2 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,64+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,72+8(%rsp) + vpxor %xmm3,%xmm4,%xmm4 + vmovdqu 112-32(%r9),%xmm3 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 96-128(%rcx),%xmm15 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm5 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x01,%xmm3,%xmm8,%xmm1 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 40(%r14),%r13 + vpxor %xmm2,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm3,%xmm8,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 32(%r14),%r12 + vpclmulqdq $0x11,%xmm3,%xmm8,%xmm8 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,80+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,88+8(%rsp) + vpxor %xmm5,%xmm6,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor %xmm1,%xmm6,%xmm6 + + vmovups 112-128(%rcx),%xmm15 + vpslldq $8,%xmm6,%xmm5 + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 16(%r11),%xmm3 + + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm8,%xmm7,%xmm7 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm5,%xmm4,%xmm4 + movbeq 24(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 16(%r14),%r12 + vpalignr $8,%xmm4,%xmm4,%xmm0 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + movq %r13,96+8(%rsp) + vaesenc %xmm15,%xmm12,%xmm12 + movq %r12,104+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + vmovups 128-128(%rcx),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vmovups 144-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm10,%xmm10 + vpsrldq $8,%xmm6,%xmm6 + vaesenc %xmm1,%xmm11,%xmm11 + vpxor %xmm6,%xmm7,%xmm7 + vaesenc %xmm1,%xmm12,%xmm12 + vpxor %xmm0,%xmm4,%xmm4 + movbeq 8(%r14),%r13 + vaesenc %xmm1,%xmm13,%xmm13 + movbeq 0(%r14),%r12 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 160-128(%rcx),%xmm1 + cmpl $11,%ebp + jb .Lenc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 176-128(%rcx),%xmm15 + vaesenc 
%xmm1,%xmm14,%xmm14 + vmovups 192-128(%rcx),%xmm1 + je .Lenc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 208-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 224-128(%rcx),%xmm1 + jmp .Lenc_tail + +.align 32 +.Lhandle_ctr32: + vmovdqu (%r11),%xmm0 + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vmovdqu 0-32(%r9),%xmm3 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm15,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm15,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpshufb %xmm0,%xmm14,%xmm14 + vpshufb %xmm0,%xmm1,%xmm1 + jmp .Lresume_ctr32 + +.align 32 +.Lenc_tail: + vaesenc %xmm15,%xmm9,%xmm9 + vmovdqu %xmm7,16+8(%rsp) + vpalignr $8,%xmm4,%xmm4,%xmm8 + vaesenc %xmm15,%xmm10,%xmm10 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + vpxor 0(%rdi),%xmm1,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 16(%rdi),%xmm1,%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 32(%rdi),%xmm1,%xmm5 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 48(%rdi),%xmm1,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 64(%rdi),%xmm1,%xmm7 + vpxor 80(%rdi),%xmm1,%xmm3 + vmovdqu (%r8),%xmm1 + + vaesenclast %xmm2,%xmm9,%xmm9 + vmovdqu 32(%r11),%xmm2 + vaesenclast %xmm0,%xmm10,%xmm10 + vpaddb %xmm2,%xmm1,%xmm0 + movq %r13,112+8(%rsp) + leaq 96(%rdi),%rdi + vaesenclast %xmm5,%xmm11,%xmm11 + vpaddb %xmm2,%xmm0,%xmm5 + movq %r12,120+8(%rsp) + leaq 96(%rsi),%rsi + vmovdqu 0-128(%rcx),%xmm15 + vaesenclast %xmm6,%xmm12,%xmm12 + vpaddb %xmm2,%xmm5,%xmm6 + vaesenclast %xmm7,%xmm13,%xmm13 + vpaddb %xmm2,%xmm6,%xmm7 + vaesenclast %xmm3,%xmm14,%xmm14 + vpaddb %xmm2,%xmm7,%xmm3 + + addq $0x60,%r10 + subq $0x6,%rdx + jc .L6x_done + + vmovups %xmm9,-96(%rsi) + vpxor %xmm15,%xmm1,%xmm9 + vmovups %xmm10,-80(%rsi) + vmovdqa %xmm0,%xmm10 + vmovups %xmm11,-64(%rsi) + vmovdqa %xmm5,%xmm11 + vmovups %xmm12,-48(%rsi) + vmovdqa %xmm6,%xmm12 + vmovups %xmm13,-32(%rsi) + vmovdqa %xmm7,%xmm13 + vmovups %xmm14,-16(%rsi) + vmovdqa %xmm3,%xmm14 + vmovdqu 32+8(%rsp),%xmm7 + jmp .Loop6x + +.L6x_done: + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpxor %xmm4,%xmm8,%xmm8 + + ret +.cfi_endproc +.size _aesni_ctr32_ghash_6x,.-_aesni_ctr32_ghash_6x +.globl aesni_gcm_decrypt +.type aesni_gcm_decrypt,@function +.align 32 +aesni_gcm_decrypt: +.cfi_startproc + xorq %r10,%r10 + + + + cmpq $0x60,%rdx + jb .Lgcm_dec_abort + + leaq (%rsp),%rax +.cfi_def_cfa_register %rax + pushq %rbx +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_offset %r15,-56 + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq .Lbswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + vmovdqu 16(%r8),%xmm8 + andq $-128,%rsp + vmovdqu (%r11),%xmm0 + leaq 128(%rcx),%rcx + leaq 16+32(%r9),%r9 + movl 240-128(%rcx),%ebp + vpshufb %xmm0,%xmm8,%xmm8 + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc .Ldec_no_key_aliasing + cmpq $768,%r15 + jnc .Ldec_no_key_aliasing + subq %r15,%rsp +.Ldec_no_key_aliasing: + + vmovdqu 80(%rdi),%xmm7 + leaq (%rdi),%r14 + 
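+# (%xmm7 above and the vmovdqu/vpshufb pairs below stage the first six
+# 16-byte ciphertext blocks in hash byte order, five of them parked at
+# 48..112(%rsp) and one kept in %xmm7, so that .Loop6x can GHASH this
+# group while it runs the AES-CTR rounds for the next one)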
vmovdqu 64(%rdi),%xmm4 + + + + + + + + leaq -192(%rdi,%rdx,1),%r15 + + vmovdqu 48(%rdi),%xmm5 + shrq $4,%rdx + xorq %r10,%r10 + vmovdqu 32(%rdi),%xmm6 + vpshufb %xmm0,%xmm7,%xmm7 + vmovdqu 16(%rdi),%xmm2 + vpshufb %xmm0,%xmm4,%xmm4 + vmovdqu (%rdi),%xmm3 + vpshufb %xmm0,%xmm5,%xmm5 + vmovdqu %xmm4,48(%rsp) + vpshufb %xmm0,%xmm6,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm2,%xmm2 + vmovdqu %xmm6,80(%rsp) + vpshufb %xmm0,%xmm3,%xmm3 + vmovdqu %xmm2,96(%rsp) + vmovdqu %xmm3,112(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups %xmm9,-96(%rsi) + vmovups %xmm10,-80(%rsi) + vmovups %xmm11,-64(%rsi) + vmovups %xmm12,-48(%rsi) + vmovups %xmm13,-32(%rsi) + vmovups %xmm14,-16(%rsi) + + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 +.cfi_restore %r15 + movq -40(%rax),%r14 +.cfi_restore %r14 + movq -32(%rax),%r13 +.cfi_restore %r13 + movq -24(%rax),%r12 +.cfi_restore %r12 + movq -16(%rax),%rbp +.cfi_restore %rbp + movq -8(%rax),%rbx +.cfi_restore %rbx + leaq (%rax),%rsp +.cfi_def_cfa_register %rsp +.Lgcm_dec_abort: + movq %r10,%rax + ret +.cfi_endproc +.size aesni_gcm_decrypt,.-aesni_gcm_decrypt +.type _aesni_ctr32_6x,@function +.align 32 +_aesni_ctr32_6x: +.cfi_startproc + vmovdqu 0-128(%rcx),%xmm4 + vmovdqu 32(%r11),%xmm2 + leaq -1(%rbp),%r13 + vmovups 16-128(%rcx),%xmm15 + leaq 32-128(%rcx),%r12 + vpxor %xmm4,%xmm1,%xmm9 + addl $100663296,%ebx + jc .Lhandle_ctr32_2 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddb %xmm2,%xmm11,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddb %xmm2,%xmm12,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp .Loop_ctr32 + +.align 16 +.Loop_ctr32: + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + vmovups (%r12),%xmm15 + leaq 16(%r12),%r12 + decl %r13d + jnz .Loop_ctr32 + + vmovdqu (%r12),%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 0(%rdi),%xmm3,%xmm4 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor 16(%rdi),%xmm3,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 32(%rdi),%xmm3,%xmm6 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 48(%rdi),%xmm3,%xmm8 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 64(%rdi),%xmm3,%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 80(%rdi),%xmm3,%xmm3 + leaq 96(%rdi),%rdi + + vaesenclast %xmm4,%xmm9,%xmm9 + vaesenclast %xmm5,%xmm10,%xmm10 + vaesenclast %xmm6,%xmm11,%xmm11 + vaesenclast %xmm8,%xmm12,%xmm12 + vaesenclast %xmm2,%xmm13,%xmm13 + vaesenclast %xmm3,%xmm14,%xmm14 + vmovups %xmm9,0(%rsi) + vmovups %xmm10,16(%rsi) + vmovups %xmm11,32(%rsi) + vmovups %xmm12,48(%rsi) + vmovups %xmm13,64(%rsi) + vmovups %xmm14,80(%rsi) + leaq 96(%rsi),%rsi + + ret +.align 32 +.Lhandle_ctr32_2: + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpshufb %xmm0,%xmm14,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpshufb %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp .Loop_ctr32 +.cfi_endproc +.size _aesni_ctr32_6x,.-_aesni_ctr32_6x + +.globl aesni_gcm_encrypt +.type aesni_gcm_encrypt,@function +.align 
32 +aesni_gcm_encrypt: +.cfi_startproc + xorq %r10,%r10 + + + + + cmpq $288,%rdx + jb .Lgcm_enc_abort + + leaq (%rsp),%rax +.cfi_def_cfa_register %rax + pushq %rbx +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_offset %r15,-56 + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq .Lbswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + leaq 128(%rcx),%rcx + vmovdqu (%r11),%xmm0 + andq $-128,%rsp + movl 240-128(%rcx),%ebp + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc .Lenc_no_key_aliasing + cmpq $768,%r15 + jnc .Lenc_no_key_aliasing + subq %r15,%rsp +.Lenc_no_key_aliasing: + + leaq (%rsi),%r14 + + + + + + + + + leaq -192(%rsi,%rdx,1),%r15 + + shrq $4,%rdx + + call _aesni_ctr32_6x + + vpshufb %xmm0,%xmm9,%xmm8 + vpshufb %xmm0,%xmm10,%xmm2 + vmovdqu %xmm8,112(%rsp) + vpshufb %xmm0,%xmm11,%xmm4 + vmovdqu %xmm2,96(%rsp) + vpshufb %xmm0,%xmm12,%xmm5 + vmovdqu %xmm4,80(%rsp) + vpshufb %xmm0,%xmm13,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm14,%xmm7 + vmovdqu %xmm6,48(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 16(%r8),%xmm8 + leaq 16+32(%r9),%r9 + subq $12,%rdx + movq $192,%r10 + vpshufb %xmm0,%xmm8,%xmm8 + + call _aesni_ctr32_ghash_6x + vmovdqu 32(%rsp),%xmm7 + vmovdqu (%r11),%xmm0 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm7,%xmm7,%xmm1 + vmovdqu 32-32(%r9),%xmm15 + vmovups %xmm9,-96(%rsi) + vpshufb %xmm0,%xmm9,%xmm9 + vpxor %xmm7,%xmm1,%xmm1 + vmovups %xmm10,-80(%rsi) + vpshufb %xmm0,%xmm10,%xmm10 + vmovups %xmm11,-64(%rsi) + vpshufb %xmm0,%xmm11,%xmm11 + vmovups %xmm12,-48(%rsi) + vpshufb %xmm0,%xmm12,%xmm12 + vmovups %xmm13,-32(%rsi) + vpshufb %xmm0,%xmm13,%xmm13 + vmovups %xmm14,-16(%rsi) + vpshufb %xmm0,%xmm14,%xmm14 + vmovdqu %xmm9,16(%rsp) + vmovdqu 48(%rsp),%xmm6 + vmovdqu 16-32(%r9),%xmm0 + vpunpckhqdq %xmm6,%xmm6,%xmm2 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm5 + vpxor %xmm6,%xmm2,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + + vmovdqu 64(%rsp),%xmm9 + vpclmulqdq $0x00,%xmm0,%xmm6,%xmm4 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm9,%xmm9,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm6,%xmm6 + vpxor %xmm9,%xmm5,%xmm5 + vpxor %xmm7,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vmovdqu 80(%rsp),%xmm1 + vpclmulqdq $0x00,%xmm3,%xmm9,%xmm7 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm4,%xmm7,%xmm7 + vpunpckhqdq %xmm1,%xmm1,%xmm4 + vpclmulqdq $0x11,%xmm3,%xmm9,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpxor %xmm6,%xmm9,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm5,%xmm5 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 96(%rsp),%xmm2 + vpclmulqdq $0x00,%xmm0,%xmm1,%xmm6 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm7,%xmm6,%xmm6 + vpunpckhqdq %xmm2,%xmm2,%xmm7 + vpclmulqdq $0x11,%xmm0,%xmm1,%xmm1 + vpxor %xmm2,%xmm7,%xmm7 + vpxor %xmm9,%xmm1,%xmm1 + vpclmulqdq $0x10,%xmm15,%xmm4,%xmm4 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm5,%xmm4,%xmm4 + + vpxor 112(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm3,%xmm2,%xmm5 + vmovdqu 112-32(%r9),%xmm0 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpxor %xmm6,%xmm5,%xmm5 + vpclmulqdq $0x11,%xmm3,%xmm2,%xmm2 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm1,%xmm2,%xmm2 + vpclmulqdq $0x00,%xmm15,%xmm7,%xmm7 + vpxor %xmm4,%xmm7,%xmm4 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm6 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm1 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm8 + vpxor %xmm14,%xmm1,%xmm1 + vpxor %xmm5,%xmm6,%xmm5 + 
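+# (tail GHASH: each of the final blocks is multiplied by the matching
+# power of H kept relative to (%r9), Karatsuba style with three
+# vpclmulqdq per block; the modular reduction is deferred to the end)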
vpclmulqdq $0x10,%xmm15,%xmm9,%xmm9 + vmovdqu 32-32(%r9),%xmm15 + vpxor %xmm2,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm6 + + vmovdqu 16-32(%r9),%xmm0 + vpxor %xmm5,%xmm7,%xmm9 + vpclmulqdq $0x00,%xmm3,%xmm14,%xmm4 + vpxor %xmm9,%xmm6,%xmm6 + vpunpckhqdq %xmm13,%xmm13,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm14,%xmm14 + vpxor %xmm13,%xmm2,%xmm2 + vpslldq $8,%xmm6,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + vpxor %xmm9,%xmm5,%xmm8 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm6,%xmm7,%xmm7 + + vpclmulqdq $0x00,%xmm0,%xmm13,%xmm5 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm12,%xmm12,%xmm9 + vpclmulqdq $0x11,%xmm0,%xmm13,%xmm13 + vpxor %xmm12,%xmm9,%xmm9 + vpxor %xmm14,%xmm13,%xmm13 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm3,%xmm12,%xmm4 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm11,%xmm11,%xmm1 + vpclmulqdq $0x11,%xmm3,%xmm12,%xmm12 + vpxor %xmm11,%xmm1,%xmm1 + vpxor %xmm13,%xmm12,%xmm12 + vxorps 16(%rsp),%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm9,%xmm9 + + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm0,%xmm11,%xmm5 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm10,%xmm10,%xmm2 + vpclmulqdq $0x11,%xmm0,%xmm11,%xmm11 + vpxor %xmm10,%xmm2,%xmm2 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpxor %xmm12,%xmm11,%xmm11 + vpclmulqdq $0x10,%xmm15,%xmm1,%xmm1 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm9,%xmm1,%xmm1 + + vxorps %xmm7,%xmm14,%xmm14 + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm3,%xmm10,%xmm4 + vmovdqu 112-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpclmulqdq $0x11,%xmm3,%xmm10,%xmm10 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm11,%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm15,%xmm2,%xmm2 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm7 + vpxor %xmm4,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm6 + vpxor %xmm10,%xmm7,%xmm7 + vpxor %xmm2,%xmm6,%xmm6 + + vpxor %xmm5,%xmm7,%xmm4 + vpxor %xmm4,%xmm6,%xmm6 + vpslldq $8,%xmm6,%xmm1 + vmovdqu 16(%r11),%xmm3 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm1,%xmm5,%xmm8 + vpxor %xmm6,%xmm7,%xmm7 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm2,%xmm8,%xmm8 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm7,%xmm2,%xmm2 + vpxor %xmm2,%xmm8,%xmm8 + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 +.cfi_restore %r15 + movq -40(%rax),%r14 +.cfi_restore %r14 + movq -32(%rax),%r13 +.cfi_restore %r13 + movq -24(%rax),%r12 +.cfi_restore %r12 + movq -16(%rax),%rbp +.cfi_restore %rbp + movq -8(%rax),%rbx +.cfi_restore %rbx + leaq (%rax),%rsp +.cfi_def_cfa_register %rsp +.Lgcm_enc_abort: + movq %r10,%rax + ret +.cfi_endproc +.size aesni_gcm_encrypt,.-aesni_gcm_encrypt +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lpoly: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.Lone_msb: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Ltwo_lsb: +.byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.Lone_lsb: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.byte 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.align 64 diff --git 
a/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s b/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s new file mode 100644 index 0000000..184f239 --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s @@ -0,0 +1,831 @@ +.text + + +.p2align 5 +_aesni_ctr32_ghash_6x: + + vmovdqu 32(%r11),%xmm2 + subq $6,%rdx + vpxor %xmm4,%xmm4,%xmm4 + vmovdqu 0-128(%rcx),%xmm15 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpaddb %xmm2,%xmm11,%xmm12 + vpaddb %xmm2,%xmm12,%xmm13 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm15,%xmm1,%xmm9 + vmovdqu %xmm4,16+8(%rsp) + jmp L$oop6x + +.p2align 5 +L$oop6x: + addl $100663296,%ebx + jc L$handle_ctr32 + vmovdqu 0-32(%r9),%xmm3 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm15,%xmm10,%xmm10 + vpxor %xmm15,%xmm11,%xmm11 + +L$resume_ctr32: + vmovdqu %xmm1,(%r8) + vpclmulqdq $0x10,%xmm3,%xmm7,%xmm5 + vpxor %xmm15,%xmm12,%xmm12 + vmovups 16-128(%rcx),%xmm2 + vpclmulqdq $0x01,%xmm3,%xmm7,%xmm6 + + + + + + + + + + + + + + + + + + xorq %r12,%r12 + cmpq %r14,%r15 + + vaesenc %xmm2,%xmm9,%xmm9 + vmovdqu 48+8(%rsp),%xmm0 + vpxor %xmm15,%xmm13,%xmm13 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm1 + vaesenc %xmm2,%xmm10,%xmm10 + vpxor %xmm15,%xmm14,%xmm14 + setnc %r12b + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vaesenc %xmm2,%xmm11,%xmm11 + vmovdqu 16-32(%r9),%xmm3 + negq %r12 + vaesenc %xmm2,%xmm12,%xmm12 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm3,%xmm0,%xmm5 + vpxor %xmm4,%xmm8,%xmm8 + vaesenc %xmm2,%xmm13,%xmm13 + vpxor %xmm5,%xmm1,%xmm4 + andq $0x60,%r12 + vmovups 32-128(%rcx),%xmm15 + vpclmulqdq $0x10,%xmm3,%xmm0,%xmm1 + vaesenc %xmm2,%xmm14,%xmm14 + + vpclmulqdq $0x01,%xmm3,%xmm0,%xmm2 + leaq (%r14,%r12,1),%r14 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x11,%xmm3,%xmm0,%xmm3 + vmovdqu 64+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 88(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 80(%r14),%r12 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,32+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,40+8(%rsp) + vmovdqu 48-32(%r9),%xmm5 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 48-128(%rcx),%xmm15 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm5,%xmm0,%xmm1 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm5,%xmm0,%xmm2 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm3,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm5,%xmm0,%xmm3 + vaesenc %xmm15,%xmm11,%xmm11 + vpclmulqdq $0x11,%xmm5,%xmm0,%xmm5 + vmovdqu 80+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor %xmm1,%xmm4,%xmm4 + vmovdqu 64-32(%r9),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 64-128(%rcx),%xmm15 + vpxor %xmm2,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm1,%xmm0,%xmm2 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm1,%xmm0,%xmm3 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 72(%r14),%r13 + vpxor %xmm5,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm1,%xmm0,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 64(%r14),%r12 + vpclmulqdq $0x11,%xmm1,%xmm0,%xmm1 + vmovdqu 96+8(%rsp),%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,48+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,56+8(%rsp) + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 96-32(%r9),%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 80-128(%rcx),%xmm15 + vpxor %xmm3,%xmm6,%xmm6 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm2,%xmm0,%xmm5 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 56(%r14),%r13 + vpxor %xmm1,%xmm7,%xmm7 + vpclmulqdq $0x01,%xmm2,%xmm0,%xmm1 + vpxor 112+8(%rsp),%xmm8,%xmm8 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 
48(%r14),%r12 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm2 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,64+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,72+8(%rsp) + vpxor %xmm3,%xmm4,%xmm4 + vmovdqu 112-32(%r9),%xmm3 + vaesenc %xmm15,%xmm14,%xmm14 + + vmovups 96-128(%rcx),%xmm15 + vpxor %xmm5,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm5 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm1,%xmm6,%xmm6 + vpclmulqdq $0x01,%xmm3,%xmm8,%xmm1 + vaesenc %xmm15,%xmm10,%xmm10 + movbeq 40(%r14),%r13 + vpxor %xmm2,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm3,%xmm8,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 32(%r14),%r12 + vpclmulqdq $0x11,%xmm3,%xmm8,%xmm8 + vaesenc %xmm15,%xmm12,%xmm12 + movq %r13,80+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + movq %r12,88+8(%rsp) + vpxor %xmm5,%xmm6,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor %xmm1,%xmm6,%xmm6 + + vmovups 112-128(%rcx),%xmm15 + vpslldq $8,%xmm6,%xmm5 + vpxor %xmm2,%xmm4,%xmm4 + vmovdqu 16(%r11),%xmm3 + + vaesenc %xmm15,%xmm9,%xmm9 + vpxor %xmm8,%xmm7,%xmm7 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor %xmm5,%xmm4,%xmm4 + movbeq 24(%r14),%r13 + vaesenc %xmm15,%xmm11,%xmm11 + movbeq 16(%r14),%r12 + vpalignr $8,%xmm4,%xmm4,%xmm0 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + movq %r13,96+8(%rsp) + vaesenc %xmm15,%xmm12,%xmm12 + movq %r12,104+8(%rsp) + vaesenc %xmm15,%xmm13,%xmm13 + vmovups 128-128(%rcx),%xmm1 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vmovups 144-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm10,%xmm10 + vpsrldq $8,%xmm6,%xmm6 + vaesenc %xmm1,%xmm11,%xmm11 + vpxor %xmm6,%xmm7,%xmm7 + vaesenc %xmm1,%xmm12,%xmm12 + vpxor %xmm0,%xmm4,%xmm4 + movbeq 8(%r14),%r13 + vaesenc %xmm1,%xmm13,%xmm13 + movbeq 0(%r14),%r12 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 160-128(%rcx),%xmm1 + cmpl $11,%ebp + jb L$enc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 176-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 192-128(%rcx),%xmm1 + je L$enc_tail + + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc %xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + + vaesenc %xmm1,%xmm9,%xmm9 + vaesenc %xmm1,%xmm10,%xmm10 + vaesenc %xmm1,%xmm11,%xmm11 + vaesenc %xmm1,%xmm12,%xmm12 + vaesenc %xmm1,%xmm13,%xmm13 + vmovups 208-128(%rcx),%xmm15 + vaesenc %xmm1,%xmm14,%xmm14 + vmovups 224-128(%rcx),%xmm1 + jmp L$enc_tail + +.p2align 5 +L$handle_ctr32: + vmovdqu (%r11),%xmm0 + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vmovdqu 0-32(%r9),%xmm3 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm15,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm15,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpshufb %xmm0,%xmm14,%xmm14 + vpshufb %xmm0,%xmm1,%xmm1 + jmp L$resume_ctr32 + +.p2align 5 +L$enc_tail: + vaesenc %xmm15,%xmm9,%xmm9 + vmovdqu %xmm7,16+8(%rsp) + vpalignr $8,%xmm4,%xmm4,%xmm8 + vaesenc %xmm15,%xmm10,%xmm10 + vpclmulqdq $0x10,%xmm3,%xmm4,%xmm4 + vpxor 0(%rdi),%xmm1,%xmm2 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 16(%rdi),%xmm1,%xmm0 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 32(%rdi),%xmm1,%xmm5 + vaesenc 
%xmm15,%xmm13,%xmm13 + vpxor 48(%rdi),%xmm1,%xmm6 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 64(%rdi),%xmm1,%xmm7 + vpxor 80(%rdi),%xmm1,%xmm3 + vmovdqu (%r8),%xmm1 + + vaesenclast %xmm2,%xmm9,%xmm9 + vmovdqu 32(%r11),%xmm2 + vaesenclast %xmm0,%xmm10,%xmm10 + vpaddb %xmm2,%xmm1,%xmm0 + movq %r13,112+8(%rsp) + leaq 96(%rdi),%rdi + vaesenclast %xmm5,%xmm11,%xmm11 + vpaddb %xmm2,%xmm0,%xmm5 + movq %r12,120+8(%rsp) + leaq 96(%rsi),%rsi + vmovdqu 0-128(%rcx),%xmm15 + vaesenclast %xmm6,%xmm12,%xmm12 + vpaddb %xmm2,%xmm5,%xmm6 + vaesenclast %xmm7,%xmm13,%xmm13 + vpaddb %xmm2,%xmm6,%xmm7 + vaesenclast %xmm3,%xmm14,%xmm14 + vpaddb %xmm2,%xmm7,%xmm3 + + addq $0x60,%r10 + subq $0x6,%rdx + jc L$6x_done + + vmovups %xmm9,-96(%rsi) + vpxor %xmm15,%xmm1,%xmm9 + vmovups %xmm10,-80(%rsi) + vmovdqa %xmm0,%xmm10 + vmovups %xmm11,-64(%rsi) + vmovdqa %xmm5,%xmm11 + vmovups %xmm12,-48(%rsi) + vmovdqa %xmm6,%xmm12 + vmovups %xmm13,-32(%rsi) + vmovdqa %xmm7,%xmm13 + vmovups %xmm14,-16(%rsi) + vmovdqa %xmm3,%xmm14 + vmovdqu 32+8(%rsp),%xmm7 + jmp L$oop6x + +L$6x_done: + vpxor 16+8(%rsp),%xmm8,%xmm8 + vpxor %xmm4,%xmm8,%xmm8 + + ret + + +.globl _aesni_gcm_decrypt + +.p2align 5 +_aesni_gcm_decrypt: + + xorq %r10,%r10 + + + + cmpq $0x60,%rdx + jb L$gcm_dec_abort + + leaq (%rsp),%rax + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq L$bswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + vmovdqu 16(%r8),%xmm8 + andq $-128,%rsp + vmovdqu (%r11),%xmm0 + leaq 128(%rcx),%rcx + leaq 16+32(%r9),%r9 + movl 240-128(%rcx),%ebp + vpshufb %xmm0,%xmm8,%xmm8 + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc L$dec_no_key_aliasing + cmpq $768,%r15 + jnc L$dec_no_key_aliasing + subq %r15,%rsp +L$dec_no_key_aliasing: + + vmovdqu 80(%rdi),%xmm7 + leaq (%rdi),%r14 + vmovdqu 64(%rdi),%xmm4 + + + + + + + + leaq -192(%rdi,%rdx,1),%r15 + + vmovdqu 48(%rdi),%xmm5 + shrq $4,%rdx + xorq %r10,%r10 + vmovdqu 32(%rdi),%xmm6 + vpshufb %xmm0,%xmm7,%xmm7 + vmovdqu 16(%rdi),%xmm2 + vpshufb %xmm0,%xmm4,%xmm4 + vmovdqu (%rdi),%xmm3 + vpshufb %xmm0,%xmm5,%xmm5 + vmovdqu %xmm4,48(%rsp) + vpshufb %xmm0,%xmm6,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm2,%xmm2 + vmovdqu %xmm6,80(%rsp) + vpshufb %xmm0,%xmm3,%xmm3 + vmovdqu %xmm2,96(%rsp) + vmovdqu %xmm3,112(%rsp) + + call _aesni_ctr32_ghash_6x + + vmovups %xmm9,-96(%rsi) + vmovups %xmm10,-80(%rsi) + vmovups %xmm11,-64(%rsi) + vmovups %xmm12,-48(%rsi) + vmovups %xmm13,-32(%rsi) + vmovups %xmm14,-16(%rsi) + + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 + + movq -40(%rax),%r14 + + movq -32(%rax),%r13 + + movq -24(%rax),%r12 + + movq -16(%rax),%rbp + + movq -8(%rax),%rbx + + leaq (%rax),%rsp + +L$gcm_dec_abort: + movq %r10,%rax + ret + + + +.p2align 5 +_aesni_ctr32_6x: + + vmovdqu 0-128(%rcx),%xmm4 + vmovdqu 32(%r11),%xmm2 + leaq -1(%rbp),%r13 + vmovups 16-128(%rcx),%xmm15 + leaq 32-128(%rcx),%r12 + vpxor %xmm4,%xmm1,%xmm9 + addl $100663296,%ebx + jc L$handle_ctr32_2 + vpaddb %xmm2,%xmm1,%xmm10 + vpaddb %xmm2,%xmm10,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddb %xmm2,%xmm11,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddb %xmm2,%xmm12,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpaddb %xmm2,%xmm13,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpaddb %xmm2,%xmm14,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp L$oop_ctr32 + +.p2align 4 +L$oop_ctr32: + vaesenc %xmm15,%xmm9,%xmm9 + vaesenc %xmm15,%xmm10,%xmm10 + vaesenc %xmm15,%xmm11,%xmm11 + vaesenc 
%xmm15,%xmm12,%xmm12 + vaesenc %xmm15,%xmm13,%xmm13 + vaesenc %xmm15,%xmm14,%xmm14 + vmovups (%r12),%xmm15 + leaq 16(%r12),%r12 + decl %r13d + jnz L$oop_ctr32 + + vmovdqu (%r12),%xmm3 + vaesenc %xmm15,%xmm9,%xmm9 + vpxor 0(%rdi),%xmm3,%xmm4 + vaesenc %xmm15,%xmm10,%xmm10 + vpxor 16(%rdi),%xmm3,%xmm5 + vaesenc %xmm15,%xmm11,%xmm11 + vpxor 32(%rdi),%xmm3,%xmm6 + vaesenc %xmm15,%xmm12,%xmm12 + vpxor 48(%rdi),%xmm3,%xmm8 + vaesenc %xmm15,%xmm13,%xmm13 + vpxor 64(%rdi),%xmm3,%xmm2 + vaesenc %xmm15,%xmm14,%xmm14 + vpxor 80(%rdi),%xmm3,%xmm3 + leaq 96(%rdi),%rdi + + vaesenclast %xmm4,%xmm9,%xmm9 + vaesenclast %xmm5,%xmm10,%xmm10 + vaesenclast %xmm6,%xmm11,%xmm11 + vaesenclast %xmm8,%xmm12,%xmm12 + vaesenclast %xmm2,%xmm13,%xmm13 + vaesenclast %xmm3,%xmm14,%xmm14 + vmovups %xmm9,0(%rsi) + vmovups %xmm10,16(%rsi) + vmovups %xmm11,32(%rsi) + vmovups %xmm12,48(%rsi) + vmovups %xmm13,64(%rsi) + vmovups %xmm14,80(%rsi) + leaq 96(%rsi),%rsi + + ret +.p2align 5 +L$handle_ctr32_2: + vpshufb %xmm0,%xmm1,%xmm6 + vmovdqu 48(%r11),%xmm5 + vpaddd 64(%r11),%xmm6,%xmm10 + vpaddd %xmm5,%xmm6,%xmm11 + vpaddd %xmm5,%xmm10,%xmm12 + vpshufb %xmm0,%xmm10,%xmm10 + vpaddd %xmm5,%xmm11,%xmm13 + vpshufb %xmm0,%xmm11,%xmm11 + vpxor %xmm4,%xmm10,%xmm10 + vpaddd %xmm5,%xmm12,%xmm14 + vpshufb %xmm0,%xmm12,%xmm12 + vpxor %xmm4,%xmm11,%xmm11 + vpaddd %xmm5,%xmm13,%xmm1 + vpshufb %xmm0,%xmm13,%xmm13 + vpxor %xmm4,%xmm12,%xmm12 + vpshufb %xmm0,%xmm14,%xmm14 + vpxor %xmm4,%xmm13,%xmm13 + vpshufb %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm14,%xmm14 + jmp L$oop_ctr32 + + + +.globl _aesni_gcm_encrypt + +.p2align 5 +_aesni_gcm_encrypt: + + xorq %r10,%r10 + + + + + cmpq $288,%rdx + jb L$gcm_enc_abort + + leaq (%rsp),%rax + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + + vzeroupper + + vmovdqu (%r8),%xmm1 + addq $-128,%rsp + movl 12(%r8),%ebx + leaq L$bswap_mask(%rip),%r11 + leaq -128(%rcx),%r14 + movq $0xf80,%r15 + leaq 128(%rcx),%rcx + vmovdqu (%r11),%xmm0 + andq $-128,%rsp + movl 240-128(%rcx),%ebp + + andq %r15,%r14 + andq %rsp,%r15 + subq %r14,%r15 + jc L$enc_no_key_aliasing + cmpq $768,%r15 + jnc L$enc_no_key_aliasing + subq %r15,%rsp +L$enc_no_key_aliasing: + + leaq (%rsi),%r14 + + + + + + + + + leaq -192(%rsi,%rdx,1),%r15 + + shrq $4,%rdx + + call _aesni_ctr32_6x + + vpshufb %xmm0,%xmm9,%xmm8 + vpshufb %xmm0,%xmm10,%xmm2 + vmovdqu %xmm8,112(%rsp) + vpshufb %xmm0,%xmm11,%xmm4 + vmovdqu %xmm2,96(%rsp) + vpshufb %xmm0,%xmm12,%xmm5 + vmovdqu %xmm4,80(%rsp) + vpshufb %xmm0,%xmm13,%xmm6 + vmovdqu %xmm5,64(%rsp) + vpshufb %xmm0,%xmm14,%xmm7 + vmovdqu %xmm6,48(%rsp) + + call _aesni_ctr32_6x + + vmovdqu 16(%r8),%xmm8 + leaq 16+32(%r9),%r9 + subq $12,%rdx + movq $192,%r10 + vpshufb %xmm0,%xmm8,%xmm8 + + call _aesni_ctr32_ghash_6x + vmovdqu 32(%rsp),%xmm7 + vmovdqu (%r11),%xmm0 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm7,%xmm7,%xmm1 + vmovdqu 32-32(%r9),%xmm15 + vmovups %xmm9,-96(%rsi) + vpshufb %xmm0,%xmm9,%xmm9 + vpxor %xmm7,%xmm1,%xmm1 + vmovups %xmm10,-80(%rsi) + vpshufb %xmm0,%xmm10,%xmm10 + vmovups %xmm11,-64(%rsi) + vpshufb %xmm0,%xmm11,%xmm11 + vmovups %xmm12,-48(%rsi) + vpshufb %xmm0,%xmm12,%xmm12 + vmovups %xmm13,-32(%rsi) + vpshufb %xmm0,%xmm13,%xmm13 + vmovups %xmm14,-16(%rsi) + vpshufb %xmm0,%xmm14,%xmm14 + vmovdqu %xmm9,16(%rsp) + vmovdqu 48(%rsp),%xmm6 + vmovdqu 16-32(%r9),%xmm0 + vpunpckhqdq %xmm6,%xmm6,%xmm2 + vpclmulqdq $0x00,%xmm3,%xmm7,%xmm5 + vpxor %xmm6,%xmm2,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + + vmovdqu 64(%rsp),%xmm9 + vpclmulqdq 
$0x00,%xmm0,%xmm6,%xmm4 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm9,%xmm9,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm6,%xmm6 + vpxor %xmm9,%xmm5,%xmm5 + vpxor %xmm7,%xmm6,%xmm6 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vmovdqu 80(%rsp),%xmm1 + vpclmulqdq $0x00,%xmm3,%xmm9,%xmm7 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm4,%xmm7,%xmm7 + vpunpckhqdq %xmm1,%xmm1,%xmm4 + vpclmulqdq $0x11,%xmm3,%xmm9,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpxor %xmm6,%xmm9,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm5,%xmm5 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 96(%rsp),%xmm2 + vpclmulqdq $0x00,%xmm0,%xmm1,%xmm6 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm7,%xmm6,%xmm6 + vpunpckhqdq %xmm2,%xmm2,%xmm7 + vpclmulqdq $0x11,%xmm0,%xmm1,%xmm1 + vpxor %xmm2,%xmm7,%xmm7 + vpxor %xmm9,%xmm1,%xmm1 + vpclmulqdq $0x10,%xmm15,%xmm4,%xmm4 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm5,%xmm4,%xmm4 + + vpxor 112(%rsp),%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm3,%xmm2,%xmm5 + vmovdqu 112-32(%r9),%xmm0 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpxor %xmm6,%xmm5,%xmm5 + vpclmulqdq $0x11,%xmm3,%xmm2,%xmm2 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm1,%xmm2,%xmm2 + vpclmulqdq $0x00,%xmm15,%xmm7,%xmm7 + vpxor %xmm4,%xmm7,%xmm4 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm6 + vmovdqu 0-32(%r9),%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm1 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm8 + vpxor %xmm14,%xmm1,%xmm1 + vpxor %xmm5,%xmm6,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm9 + vmovdqu 32-32(%r9),%xmm15 + vpxor %xmm2,%xmm8,%xmm7 + vpxor %xmm4,%xmm9,%xmm6 + + vmovdqu 16-32(%r9),%xmm0 + vpxor %xmm5,%xmm7,%xmm9 + vpclmulqdq $0x00,%xmm3,%xmm14,%xmm4 + vpxor %xmm9,%xmm6,%xmm6 + vpunpckhqdq %xmm13,%xmm13,%xmm2 + vpclmulqdq $0x11,%xmm3,%xmm14,%xmm14 + vpxor %xmm13,%xmm2,%xmm2 + vpslldq $8,%xmm6,%xmm9 + vpclmulqdq $0x00,%xmm15,%xmm1,%xmm1 + vpxor %xmm9,%xmm5,%xmm8 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm6,%xmm7,%xmm7 + + vpclmulqdq $0x00,%xmm0,%xmm13,%xmm5 + vmovdqu 48-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm12,%xmm12,%xmm9 + vpclmulqdq $0x11,%xmm0,%xmm13,%xmm13 + vpxor %xmm12,%xmm9,%xmm9 + vpxor %xmm14,%xmm13,%xmm13 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpclmulqdq $0x10,%xmm15,%xmm2,%xmm2 + vmovdqu 80-32(%r9),%xmm15 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm3,%xmm12,%xmm4 + vmovdqu 64-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm11,%xmm11,%xmm1 + vpclmulqdq $0x11,%xmm3,%xmm12,%xmm12 + vpxor %xmm11,%xmm1,%xmm1 + vpxor %xmm13,%xmm12,%xmm12 + vxorps 16(%rsp),%xmm7,%xmm7 + vpclmulqdq $0x00,%xmm15,%xmm9,%xmm9 + vpxor %xmm2,%xmm9,%xmm9 + + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm0,%xmm11,%xmm5 + vmovdqu 96-32(%r9),%xmm3 + vpxor %xmm4,%xmm5,%xmm5 + vpunpckhqdq %xmm10,%xmm10,%xmm2 + vpclmulqdq $0x11,%xmm0,%xmm11,%xmm11 + vpxor %xmm10,%xmm2,%xmm2 + vpalignr $8,%xmm8,%xmm8,%xmm14 + vpxor %xmm12,%xmm11,%xmm11 + vpclmulqdq $0x10,%xmm15,%xmm1,%xmm1 + vmovdqu 128-32(%r9),%xmm15 + vpxor %xmm9,%xmm1,%xmm1 + + vxorps %xmm7,%xmm14,%xmm14 + vpclmulqdq $0x10,16(%r11),%xmm8,%xmm8 + vxorps %xmm14,%xmm8,%xmm8 + + vpclmulqdq $0x00,%xmm3,%xmm10,%xmm4 + vmovdqu 112-32(%r9),%xmm0 + vpxor %xmm5,%xmm4,%xmm4 + vpunpckhqdq %xmm8,%xmm8,%xmm9 + vpclmulqdq $0x11,%xmm3,%xmm10,%xmm10 + vpxor %xmm8,%xmm9,%xmm9 + vpxor %xmm11,%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm15,%xmm2,%xmm2 + vpxor %xmm1,%xmm2,%xmm2 + + vpclmulqdq $0x00,%xmm0,%xmm8,%xmm5 + vpclmulqdq $0x11,%xmm0,%xmm8,%xmm7 + vpxor %xmm4,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm15,%xmm9,%xmm6 + vpxor %xmm10,%xmm7,%xmm7 + vpxor %xmm2,%xmm6,%xmm6 
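+# (final reduction ahead: the accumulated 256-bit Karatsuba result is
+# folded back to 128 bits by two vpclmulqdq multiplies with the GHASH
+# polynomial constant at 16(%r11), i.e. L$poly)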
+ + vpxor %xmm5,%xmm7,%xmm4 + vpxor %xmm4,%xmm6,%xmm6 + vpslldq $8,%xmm6,%xmm1 + vmovdqu 16(%r11),%xmm3 + vpsrldq $8,%xmm6,%xmm6 + vpxor %xmm1,%xmm5,%xmm8 + vpxor %xmm6,%xmm7,%xmm7 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm2,%xmm8,%xmm8 + + vpalignr $8,%xmm8,%xmm8,%xmm2 + vpclmulqdq $0x10,%xmm3,%xmm8,%xmm8 + vpxor %xmm7,%xmm2,%xmm2 + vpxor %xmm2,%xmm8,%xmm8 + vpshufb (%r11),%xmm8,%xmm8 + vmovdqu %xmm8,16(%r8) + + vzeroupper + movq -48(%rax),%r15 + + movq -40(%rax),%r14 + + movq -32(%rax),%r13 + + movq -24(%rax),%r12 + + movq -16(%rax),%rbp + + movq -8(%rax),%rbx + + leaq (%rax),%rsp + +L$gcm_enc_abort: + movq %r10,%rax + ret + + +.p2align 6 +L$bswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +L$poly: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +L$one_msb: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +L$two_lsb: +.byte 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +L$one_lsb: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +.byte 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +.p2align 6 diff --git a/crypto/aesgcm/aesni_gcm_x64_nasm.asm b/crypto/aesgcm/aesni_gcm_x64_nasm.asm new file mode 100644 index 0000000..f3371e8 --- /dev/null +++ b/crypto/aesgcm/aesni_gcm_x64_nasm.asm @@ -0,0 +1,1023 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +section .text code align=64 + + + +ALIGN 32 +_aesni_ctr32_ghash_6x: + + vmovdqu xmm2,XMMWORD[32+r11] + sub rdx,6 + vpxor xmm4,xmm4,xmm4 + vmovdqu xmm15,XMMWORD[((0-128))+rcx] + vpaddb xmm10,xmm1,xmm2 + vpaddb xmm11,xmm10,xmm2 + vpaddb xmm12,xmm11,xmm2 + vpaddb xmm13,xmm12,xmm2 + vpaddb xmm14,xmm13,xmm2 + vpxor xmm9,xmm1,xmm15 + vmovdqu XMMWORD[(16+8)+rsp],xmm4 + jmp NEAR $L$oop6x + +ALIGN 32 +$L$oop6x: + add ebx,100663296 + jc NEAR $L$handle_ctr32 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpaddb xmm1,xmm14,xmm2 + vpxor xmm10,xmm10,xmm15 + vpxor xmm11,xmm11,xmm15 + +$L$resume_ctr32: + vmovdqu XMMWORD[r8],xmm1 + vpclmulqdq xmm5,xmm7,xmm3,0x10 + vpxor xmm12,xmm12,xmm15 + vmovups xmm2,XMMWORD[((16-128))+rcx] + vpclmulqdq xmm6,xmm7,xmm3,0x01 + + + + + + + + + + + + + + + + + + xor r12,r12 + cmp r15,r14 + + vaesenc xmm9,xmm9,xmm2 + vmovdqu xmm0,XMMWORD[((48+8))+rsp] + vpxor xmm13,xmm13,xmm15 + vpclmulqdq xmm1,xmm7,xmm3,0x00 + vaesenc xmm10,xmm10,xmm2 + vpxor xmm14,xmm14,xmm15 + setnc r12b + vpclmulqdq xmm7,xmm7,xmm3,0x11 + vaesenc xmm11,xmm11,xmm2 + vmovdqu xmm3,XMMWORD[((16-32))+r9] + neg r12 + vaesenc xmm12,xmm12,xmm2 + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm0,xmm3,0x00 + vpxor xmm8,xmm8,xmm4 + vaesenc xmm13,xmm13,xmm2 + vpxor xmm4,xmm1,xmm5 + and r12,0x60 + vmovups xmm15,XMMWORD[((32-128))+rcx] + vpclmulqdq xmm1,xmm0,xmm3,0x10 + vaesenc xmm14,xmm14,xmm2 + + vpclmulqdq xmm2,xmm0,xmm3,0x01 + lea r14,[r12*1+r14] + vaesenc xmm9,xmm9,xmm15 + vpxor xmm8,xmm8,XMMWORD[((16+8))+rsp] + vpclmulqdq xmm3,xmm0,xmm3,0x11 + vmovdqu xmm0,XMMWORD[((64+8))+rsp] + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[88+r14] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[80+r14] + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((32+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((40+8))+rsp],r12 + vmovdqu xmm5,XMMWORD[((48-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((48-128))+rcx] + vpxor xmm6,xmm6,xmm1 + vpclmulqdq xmm1,xmm0,xmm5,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm2 + vpclmulqdq xmm2,xmm0,xmm5,0x10 + vaesenc xmm10,xmm10,xmm15 + vpxor xmm7,xmm7,xmm3 + 
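+; (one pass of $L$oop6x interleaves a vaesenc round for all six counter
+; blocks with the vpclmulqdq work that hashes the previous six-block
+; group, hiding the 128-bit multiplier latency behind the AES rounds)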
vpclmulqdq xmm3,xmm0,xmm5,0x01 + vaesenc xmm11,xmm11,xmm15 + vpclmulqdq xmm5,xmm0,xmm5,0x11 + vmovdqu xmm0,XMMWORD[((80+8))+rsp] + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vpxor xmm4,xmm4,xmm1 + vmovdqu xmm1,XMMWORD[((64-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((64-128))+rcx] + vpxor xmm6,xmm6,xmm2 + vpclmulqdq xmm2,xmm0,xmm1,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm3 + vpclmulqdq xmm3,xmm0,xmm1,0x10 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[72+r14] + vpxor xmm7,xmm7,xmm5 + vpclmulqdq xmm5,xmm0,xmm1,0x01 + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[64+r14] + vpclmulqdq xmm1,xmm0,xmm1,0x11 + vmovdqu xmm0,XMMWORD[((96+8))+rsp] + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((48+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((56+8))+rsp],r12 + vpxor xmm4,xmm4,xmm2 + vmovdqu xmm2,XMMWORD[((96-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((80-128))+rcx] + vpxor xmm6,xmm6,xmm3 + vpclmulqdq xmm3,xmm0,xmm2,0x00 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm0,xmm2,0x10 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[56+r14] + vpxor xmm7,xmm7,xmm1 + vpclmulqdq xmm1,xmm0,xmm2,0x01 + vpxor xmm8,xmm8,XMMWORD[((112+8))+rsp] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[48+r14] + vpclmulqdq xmm2,xmm0,xmm2,0x11 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((64+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((72+8))+rsp],r12 + vpxor xmm4,xmm4,xmm3 + vmovdqu xmm3,XMMWORD[((112-32))+r9] + vaesenc xmm14,xmm14,xmm15 + + vmovups xmm15,XMMWORD[((96-128))+rcx] + vpxor xmm6,xmm6,xmm5 + vpclmulqdq xmm5,xmm8,xmm3,0x10 + vaesenc xmm9,xmm9,xmm15 + vpxor xmm6,xmm6,xmm1 + vpclmulqdq xmm1,xmm8,xmm3,0x01 + vaesenc xmm10,xmm10,xmm15 + movbe r13,QWORD[40+r14] + vpxor xmm7,xmm7,xmm2 + vpclmulqdq xmm2,xmm8,xmm3,0x00 + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[32+r14] + vpclmulqdq xmm8,xmm8,xmm3,0x11 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((80+8))+rsp],r13 + vaesenc xmm13,xmm13,xmm15 + mov QWORD[((88+8))+rsp],r12 + vpxor xmm6,xmm6,xmm5 + vaesenc xmm14,xmm14,xmm15 + vpxor xmm6,xmm6,xmm1 + + vmovups xmm15,XMMWORD[((112-128))+rcx] + vpslldq xmm5,xmm6,8 + vpxor xmm4,xmm4,xmm2 + vmovdqu xmm3,XMMWORD[16+r11] + + vaesenc xmm9,xmm9,xmm15 + vpxor xmm7,xmm7,xmm8 + vaesenc xmm10,xmm10,xmm15 + vpxor xmm4,xmm4,xmm5 + movbe r13,QWORD[24+r14] + vaesenc xmm11,xmm11,xmm15 + movbe r12,QWORD[16+r14] + vpalignr xmm0,xmm4,xmm4,8 + vpclmulqdq xmm4,xmm4,xmm3,0x10 + mov QWORD[((96+8))+rsp],r13 + vaesenc xmm12,xmm12,xmm15 + mov QWORD[((104+8))+rsp],r12 + vaesenc xmm13,xmm13,xmm15 + vmovups xmm1,XMMWORD[((128-128))+rcx] + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vmovups xmm15,XMMWORD[((144-128))+rcx] + vaesenc xmm10,xmm10,xmm1 + vpsrldq xmm6,xmm6,8 + vaesenc xmm11,xmm11,xmm1 + vpxor xmm7,xmm7,xmm6 + vaesenc xmm12,xmm12,xmm1 + vpxor xmm4,xmm4,xmm0 + movbe r13,QWORD[8+r14] + vaesenc xmm13,xmm13,xmm1 + movbe r12,QWORD[r14] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((160-128))+rcx] + cmp ebp,11 + jb NEAR $L$enc_tail + + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vaesenc xmm10,xmm10,xmm1 + vaesenc xmm11,xmm11,xmm1 + vaesenc xmm12,xmm12,xmm1 + vaesenc xmm13,xmm13,xmm1 + vmovups xmm15,XMMWORD[((176-128))+rcx] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((192-128))+rcx] + je NEAR $L$enc_tail + + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc 
xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + + vaesenc xmm9,xmm9,xmm1 + vaesenc xmm10,xmm10,xmm1 + vaesenc xmm11,xmm11,xmm1 + vaesenc xmm12,xmm12,xmm1 + vaesenc xmm13,xmm13,xmm1 + vmovups xmm15,XMMWORD[((208-128))+rcx] + vaesenc xmm14,xmm14,xmm1 + vmovups xmm1,XMMWORD[((224-128))+rcx] + jmp NEAR $L$enc_tail + +ALIGN 32 +$L$handle_ctr32: + vmovdqu xmm0,XMMWORD[r11] + vpshufb xmm6,xmm1,xmm0 + vmovdqu xmm5,XMMWORD[48+r11] + vpaddd xmm10,xmm6,XMMWORD[64+r11] + vpaddd xmm11,xmm6,xmm5 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpaddd xmm12,xmm10,xmm5 + vpshufb xmm10,xmm10,xmm0 + vpaddd xmm13,xmm11,xmm5 + vpshufb xmm11,xmm11,xmm0 + vpxor xmm10,xmm10,xmm15 + vpaddd xmm14,xmm12,xmm5 + vpshufb xmm12,xmm12,xmm0 + vpxor xmm11,xmm11,xmm15 + vpaddd xmm1,xmm13,xmm5 + vpshufb xmm13,xmm13,xmm0 + vpshufb xmm14,xmm14,xmm0 + vpshufb xmm1,xmm1,xmm0 + jmp NEAR $L$resume_ctr32 + +ALIGN 32 +$L$enc_tail: + vaesenc xmm9,xmm9,xmm15 + vmovdqu XMMWORD[(16+8)+rsp],xmm7 + vpalignr xmm8,xmm4,xmm4,8 + vaesenc xmm10,xmm10,xmm15 + vpclmulqdq xmm4,xmm4,xmm3,0x10 + vpxor xmm2,xmm1,XMMWORD[rdi] + vaesenc xmm11,xmm11,xmm15 + vpxor xmm0,xmm1,XMMWORD[16+rdi] + vaesenc xmm12,xmm12,xmm15 + vpxor xmm5,xmm1,XMMWORD[32+rdi] + vaesenc xmm13,xmm13,xmm15 + vpxor xmm6,xmm1,XMMWORD[48+rdi] + vaesenc xmm14,xmm14,xmm15 + vpxor xmm7,xmm1,XMMWORD[64+rdi] + vpxor xmm3,xmm1,XMMWORD[80+rdi] + vmovdqu xmm1,XMMWORD[r8] + + vaesenclast xmm9,xmm9,xmm2 + vmovdqu xmm2,XMMWORD[32+r11] + vaesenclast xmm10,xmm10,xmm0 + vpaddb xmm0,xmm1,xmm2 + mov QWORD[((112+8))+rsp],r13 + lea rdi,[96+rdi] + vaesenclast xmm11,xmm11,xmm5 + vpaddb xmm5,xmm0,xmm2 + mov QWORD[((120+8))+rsp],r12 + lea rsi,[96+rsi] + vmovdqu xmm15,XMMWORD[((0-128))+rcx] + vaesenclast xmm12,xmm12,xmm6 + vpaddb xmm6,xmm5,xmm2 + vaesenclast xmm13,xmm13,xmm7 + vpaddb xmm7,xmm6,xmm2 + vaesenclast xmm14,xmm14,xmm3 + vpaddb xmm3,xmm7,xmm2 + + add r10,0x60 + sub rdx,0x6 + jc NEAR $L$6x_done + + vmovups XMMWORD[(-96)+rsi],xmm9 + vpxor xmm9,xmm1,xmm15 + vmovups XMMWORD[(-80)+rsi],xmm10 + vmovdqa xmm10,xmm0 + vmovups XMMWORD[(-64)+rsi],xmm11 + vmovdqa xmm11,xmm5 + vmovups XMMWORD[(-48)+rsi],xmm12 + vmovdqa xmm12,xmm6 + vmovups XMMWORD[(-32)+rsi],xmm13 + vmovdqa xmm13,xmm7 + vmovups XMMWORD[(-16)+rsi],xmm14 + vmovdqa xmm14,xmm3 + vmovdqu xmm7,XMMWORD[((32+8))+rsp] + jmp NEAR $L$oop6x + +$L$6x_done: + vpxor xmm8,xmm8,XMMWORD[((16+8))+rsp] + vpxor xmm8,xmm8,xmm4 + + ret + + +global aesni_gcm_decrypt + +ALIGN 32 +aesni_gcm_decrypt: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_gcm_decrypt: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + mov r9,QWORD[48+rsp] + + + + xor r10,r10 + + + + cmp rdx,0x60 + jb NEAR $L$gcm_dec_abort + + lea rax,[rsp] + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + + lea rsp,[((-168))+rsp] + movaps XMMWORD[(-216)+rax],xmm6 + movaps XMMWORD[(-200)+rax],xmm7 + movaps XMMWORD[(-184)+rax],xmm8 + movaps XMMWORD[(-168)+rax],xmm9 + movaps XMMWORD[(-152)+rax],xmm10 + movaps XMMWORD[(-136)+rax],xmm11 + movaps XMMWORD[(-120)+rax],xmm12 + movaps XMMWORD[(-104)+rax],xmm13 + movaps XMMWORD[(-88)+rax],xmm14 + movaps XMMWORD[(-72)+rax],xmm15 +$L$gcm_dec_body: + vzeroupper + + vmovdqu xmm1,XMMWORD[r8] + add rsp,-128 + mov ebx,DWORD[12+r8] + lea r11,[$L$bswap_mask] + lea r14,[((-128))+rcx] + mov r15,0xf80 + vmovdqu xmm8,XMMWORD[16+r8] + and rsp,-128 + vmovdqu xmm0,XMMWORD[r11] + lea rcx,[128+rcx] + lea r9,[((16+32))+r9] + mov ebp,DWORD[((240-128))+rcx] + vpshufb 
xmm8,xmm8,xmm0 + + and r14,r15 + and r15,rsp + sub r15,r14 + jc NEAR $L$dec_no_key_aliasing + cmp r15,768 + jnc NEAR $L$dec_no_key_aliasing + sub rsp,r15 +$L$dec_no_key_aliasing: + + vmovdqu xmm7,XMMWORD[80+rdi] + lea r14,[rdi] + vmovdqu xmm4,XMMWORD[64+rdi] + + + + + + + + lea r15,[((-192))+rdx*1+rdi] + + vmovdqu xmm5,XMMWORD[48+rdi] + shr rdx,4 + xor r10,r10 + vmovdqu xmm6,XMMWORD[32+rdi] + vpshufb xmm7,xmm7,xmm0 + vmovdqu xmm2,XMMWORD[16+rdi] + vpshufb xmm4,xmm4,xmm0 + vmovdqu xmm3,XMMWORD[rdi] + vpshufb xmm5,xmm5,xmm0 + vmovdqu XMMWORD[48+rsp],xmm4 + vpshufb xmm6,xmm6,xmm0 + vmovdqu XMMWORD[64+rsp],xmm5 + vpshufb xmm2,xmm2,xmm0 + vmovdqu XMMWORD[80+rsp],xmm6 + vpshufb xmm3,xmm3,xmm0 + vmovdqu XMMWORD[96+rsp],xmm2 + vmovdqu XMMWORD[112+rsp],xmm3 + + call _aesni_ctr32_ghash_6x + + vmovups XMMWORD[(-96)+rsi],xmm9 + vmovups XMMWORD[(-80)+rsi],xmm10 + vmovups XMMWORD[(-64)+rsi],xmm11 + vmovups XMMWORD[(-48)+rsi],xmm12 + vmovups XMMWORD[(-32)+rsi],xmm13 + vmovups XMMWORD[(-16)+rsi],xmm14 + + vpshufb xmm8,xmm8,XMMWORD[r11] + vmovdqu XMMWORD[16+r8],xmm8 + + vzeroupper + movaps xmm6,XMMWORD[((-216))+rax] + movaps xmm7,XMMWORD[((-200))+rax] + movaps xmm8,XMMWORD[((-184))+rax] + movaps xmm9,XMMWORD[((-168))+rax] + movaps xmm10,XMMWORD[((-152))+rax] + movaps xmm11,XMMWORD[((-136))+rax] + movaps xmm12,XMMWORD[((-120))+rax] + movaps xmm13,XMMWORD[((-104))+rax] + movaps xmm14,XMMWORD[((-88))+rax] + movaps xmm15,XMMWORD[((-72))+rax] + mov r15,QWORD[((-48))+rax] + + mov r14,QWORD[((-40))+rax] + + mov r13,QWORD[((-32))+rax] + + mov r12,QWORD[((-24))+rax] + + mov rbp,QWORD[((-16))+rax] + + mov rbx,QWORD[((-8))+rax] + + lea rsp,[rax] + +$L$gcm_dec_abort: + mov rax,r10 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_gcm_decrypt: + +ALIGN 32 +_aesni_ctr32_6x: + + vmovdqu xmm4,XMMWORD[((0-128))+rcx] + vmovdqu xmm2,XMMWORD[32+r11] + lea r13,[((-1))+rbp] + vmovups xmm15,XMMWORD[((16-128))+rcx] + lea r12,[((32-128))+rcx] + vpxor xmm9,xmm1,xmm4 + add ebx,100663296 + jc NEAR $L$handle_ctr32_2 + vpaddb xmm10,xmm1,xmm2 + vpaddb xmm11,xmm10,xmm2 + vpxor xmm10,xmm10,xmm4 + vpaddb xmm12,xmm11,xmm2 + vpxor xmm11,xmm11,xmm4 + vpaddb xmm13,xmm12,xmm2 + vpxor xmm12,xmm12,xmm4 + vpaddb xmm14,xmm13,xmm2 + vpxor xmm13,xmm13,xmm4 + vpaddb xmm1,xmm14,xmm2 + vpxor xmm14,xmm14,xmm4 + jmp NEAR $L$oop_ctr32 + +ALIGN 16 +$L$oop_ctr32: + vaesenc xmm9,xmm9,xmm15 + vaesenc xmm10,xmm10,xmm15 + vaesenc xmm11,xmm11,xmm15 + vaesenc xmm12,xmm12,xmm15 + vaesenc xmm13,xmm13,xmm15 + vaesenc xmm14,xmm14,xmm15 + vmovups xmm15,XMMWORD[r12] + lea r12,[16+r12] + dec r13d + jnz NEAR $L$oop_ctr32 + + vmovdqu xmm3,XMMWORD[r12] + vaesenc xmm9,xmm9,xmm15 + vpxor xmm4,xmm3,XMMWORD[rdi] + vaesenc xmm10,xmm10,xmm15 + vpxor xmm5,xmm3,XMMWORD[16+rdi] + vaesenc xmm11,xmm11,xmm15 + vpxor xmm6,xmm3,XMMWORD[32+rdi] + vaesenc xmm12,xmm12,xmm15 + vpxor xmm8,xmm3,XMMWORD[48+rdi] + vaesenc xmm13,xmm13,xmm15 + vpxor xmm2,xmm3,XMMWORD[64+rdi] + vaesenc xmm14,xmm14,xmm15 + vpxor xmm3,xmm3,XMMWORD[80+rdi] + lea rdi,[96+rdi] + + vaesenclast xmm9,xmm9,xmm4 + vaesenclast xmm10,xmm10,xmm5 + vaesenclast xmm11,xmm11,xmm6 + vaesenclast xmm12,xmm12,xmm8 + vaesenclast xmm13,xmm13,xmm2 + vaesenclast xmm14,xmm14,xmm3 + vmovups XMMWORD[rsi],xmm9 + vmovups XMMWORD[16+rsi],xmm10 + vmovups XMMWORD[32+rsi],xmm11 + vmovups XMMWORD[48+rsi],xmm12 + vmovups XMMWORD[64+rsi],xmm13 + vmovups XMMWORD[80+rsi],xmm14 + lea rsi,[96+rsi] + + ret +ALIGN 32 +$L$handle_ctr32_2: + vpshufb xmm6,xmm1,xmm0 + vmovdqu xmm5,XMMWORD[48+r11] + vpaddd 
xmm10,xmm6,XMMWORD[64+r11] + vpaddd xmm11,xmm6,xmm5 + vpaddd xmm12,xmm10,xmm5 + vpshufb xmm10,xmm10,xmm0 + vpaddd xmm13,xmm11,xmm5 + vpshufb xmm11,xmm11,xmm0 + vpxor xmm10,xmm10,xmm4 + vpaddd xmm14,xmm12,xmm5 + vpshufb xmm12,xmm12,xmm0 + vpxor xmm11,xmm11,xmm4 + vpaddd xmm1,xmm13,xmm5 + vpshufb xmm13,xmm13,xmm0 + vpxor xmm12,xmm12,xmm4 + vpshufb xmm14,xmm14,xmm0 + vpxor xmm13,xmm13,xmm4 + vpshufb xmm1,xmm1,xmm0 + vpxor xmm14,xmm14,xmm4 + jmp NEAR $L$oop_ctr32 + + + +global aesni_gcm_encrypt + +ALIGN 32 +aesni_gcm_encrypt: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_gcm_encrypt: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + mov r9,QWORD[48+rsp] + + + + xor r10,r10 + + + + + cmp rdx,0x60*3 + jb NEAR $L$gcm_enc_abort + + lea rax,[rsp] + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + + lea rsp,[((-168))+rsp] + movaps XMMWORD[(-216)+rax],xmm6 + movaps XMMWORD[(-200)+rax],xmm7 + movaps XMMWORD[(-184)+rax],xmm8 + movaps XMMWORD[(-168)+rax],xmm9 + movaps XMMWORD[(-152)+rax],xmm10 + movaps XMMWORD[(-136)+rax],xmm11 + movaps XMMWORD[(-120)+rax],xmm12 + movaps XMMWORD[(-104)+rax],xmm13 + movaps XMMWORD[(-88)+rax],xmm14 + movaps XMMWORD[(-72)+rax],xmm15 +$L$gcm_enc_body: + vzeroupper + + vmovdqu xmm1,XMMWORD[r8] + add rsp,-128 + mov ebx,DWORD[12+r8] + lea r11,[$L$bswap_mask] + lea r14,[((-128))+rcx] + mov r15,0xf80 + lea rcx,[128+rcx] + vmovdqu xmm0,XMMWORD[r11] + and rsp,-128 + mov ebp,DWORD[((240-128))+rcx] + + and r14,r15 + and r15,rsp + sub r15,r14 + jc NEAR $L$enc_no_key_aliasing + cmp r15,768 + jnc NEAR $L$enc_no_key_aliasing + sub rsp,r15 +$L$enc_no_key_aliasing: + + lea r14,[rsi] + + + + + + + + + lea r15,[((-192))+rdx*1+rsi] + + shr rdx,4 + + call _aesni_ctr32_6x + + vpshufb xmm8,xmm9,xmm0 + vpshufb xmm2,xmm10,xmm0 + vmovdqu XMMWORD[112+rsp],xmm8 + vpshufb xmm4,xmm11,xmm0 + vmovdqu XMMWORD[96+rsp],xmm2 + vpshufb xmm5,xmm12,xmm0 + vmovdqu XMMWORD[80+rsp],xmm4 + vpshufb xmm6,xmm13,xmm0 + vmovdqu XMMWORD[64+rsp],xmm5 + vpshufb xmm7,xmm14,xmm0 + vmovdqu XMMWORD[48+rsp],xmm6 + + call _aesni_ctr32_6x + + vmovdqu xmm8,XMMWORD[16+r8] + lea r9,[((16+32))+r9] + sub rdx,12 + mov r10,0x60*2 + vpshufb xmm8,xmm8,xmm0 + + call _aesni_ctr32_ghash_6x + vmovdqu xmm7,XMMWORD[32+rsp] + vmovdqu xmm0,XMMWORD[r11] + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpunpckhqdq xmm1,xmm7,xmm7 + vmovdqu xmm15,XMMWORD[((32-32))+r9] + vmovups XMMWORD[(-96)+rsi],xmm9 + vpshufb xmm9,xmm9,xmm0 + vpxor xmm1,xmm1,xmm7 + vmovups XMMWORD[(-80)+rsi],xmm10 + vpshufb xmm10,xmm10,xmm0 + vmovups XMMWORD[(-64)+rsi],xmm11 + vpshufb xmm11,xmm11,xmm0 + vmovups XMMWORD[(-48)+rsi],xmm12 + vpshufb xmm12,xmm12,xmm0 + vmovups XMMWORD[(-32)+rsi],xmm13 + vpshufb xmm13,xmm13,xmm0 + vmovups XMMWORD[(-16)+rsi],xmm14 + vpshufb xmm14,xmm14,xmm0 + vmovdqu XMMWORD[16+rsp],xmm9 + vmovdqu xmm6,XMMWORD[48+rsp] + vmovdqu xmm0,XMMWORD[((16-32))+r9] + vpunpckhqdq xmm2,xmm6,xmm6 + vpclmulqdq xmm5,xmm7,xmm3,0x00 + vpxor xmm2,xmm2,xmm6 + vpclmulqdq xmm7,xmm7,xmm3,0x11 + vpclmulqdq xmm1,xmm1,xmm15,0x00 + + vmovdqu xmm9,XMMWORD[64+rsp] + vpclmulqdq xmm4,xmm6,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((48-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm5,xmm9,xmm9 + vpclmulqdq xmm6,xmm6,xmm0,0x11 + vpxor xmm5,xmm5,xmm9 + vpxor xmm6,xmm6,xmm7 + vpclmulqdq xmm2,xmm2,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((80-32))+r9] + vpxor xmm2,xmm2,xmm1 + + vmovdqu xmm1,XMMWORD[80+rsp] + vpclmulqdq xmm7,xmm9,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((64-32))+r9] + vpxor xmm7,xmm7,xmm4 + 
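+; tail GHASH: fold the final ciphertext blocks, still held in registers,
+; into the hash by multiplying each with its precomputed power of H from
+; the table at r9 (vpunpckhqdq forms the hi^lo halves for the middle
+; Karatsuba term)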
vpunpckhqdq xmm4,xmm1,xmm1 + vpclmulqdq xmm9,xmm9,xmm3,0x11 + vpxor xmm4,xmm4,xmm1 + vpxor xmm9,xmm9,xmm6 + vpclmulqdq xmm5,xmm5,xmm15,0x00 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm2,XMMWORD[96+rsp] + vpclmulqdq xmm6,xmm1,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((96-32))+r9] + vpxor xmm6,xmm6,xmm7 + vpunpckhqdq xmm7,xmm2,xmm2 + vpclmulqdq xmm1,xmm1,xmm0,0x11 + vpxor xmm7,xmm7,xmm2 + vpxor xmm1,xmm1,xmm9 + vpclmulqdq xmm4,xmm4,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((128-32))+r9] + vpxor xmm4,xmm4,xmm5 + + vpxor xmm8,xmm8,XMMWORD[112+rsp] + vpclmulqdq xmm5,xmm2,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((112-32))+r9] + vpunpckhqdq xmm9,xmm8,xmm8 + vpxor xmm5,xmm5,xmm6 + vpclmulqdq xmm2,xmm2,xmm3,0x11 + vpxor xmm9,xmm9,xmm8 + vpxor xmm2,xmm2,xmm1 + vpclmulqdq xmm7,xmm7,xmm15,0x00 + vpxor xmm4,xmm7,xmm4 + + vpclmulqdq xmm6,xmm8,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((0-32))+r9] + vpunpckhqdq xmm1,xmm14,xmm14 + vpclmulqdq xmm8,xmm8,xmm0,0x11 + vpxor xmm1,xmm1,xmm14 + vpxor xmm5,xmm6,xmm5 + vpclmulqdq xmm9,xmm9,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((32-32))+r9] + vpxor xmm7,xmm8,xmm2 + vpxor xmm6,xmm9,xmm4 + + vmovdqu xmm0,XMMWORD[((16-32))+r9] + vpxor xmm9,xmm7,xmm5 + vpclmulqdq xmm4,xmm14,xmm3,0x00 + vpxor xmm6,xmm6,xmm9 + vpunpckhqdq xmm2,xmm13,xmm13 + vpclmulqdq xmm14,xmm14,xmm3,0x11 + vpxor xmm2,xmm2,xmm13 + vpslldq xmm9,xmm6,8 + vpclmulqdq xmm1,xmm1,xmm15,0x00 + vpxor xmm8,xmm5,xmm9 + vpsrldq xmm6,xmm6,8 + vpxor xmm7,xmm7,xmm6 + + vpclmulqdq xmm5,xmm13,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((48-32))+r9] + vpxor xmm5,xmm5,xmm4 + vpunpckhqdq xmm9,xmm12,xmm12 + vpclmulqdq xmm13,xmm13,xmm0,0x11 + vpxor xmm9,xmm9,xmm12 + vpxor xmm13,xmm13,xmm14 + vpalignr xmm14,xmm8,xmm8,8 + vpclmulqdq xmm2,xmm2,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((80-32))+r9] + vpxor xmm2,xmm2,xmm1 + + vpclmulqdq xmm4,xmm12,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((64-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm1,xmm11,xmm11 + vpclmulqdq xmm12,xmm12,xmm3,0x11 + vpxor xmm1,xmm1,xmm11 + vpxor xmm12,xmm12,xmm13 + vxorps xmm7,xmm7,XMMWORD[16+rsp] + vpclmulqdq xmm9,xmm9,xmm15,0x00 + vpxor xmm9,xmm9,xmm2 + + vpclmulqdq xmm8,xmm8,XMMWORD[16+r11],0x10 + vxorps xmm8,xmm8,xmm14 + + vpclmulqdq xmm5,xmm11,xmm0,0x00 + vmovdqu xmm3,XMMWORD[((96-32))+r9] + vpxor xmm5,xmm5,xmm4 + vpunpckhqdq xmm2,xmm10,xmm10 + vpclmulqdq xmm11,xmm11,xmm0,0x11 + vpxor xmm2,xmm2,xmm10 + vpalignr xmm14,xmm8,xmm8,8 + vpxor xmm11,xmm11,xmm12 + vpclmulqdq xmm1,xmm1,xmm15,0x10 + vmovdqu xmm15,XMMWORD[((128-32))+r9] + vpxor xmm1,xmm1,xmm9 + + vxorps xmm14,xmm14,xmm7 + vpclmulqdq xmm8,xmm8,XMMWORD[16+r11],0x10 + vxorps xmm8,xmm8,xmm14 + + vpclmulqdq xmm4,xmm10,xmm3,0x00 + vmovdqu xmm0,XMMWORD[((112-32))+r9] + vpxor xmm4,xmm4,xmm5 + vpunpckhqdq xmm9,xmm8,xmm8 + vpclmulqdq xmm10,xmm10,xmm3,0x11 + vpxor xmm9,xmm9,xmm8 + vpxor xmm10,xmm10,xmm11 + vpclmulqdq xmm2,xmm2,xmm15,0x00 + vpxor xmm2,xmm2,xmm1 + + vpclmulqdq xmm5,xmm8,xmm0,0x00 + vpclmulqdq xmm7,xmm8,xmm0,0x11 + vpxor xmm5,xmm5,xmm4 + vpclmulqdq xmm6,xmm9,xmm15,0x10 + vpxor xmm7,xmm7,xmm10 + vpxor xmm6,xmm6,xmm2 + + vpxor xmm4,xmm7,xmm5 + vpxor xmm6,xmm6,xmm4 + vpslldq xmm1,xmm6,8 + vmovdqu xmm3,XMMWORD[16+r11] + vpsrldq xmm6,xmm6,8 + vpxor xmm8,xmm5,xmm1 + vpxor xmm7,xmm7,xmm6 + + vpalignr xmm2,xmm8,xmm8,8 + vpclmulqdq xmm8,xmm8,xmm3,0x10 + vpxor xmm8,xmm8,xmm2 + + vpalignr xmm2,xmm8,xmm8,8 + vpclmulqdq xmm8,xmm8,xmm3,0x10 + vpxor xmm2,xmm2,xmm7 + vpxor xmm8,xmm8,xmm2 + vpshufb xmm8,xmm8,XMMWORD[r11] + vmovdqu XMMWORD[16+r8],xmm8 + + vzeroupper + movaps xmm6,XMMWORD[((-216))+rax] + movaps xmm7,XMMWORD[((-200))+rax] + movaps 
xmm8,XMMWORD[((-184))+rax] + movaps xmm9,XMMWORD[((-168))+rax] + movaps xmm10,XMMWORD[((-152))+rax] + movaps xmm11,XMMWORD[((-136))+rax] + movaps xmm12,XMMWORD[((-120))+rax] + movaps xmm13,XMMWORD[((-104))+rax] + movaps xmm14,XMMWORD[((-88))+rax] + movaps xmm15,XMMWORD[((-72))+rax] + mov r15,QWORD[((-48))+rax] + + mov r14,QWORD[((-40))+rax] + + mov r13,QWORD[((-32))+rax] + + mov r12,QWORD[((-24))+rax] + + mov rbp,QWORD[((-16))+rax] + + mov rbx,QWORD[((-8))+rax] + + lea rsp,[rax] + +$L$gcm_enc_abort: + mov rax,r10 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_gcm_encrypt: +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$poly: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +$L$one_msb: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +$L$two_lsb: +DB 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +$L$one_lsb: +DB 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +DB 65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108 +DB 101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82 +DB 89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112 +DB 114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +gcm_se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[120+r8] + + mov r15,QWORD[((-48))+rax] + mov r14,QWORD[((-40))+rax] + mov r13,QWORD[((-32))+rax] + mov r12,QWORD[((-24))+rax] + mov rbp,QWORD[((-16))+rax] + mov rbx,QWORD[((-8))+rax] + mov QWORD[240+r8],r15 + mov QWORD[232+r8],r14 + mov QWORD[224+r8],r13 + mov QWORD[216+r8],r12 + mov QWORD[160+r8],rbp + mov QWORD[144+r8],rbx + + lea rsi,[((-216))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_aesni_gcm_decrypt wrt ..imagebase + DD $L$SEH_end_aesni_gcm_decrypt wrt ..imagebase + DD $L$SEH_gcm_dec_info wrt ..imagebase + + DD $L$SEH_begin_aesni_gcm_encrypt wrt ..imagebase + DD $L$SEH_end_aesni_gcm_encrypt wrt ..imagebase + DD $L$SEH_gcm_enc_info wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_gcm_dec_info: +DB 9,0,0,0 + DD gcm_se_handler wrt ..imagebase + DD $L$gcm_dec_body wrt ..imagebase,$L$gcm_dec_abort wrt ..imagebase +$L$SEH_gcm_enc_info: +DB 9,0,0,0 + DD gcm_se_handler wrt ..imagebase + DD $L$gcm_enc_body wrt ..imagebase,$L$gcm_enc_abort wrt ..imagebase diff --git a/crypto/aesgcm/aesni_x64_gas.s b/crypto/aesgcm/aesni_x64_gas.s new file mode 100644 index 0000000..a1cd80b --- /dev/null +++ b/crypto/aesgcm/aesni_x64_gas.s @@ -0,0 +1,1510 @@ +.text +.globl aesni_encrypt +.type aesni_encrypt,@function +.align 16 
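+# Single-block AES-NI primitives follow. The .byte runs throughout this
+# file are hand-encoded AES-NI instructions (e.g. 102,15,56,220,209 is
+# aesenc %xmm1,%xmm2), emitted as raw bytes so the file assembles even
+# with toolchains that lack the mnemonics.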
+aesni_encrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +.Loop_enc1_1: +.byte 102,15,56,220,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz .Loop_enc1_1 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret +.size aesni_encrypt,.-aesni_encrypt + +.globl aesni_decrypt +.type aesni_decrypt,@function +.align 16 +aesni_decrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +.Loop_dec1_2: +.byte 102,15,56,222,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz .Loop_dec1_2 +.byte 102,15,56,223,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret +.size aesni_decrypt, .-aesni_decrypt +.type _aesni_encrypt2,@function +.align 16 +_aesni_encrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Lenc_loop2: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop2 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 + ret +.size _aesni_encrypt2,.-_aesni_encrypt2 +.type _aesni_decrypt2,@function +.align 16 +_aesni_decrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Ldec_loop2: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop2 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 + ret +.size _aesni_decrypt2,.-_aesni_decrypt2 +.type _aesni_encrypt3,@function +.align 16 +_aesni_encrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Lenc_loop3: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop3 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 + ret +.size _aesni_encrypt3,.-_aesni_encrypt3 +.type _aesni_decrypt3,@function +.align 16 +_aesni_decrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +.Ldec_loop3: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop3 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 + ret +.size _aesni_decrypt3,.-_aesni_decrypt3 +.type 
_aesni_encrypt4,@function +.align 16 +_aesni_encrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +.Lenc_loop4: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop4 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 + ret +.size _aesni_encrypt4,.-_aesni_encrypt4 +.type _aesni_decrypt4,@function +.align 16 +_aesni_decrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +.Ldec_loop4: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop4 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 + ret +.size _aesni_decrypt4,.-_aesni_decrypt4 +.type _aesni_encrypt6,@function +.align 16 +_aesni_encrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,220,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,220,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Lenc_loop6_enter +.align 16 +.Lenc_loop6: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.Lenc_loop6_enter: +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop6 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 + ret +.size _aesni_encrypt6,.-_aesni_encrypt6 +.type _aesni_decrypt6,@function +.align 16 +_aesni_decrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,222,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,222,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Ldec_loop6_enter +.align 16 +.Ldec_loop6: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.Ldec_loop6_enter: +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 + 
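+# two aesdec rounds per iteration: the keys alternate between %xmm1
+# (reloaded below) and %xmm0, with %rax stepping through the schedule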
movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop6 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 + ret +.size _aesni_decrypt6,.-_aesni_decrypt6 +.type _aesni_encrypt8,@function +.align 16 +_aesni_encrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,220,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Lenc_loop8_inner +.align 16 +.Lenc_loop8: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.Lenc_loop8_inner: +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.Lenc_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Lenc_loop8 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 +.byte 102,68,15,56,221,192 +.byte 102,68,15,56,221,200 + ret +.size _aesni_encrypt8,.-_aesni_encrypt8 +.type _aesni_decrypt8,@function +.align 16 +_aesni_decrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,222,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp .Ldec_loop8_inner +.align 16 +.Ldec_loop8: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.Ldec_loop8_inner: +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.Ldec_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 +.byte 102,68,15,56,222,192 +.byte 102,68,15,56,222,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz .Ldec_loop8 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 +.byte 102,68,15,56,223,192 +.byte 102,68,15,56,223,200 + ret +.size _aesni_decrypt8,.-_aesni_decrypt8 +.globl 
aesni_ctr32_encrypt_blocks +.type aesni_ctr32_encrypt_blocks,@function +.align 16 +aesni_ctr32_encrypt_blocks: +.cfi_startproc + cmpq $1,%rdx + jne .Lctr32_bulk + + + + movups (%r8),%xmm2 + movups (%rdi),%xmm3 + movl 240(%rcx),%edx + movups (%rcx),%xmm0 + movups 16(%rcx),%xmm1 + leaq 32(%rcx),%rcx + xorps %xmm0,%xmm2 +.Loop_enc1_3: +.byte 102,15,56,220,209 + decl %edx + movups (%rcx),%xmm1 + leaq 16(%rcx),%rcx + jnz .Loop_enc1_3 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + xorps %xmm3,%xmm2 + pxor %xmm3,%xmm3 + movups %xmm2,(%rsi) + xorps %xmm2,%xmm2 + jmp .Lctr32_epilogue + +.align 16 +.Lctr32_bulk: + leaq (%rsp),%r11 +.cfi_def_cfa_register %r11 + pushq %rbp +.cfi_offset %rbp,-16 + subq $128,%rsp + andq $-16,%rsp + + + + + movdqu (%r8),%xmm2 + movdqu (%rcx),%xmm0 + movl 12(%r8),%r8d + pxor %xmm0,%xmm2 + movl 12(%rcx),%ebp + movdqa %xmm2,0(%rsp) + bswapl %r8d + movdqa %xmm2,%xmm3 + movdqa %xmm2,%xmm4 + movdqa %xmm2,%xmm5 + movdqa %xmm2,64(%rsp) + movdqa %xmm2,80(%rsp) + movdqa %xmm2,96(%rsp) + movq %rdx,%r10 + movdqa %xmm2,112(%rsp) + + leaq 1(%r8),%rax + leaq 2(%r8),%rdx + bswapl %eax + bswapl %edx + xorl %ebp,%eax + xorl %ebp,%edx +.byte 102,15,58,34,216,3 + leaq 3(%r8),%rax + movdqa %xmm3,16(%rsp) +.byte 102,15,58,34,226,3 + bswapl %eax + movq %r10,%rdx + leaq 4(%r8),%r10 + movdqa %xmm4,32(%rsp) + xorl %ebp,%eax + bswapl %r10d +.byte 102,15,58,34,232,3 + xorl %ebp,%r10d + movdqa %xmm5,48(%rsp) + leaq 5(%r8),%r9 + movl %r10d,64+12(%rsp) + bswapl %r9d + leaq 6(%r8),%r10 + movl 240(%rcx),%eax + xorl %ebp,%r9d + bswapl %r10d + movl %r9d,80+12(%rsp) + xorl %ebp,%r10d + leaq 7(%r8),%r9 + movl %r10d,96+12(%rsp) + bswapl %r9d + + + xorl %ebp,%r9d + + movl %r9d,112+12(%rsp) + + movups 16(%rcx),%xmm1 + + movdqa 64(%rsp),%xmm6 + movdqa 80(%rsp),%xmm7 + + cmpq $8,%rdx + jb .Lctr32_tail + + subq $6,%rdx + + + + leaq 128(%rcx),%rcx + subq $2,%rdx + jmp .Lctr32_loop8 + + + + + + + + + + +.align 16 +.Lctr32_loop6: + addl $6,%r8d + movups -48(%rcx,%r10,1),%xmm0 +.byte 102,15,56,220,209 + movl %r8d,%eax + xorl %ebp,%eax +.byte 102,15,56,220,217 +.byte 0x0f,0x38,0xf1,0x44,0x24,12 + leal 1(%r8),%eax +.byte 102,15,56,220,225 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,28 +.byte 102,15,56,220,233 + leal 2(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,241 +.byte 0x0f,0x38,0xf1,0x44,0x24,44 + leal 3(%r8),%eax +.byte 102,15,56,220,249 + movups -32(%rcx,%r10,1),%xmm1 + xorl %ebp,%eax + +.byte 102,15,56,220,208 +.byte 0x0f,0x38,0xf1,0x44,0x24,60 + leal 4(%r8),%eax +.byte 102,15,56,220,216 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,76 +.byte 102,15,56,220,224 + leal 5(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,232 +.byte 0x0f,0x38,0xf1,0x44,0x24,92 + movq %r10,%rax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%r10,1),%xmm0 + + call .Lenc_loop6 + + movdqu (%rdi),%xmm8 + movdqu 16(%rdi),%xmm9 + movdqu 32(%rdi),%xmm10 + movdqu 48(%rdi),%xmm11 + movdqu 64(%rdi),%xmm12 + movdqu 80(%rdi),%xmm13 + leaq 96(%rdi),%rdi + movups -64(%rcx,%r10,1),%xmm1 + pxor %xmm2,%xmm8 + movaps 0(%rsp),%xmm2 + pxor %xmm3,%xmm9 + movaps 16(%rsp),%xmm3 + pxor %xmm4,%xmm10 + movaps 32(%rsp),%xmm4 + pxor %xmm5,%xmm11 + movaps 48(%rsp),%xmm5 + pxor %xmm6,%xmm12 + movaps 64(%rsp),%xmm6 + pxor %xmm7,%xmm13 + movaps 80(%rsp),%xmm7 + movdqu %xmm8,(%rsi) + movdqu %xmm9,16(%rsi) + movdqu %xmm10,32(%rsi) + movdqu %xmm11,48(%rsi) + movdqu %xmm12,64(%rsi) + movdqu %xmm13,80(%rsi) + leaq 96(%rsi),%rsi + + subq $6,%rdx + jnc .Lctr32_loop6 + + addq $6,%rdx + jz .Lctr32_done + + leal -48(%r10),%eax 
+ leaq -80(%rcx,%r10,1),%rcx + negl %eax + shrl $4,%eax + jmp .Lctr32_tail + +.align 32 +.Lctr32_loop8: + addl $8,%r8d + movdqa 96(%rsp),%xmm8 +.byte 102,15,56,220,209 + movl %r8d,%r9d + movdqa 112(%rsp),%xmm9 +.byte 102,15,56,220,217 + bswapl %r9d + movups 32-128(%rcx),%xmm0 +.byte 102,15,56,220,225 + xorl %ebp,%r9d + nop +.byte 102,15,56,220,233 + movl %r9d,0+12(%rsp) + leaq 1(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 48-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,16+12(%rsp) + leaq 2(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 64-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,32+12(%rsp) + leaq 3(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 80-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,48+12(%rsp) + leaq 4(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 96-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,64+12(%rsp) + leaq 5(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 112-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,80+12(%rsp) + leaq 6(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 128-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,96+12(%rsp) + leaq 7(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 144-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + xorl %ebp,%r9d + movdqu 0(%rdi),%xmm10 +.byte 102,15,56,220,232 + movl %r9d,112+12(%rsp) + cmpl $11,%eax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 160-128(%rcx),%xmm0 + + jb .Lctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 176-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 192-128(%rcx),%xmm0 + je .Lctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 
208-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 224-128(%rcx),%xmm0 + jmp .Lctr32_enc_done + +.align 16 +.Lctr32_enc_done: + movdqu 16(%rdi),%xmm11 + pxor %xmm0,%xmm10 + movdqu 32(%rdi),%xmm12 + pxor %xmm0,%xmm11 + movdqu 48(%rdi),%xmm13 + pxor %xmm0,%xmm12 + movdqu 64(%rdi),%xmm14 + pxor %xmm0,%xmm13 + movdqu 80(%rdi),%xmm15 + pxor %xmm0,%xmm14 + pxor %xmm0,%xmm15 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movdqu 96(%rdi),%xmm1 + leaq 128(%rdi),%rdi + +.byte 102,65,15,56,221,210 + pxor %xmm0,%xmm1 + movdqu 112-128(%rdi),%xmm10 +.byte 102,65,15,56,221,219 + pxor %xmm0,%xmm10 + movdqa 0(%rsp),%xmm11 +.byte 102,65,15,56,221,228 +.byte 102,65,15,56,221,237 + movdqa 16(%rsp),%xmm12 + movdqa 32(%rsp),%xmm13 +.byte 102,65,15,56,221,246 +.byte 102,65,15,56,221,255 + movdqa 48(%rsp),%xmm14 + movdqa 64(%rsp),%xmm15 +.byte 102,68,15,56,221,193 + movdqa 80(%rsp),%xmm0 + movups 16-128(%rcx),%xmm1 +.byte 102,69,15,56,221,202 + + movups %xmm2,(%rsi) + movdqa %xmm11,%xmm2 + movups %xmm3,16(%rsi) + movdqa %xmm12,%xmm3 + movups %xmm4,32(%rsi) + movdqa %xmm13,%xmm4 + movups %xmm5,48(%rsi) + movdqa %xmm14,%xmm5 + movups %xmm6,64(%rsi) + movdqa %xmm15,%xmm6 + movups %xmm7,80(%rsi) + movdqa %xmm0,%xmm7 + movups %xmm8,96(%rsi) + movups %xmm9,112(%rsi) + leaq 128(%rsi),%rsi + + subq $8,%rdx + jnc .Lctr32_loop8 + + addq $8,%rdx + jz .Lctr32_done + leaq -128(%rcx),%rcx + +.Lctr32_tail: + + + leaq 16(%rcx),%rcx + cmpq $4,%rdx + jb .Lctr32_loop3 + je .Lctr32_loop4 + + + shll $4,%eax + movdqa 96(%rsp),%xmm8 + pxor %xmm9,%xmm9 + + movups 16(%rcx),%xmm0 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + leaq 32-16(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,225 + addq $16,%rax + movups (%rdi),%xmm10 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 + movups 16(%rdi),%xmm11 + movups 32(%rdi),%xmm12 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 + + call .Lenc_loop8_enter + + movdqu 48(%rdi),%xmm13 + pxor %xmm10,%xmm2 + movdqu 64(%rdi),%xmm10 + pxor %xmm11,%xmm3 + movdqu %xmm2,(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm3,16(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm4,32(%rsi) + pxor %xmm10,%xmm6 + movdqu %xmm5,48(%rsi) + movdqu %xmm6,64(%rsi) + cmpq $6,%rdx + jb .Lctr32_done + + movups 80(%rdi),%xmm11 + xorps %xmm11,%xmm7 + movups %xmm7,80(%rsi) + je .Lctr32_done + + movups 96(%rdi),%xmm12 + xorps %xmm12,%xmm8 + movups %xmm8,96(%rsi) + jmp .Lctr32_done + +.align 32 +.Lctr32_loop4: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx),%xmm1 + jnz .Lctr32_loop4 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 + movups (%rdi),%xmm10 + movups 16(%rdi),%xmm11 +.byte 102,15,56,221,225 +.byte 102,15,56,221,233 + movups 32(%rdi),%xmm12 + movups 48(%rdi),%xmm13 + + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm4,32(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm5,48(%rsi) + jmp .Lctr32_done + +.align 32 +.Lctr32_loop3: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx),%xmm1 + jnz .Lctr32_loop3 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 
+.byte 102,15,56,221,225 + + movups (%rdi),%xmm10 + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + cmpq $2,%rdx + jb .Lctr32_done + + movups 16(%rdi),%xmm11 + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + je .Lctr32_done + + movups 32(%rdi),%xmm12 + xorps %xmm12,%xmm4 + movups %xmm4,32(%rsi) + +.Lctr32_done: + xorps %xmm0,%xmm0 + xorl %ebp,%ebp + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0(%rsp) + pxor %xmm8,%xmm8 + movaps %xmm0,16(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,32(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,48(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,64(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,80(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,96(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,112(%rsp) + pxor %xmm15,%xmm15 + movq -8(%r11),%rbp +.cfi_restore %rbp + leaq (%r11),%rsp +.cfi_def_cfa_register %rsp +.Lctr32_epilogue: + ret +.cfi_endproc +.size aesni_ctr32_encrypt_blocks,.-aesni_ctr32_encrypt_blocks +.globl aesni_set_decrypt_key +.type aesni_set_decrypt_key,@function +.align 16 +aesni_set_decrypt_key: +.cfi_startproc +.byte 0x48,0x83,0xEC,0x08 +.cfi_adjust_cfa_offset 8 + call __aesni_set_encrypt_key + shll $4,%esi + testl %eax,%eax + jnz .Ldec_key_ret + leaq 16(%rdx,%rsi,1),%rdi + + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 + movups %xmm0,(%rdi) + movups %xmm1,(%rdx) + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + +.Ldec_key_inverse: + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 +.byte 102,15,56,219,192 +.byte 102,15,56,219,201 + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + movups %xmm0,16(%rdi) + movups %xmm1,-16(%rdx) + cmpq %rdx,%rdi + ja .Ldec_key_inverse + + movups (%rdx),%xmm0 +.byte 102,15,56,219,192 + pxor %xmm1,%xmm1 + movups %xmm0,(%rdi) + pxor %xmm0,%xmm0 +.Ldec_key_ret: + addq $8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_decrypt_key: +.size aesni_set_decrypt_key,.-aesni_set_decrypt_key +.globl aesni_set_encrypt_key +.type aesni_set_encrypt_key,@function +.align 16 +aesni_set_encrypt_key: +__aesni_set_encrypt_key: +.cfi_startproc +.byte 0x48,0x83,0xEC,0x08 +.cfi_adjust_cfa_offset 8 + movq $-1,%rax + testq %rdi,%rdi + jz .Lenc_key_ret + testq %rdx,%rdx + jz .Lenc_key_ret + + movups (%rdi),%xmm0 + xorps %xmm4,%xmm4 + + + + leaq 16(%rdx),%rax + cmpl $256,%esi + je .L14rounds + cmpl $192,%esi + je .L12rounds + cmpl $128,%esi + jne .Lbad_keybits + +.L10rounds: + movl $9,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate(%rip),%xmm5 + movl $8,%r10d + movdqa .Lkey_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,(%rdx) + jmp .Loop_key128 + +.align 16 +.Loop_key128: + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + leaq 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + decl %r10d + jnz .Loop_key128 + + movdqa .Lkey_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + movl %esi,96(%rax) + 
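+# 128-bit schedule complete: %esi (9 rounds) is stored in the schedule's
+# rounds field above; clearing %eax signals success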
xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L12rounds: + movq 16(%rdi),%xmm2 + movl $11,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate192(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + movl $8,%r10d + movdqu %xmm0,(%rdx) + jmp .Loop_key192 + +.align 16 +.Loop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + pslld $1,%xmm4 + leaq 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd $0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + decl %r10d + jnz .Loop_key192 + + movl %esi,32(%rax) + xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.L14rounds: + movups 16(%rdi),%xmm2 + movl $13,%esi + leaq 16(%rax),%rax + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa .Lkey_rotate(%rip),%xmm5 + movdqa .Lkey_rcon1(%rip),%xmm4 + movl $7,%r10d + movdqu %xmm0,0(%rdx) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16(%rdx) + jmp .Loop_key256 + +.align 16 +.Loop_key256: + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + pslld $1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + decl %r10d + jz .Ldone_key256 + + pshufd $0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 +.byte 102,15,56,221,211 + + movdqa %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + leaq 32(%rax),%rax + movdqa %xmm2,%xmm1 + + jmp .Loop_key256 + +.Ldone_key256: + movl %esi,16(%rax) + xorl %eax,%eax + jmp .Lenc_key_ret + +.align 16 +.Lbad_keybits: + movq $-2,%rax +.Lenc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + addq $8,%rsp +.cfi_adjust_cfa_offset -8 + ret +.cfi_endproc +.LSEH_end_set_encrypt_key: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.size aesni_set_encrypt_key,.-aesni_set_encrypt_key +.size __aesni_set_encrypt_key,.-__aesni_set_encrypt_key +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.Lincrement32: +.long 6,6,6,0 +.Lincrement64: +.long 1,0,0,0 +.Lxts_magic: +.long 0x87,0,1,0 +.Lincrement1: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +.Lkey_rotate: +.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +.Lkey_rotate192: +.long 0x04070605,0x04070605,0x04070605,0x04070605 +.Lkey_rcon1: +.long 1,1,1,1 +.Lkey_rcon1b: +.long 0x1b,0x1b,0x1b,0x1b + +.align 64 diff --git a/crypto/aesgcm/aesni_x64_gas_macosx.s b/crypto/aesgcm/aesni_x64_gas_macosx.s new file mode 100644 index 0000000..13e6806 --- /dev/null +++ b/crypto/aesgcm/aesni_x64_gas_macosx.s @@ -0,0 +1,1510 @@ +.text +.globl _aesni_encrypt + +.p2align 4 +_aesni_encrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +L$oop_enc1_1: +.byte 102,15,56,220,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz L$oop_enc1_1 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret + + +.globl _aesni_decrypt + +.p2align 4 +_aesni_decrypt: + movups (%rdi),%xmm2 + movl 240(%rdx),%eax + movups (%rdx),%xmm0 + movups 
16(%rdx),%xmm1 + leaq 32(%rdx),%rdx + xorps %xmm0,%xmm2 +L$oop_dec1_2: +.byte 102,15,56,222,209 + decl %eax + movups (%rdx),%xmm1 + leaq 16(%rdx),%rdx + jnz L$oop_dec1_2 +.byte 102,15,56,223,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + movups %xmm2,(%rsi) + pxor %xmm2,%xmm2 + ret + + +.p2align 4 +_aesni_encrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$enc_loop2: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop2 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 + ret + + +.p2align 4 +_aesni_decrypt2: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$dec_loop2: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop2 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 + ret + + +.p2align 4 +_aesni_encrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$enc_loop3: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop3 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 + ret + + +.p2align 4 +_aesni_decrypt3: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax + addq $16,%rax + +L$dec_loop3: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop3 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 + ret + + +.p2align 4 +_aesni_encrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +L$enc_loop4: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop4 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 + ret + + +.p2align 4 +_aesni_decrypt4: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps 
%xmm0,%xmm2 + xorps %xmm0,%xmm3 + xorps %xmm0,%xmm4 + xorps %xmm0,%xmm5 + movups 32(%rcx),%xmm0 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 0x0f,0x1f,0x00 + addq $16,%rax + +L$dec_loop4: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop4 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 + ret + + +.p2align 4 +_aesni_encrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,220,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,220,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$enc_loop6_enter +.p2align 4 +L$enc_loop6: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +L$enc_loop6_enter: +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop6 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 + ret + + +.p2align 4 +_aesni_decrypt6: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + pxor %xmm0,%xmm3 + pxor %xmm0,%xmm4 +.byte 102,15,56,222,209 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,217 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 +.byte 102,15,56,222,225 + pxor %xmm0,%xmm7 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$dec_loop6_enter +.p2align 4 +L$dec_loop6: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +L$dec_loop6_enter: +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop6 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 + ret + + +.p2align 4 +_aesni_encrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,220,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$enc_loop8_inner +.p2align 4 +L$enc_loop8: +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +L$enc_loop8_inner: +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 
102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +L$enc_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$enc_loop8 + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 +.byte 102,15,56,221,208 +.byte 102,15,56,221,216 +.byte 102,15,56,221,224 +.byte 102,15,56,221,232 +.byte 102,15,56,221,240 +.byte 102,15,56,221,248 +.byte 102,68,15,56,221,192 +.byte 102,68,15,56,221,200 + ret + + +.p2align 4 +_aesni_decrypt8: + movups (%rcx),%xmm0 + shll $4,%eax + movups 16(%rcx),%xmm1 + xorps %xmm0,%xmm2 + xorps %xmm0,%xmm3 + pxor %xmm0,%xmm4 + pxor %xmm0,%xmm5 + pxor %xmm0,%xmm6 + leaq 32(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,222,209 + pxor %xmm0,%xmm7 + pxor %xmm0,%xmm8 +.byte 102,15,56,222,217 + pxor %xmm0,%xmm9 + movups (%rcx,%rax,1),%xmm0 + addq $16,%rax + jmp L$dec_loop8_inner +.p2align 4 +L$dec_loop8: +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +L$dec_loop8_inner: +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +L$dec_loop8_enter: + movups (%rcx,%rax,1),%xmm1 + addq $32,%rax +.byte 102,15,56,222,208 +.byte 102,15,56,222,216 +.byte 102,15,56,222,224 +.byte 102,15,56,222,232 +.byte 102,15,56,222,240 +.byte 102,15,56,222,248 +.byte 102,68,15,56,222,192 +.byte 102,68,15,56,222,200 + movups -16(%rcx,%rax,1),%xmm0 + jnz L$dec_loop8 + +.byte 102,15,56,222,209 +.byte 102,15,56,222,217 +.byte 102,15,56,222,225 +.byte 102,15,56,222,233 +.byte 102,15,56,222,241 +.byte 102,15,56,222,249 +.byte 102,68,15,56,222,193 +.byte 102,68,15,56,222,201 +.byte 102,15,56,223,208 +.byte 102,15,56,223,216 +.byte 102,15,56,223,224 +.byte 102,15,56,223,232 +.byte 102,15,56,223,240 +.byte 102,15,56,223,248 +.byte 102,68,15,56,223,192 +.byte 102,68,15,56,223,200 + ret + +.globl _aesni_ctr32_encrypt_blocks + +.p2align 4 +_aesni_ctr32_encrypt_blocks: + + cmpq $1,%rdx + jne L$ctr32_bulk + + + + movups (%r8),%xmm2 + movups (%rdi),%xmm3 + movl 240(%rcx),%edx + movups (%rcx),%xmm0 + movups 16(%rcx),%xmm1 + leaq 32(%rcx),%rcx + xorps %xmm0,%xmm2 +L$oop_enc1_3: +.byte 102,15,56,220,209 + decl %edx + movups (%rcx),%xmm1 + leaq 16(%rcx),%rcx + jnz L$oop_enc1_3 +.byte 102,15,56,221,209 + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + xorps %xmm3,%xmm2 + pxor %xmm3,%xmm3 + movups %xmm2,(%rsi) + xorps %xmm2,%xmm2 + jmp L$ctr32_epilogue + +.p2align 4 +L$ctr32_bulk: + leaq (%rsp),%r11 + + pushq %rbp + + subq $128,%rsp + andq $-16,%rsp + + + + + movdqu (%r8),%xmm2 + movdqu (%rcx),%xmm0 + movl 12(%r8),%r8d + pxor %xmm0,%xmm2 + movl 12(%rcx),%ebp + movdqa %xmm2,0(%rsp) + bswapl %r8d + movdqa %xmm2,%xmm3 + movdqa %xmm2,%xmm4 + movdqa %xmm2,%xmm5 + movdqa %xmm2,64(%rsp) + movdqa %xmm2,80(%rsp) + movdqa %xmm2,96(%rsp) + movq %rdx,%r10 + movdqa %xmm2,112(%rsp) + + leaq 1(%r8),%rax + leaq 2(%r8),%rdx + bswapl %eax + bswapl %edx + xorl %ebp,%eax + xorl %ebp,%edx +.byte 102,15,58,34,216,3 + leaq 3(%r8),%rax + movdqa %xmm3,16(%rsp) +.byte 102,15,58,34,226,3 + bswapl %eax + movq %r10,%rdx + leaq 4(%r8),%r10 + movdqa %xmm4,32(%rsp) + xorl %ebp,%eax + bswapl %r10d +.byte 102,15,58,34,232,3 + 
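+# each .byte 102,15,58,34,... is pinsrd, patching the next incremented,
+# byte-swapped counter word (pre-xored with the last word of round key 0,
+# kept in %ebp) into lane 3 of a stacked counter block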
xorl %ebp,%r10d + movdqa %xmm5,48(%rsp) + leaq 5(%r8),%r9 + movl %r10d,64+12(%rsp) + bswapl %r9d + leaq 6(%r8),%r10 + movl 240(%rcx),%eax + xorl %ebp,%r9d + bswapl %r10d + movl %r9d,80+12(%rsp) + xorl %ebp,%r10d + leaq 7(%r8),%r9 + movl %r10d,96+12(%rsp) + bswapl %r9d + + + xorl %ebp,%r9d + + movl %r9d,112+12(%rsp) + + movups 16(%rcx),%xmm1 + + movdqa 64(%rsp),%xmm6 + movdqa 80(%rsp),%xmm7 + + cmpq $8,%rdx + jb L$ctr32_tail + + subq $6,%rdx + + + + leaq 128(%rcx),%rcx + subq $2,%rdx + jmp L$ctr32_loop8 + + + + + + + + + + +.p2align 4 +L$ctr32_loop6: + addl $6,%r8d + movups -48(%rcx,%r10,1),%xmm0 +.byte 102,15,56,220,209 + movl %r8d,%eax + xorl %ebp,%eax +.byte 102,15,56,220,217 +.byte 0x0f,0x38,0xf1,0x44,0x24,12 + leal 1(%r8),%eax +.byte 102,15,56,220,225 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,28 +.byte 102,15,56,220,233 + leal 2(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,241 +.byte 0x0f,0x38,0xf1,0x44,0x24,44 + leal 3(%r8),%eax +.byte 102,15,56,220,249 + movups -32(%rcx,%r10,1),%xmm1 + xorl %ebp,%eax + +.byte 102,15,56,220,208 +.byte 0x0f,0x38,0xf1,0x44,0x24,60 + leal 4(%r8),%eax +.byte 102,15,56,220,216 + xorl %ebp,%eax +.byte 0x0f,0x38,0xf1,0x44,0x24,76 +.byte 102,15,56,220,224 + leal 5(%r8),%eax + xorl %ebp,%eax +.byte 102,15,56,220,232 +.byte 0x0f,0x38,0xf1,0x44,0x24,92 + movq %r10,%rax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 + movups -16(%rcx,%r10,1),%xmm0 + + call L$enc_loop6 + + movdqu (%rdi),%xmm8 + movdqu 16(%rdi),%xmm9 + movdqu 32(%rdi),%xmm10 + movdqu 48(%rdi),%xmm11 + movdqu 64(%rdi),%xmm12 + movdqu 80(%rdi),%xmm13 + leaq 96(%rdi),%rdi + movups -64(%rcx,%r10,1),%xmm1 + pxor %xmm2,%xmm8 + movaps 0(%rsp),%xmm2 + pxor %xmm3,%xmm9 + movaps 16(%rsp),%xmm3 + pxor %xmm4,%xmm10 + movaps 32(%rsp),%xmm4 + pxor %xmm5,%xmm11 + movaps 48(%rsp),%xmm5 + pxor %xmm6,%xmm12 + movaps 64(%rsp),%xmm6 + pxor %xmm7,%xmm13 + movaps 80(%rsp),%xmm7 + movdqu %xmm8,(%rsi) + movdqu %xmm9,16(%rsi) + movdqu %xmm10,32(%rsi) + movdqu %xmm11,48(%rsi) + movdqu %xmm12,64(%rsi) + movdqu %xmm13,80(%rsi) + leaq 96(%rsi),%rsi + + subq $6,%rdx + jnc L$ctr32_loop6 + + addq $6,%rdx + jz L$ctr32_done + + leal -48(%r10),%eax + leaq -80(%rcx,%r10,1),%rcx + negl %eax + shrl $4,%eax + jmp L$ctr32_tail + +.p2align 5 +L$ctr32_loop8: + addl $8,%r8d + movdqa 96(%rsp),%xmm8 +.byte 102,15,56,220,209 + movl %r8d,%r9d + movdqa 112(%rsp),%xmm9 +.byte 102,15,56,220,217 + bswapl %r9d + movups 32-128(%rcx),%xmm0 +.byte 102,15,56,220,225 + xorl %ebp,%r9d + nop +.byte 102,15,56,220,233 + movl %r9d,0+12(%rsp) + leaq 1(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 48-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,16+12(%rsp) + leaq 2(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 64-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,32+12(%rsp) + leaq 3(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 80-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,48+12(%rsp) + leaq 4(%r8),%r9 +.byte 102,15,56,220,240 +.byte 
102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 96-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,64+12(%rsp) + leaq 5(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 112-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 + movl %r9d,80+12(%rsp) + leaq 6(%r8),%r9 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 128-128(%rcx),%xmm0 + bswapl %r9d +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + xorl %ebp,%r9d +.byte 0x66,0x90 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movl %r9d,96+12(%rsp) + leaq 7(%r8),%r9 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 144-128(%rcx),%xmm1 + bswapl %r9d +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 + xorl %ebp,%r9d + movdqu 0(%rdi),%xmm10 +.byte 102,15,56,220,232 + movl %r9d,112+12(%rsp) + cmpl $11,%eax +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 160-128(%rcx),%xmm0 + + jb L$ctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 176-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 192-128(%rcx),%xmm0 + je L$ctr32_enc_done + +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movups 208-128(%rcx),%xmm1 + +.byte 102,15,56,220,208 +.byte 102,15,56,220,216 +.byte 102,15,56,220,224 +.byte 102,15,56,220,232 +.byte 102,15,56,220,240 +.byte 102,15,56,220,248 +.byte 102,68,15,56,220,192 +.byte 102,68,15,56,220,200 + movups 224-128(%rcx),%xmm0 + jmp L$ctr32_enc_done + +.p2align 4 +L$ctr32_enc_done: + movdqu 16(%rdi),%xmm11 + pxor %xmm0,%xmm10 + movdqu 32(%rdi),%xmm12 + pxor %xmm0,%xmm11 + movdqu 48(%rdi),%xmm13 + pxor %xmm0,%xmm12 + movdqu 64(%rdi),%xmm14 + pxor %xmm0,%xmm13 + movdqu 80(%rdi),%xmm15 + pxor %xmm0,%xmm14 + pxor %xmm0,%xmm15 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 +.byte 102,68,15,56,220,201 + movdqu 96(%rdi),%xmm1 + leaq 128(%rdi),%rdi + +.byte 102,65,15,56,221,210 + pxor %xmm0,%xmm1 + movdqu 112-128(%rdi),%xmm10 +.byte 102,65,15,56,221,219 + pxor %xmm0,%xmm10 + movdqa 0(%rsp),%xmm11 +.byte 102,65,15,56,221,228 +.byte 102,65,15,56,221,237 + movdqa 16(%rsp),%xmm12 + movdqa 32(%rsp),%xmm13 +.byte 102,65,15,56,221,246 +.byte 102,65,15,56,221,255 + movdqa 48(%rsp),%xmm14 + movdqa 64(%rsp),%xmm15 +.byte 102,68,15,56,221,193 + movdqa 80(%rsp),%xmm0 + movups 16-128(%rcx),%xmm1 +.byte 102,69,15,56,221,202 + + movups %xmm2,(%rsi) + movdqa %xmm11,%xmm2 + movups %xmm3,16(%rsi) + movdqa %xmm12,%xmm3 + movups %xmm4,32(%rsi) + movdqa 
%xmm13,%xmm4 + movups %xmm5,48(%rsi) + movdqa %xmm14,%xmm5 + movups %xmm6,64(%rsi) + movdqa %xmm15,%xmm6 + movups %xmm7,80(%rsi) + movdqa %xmm0,%xmm7 + movups %xmm8,96(%rsi) + movups %xmm9,112(%rsi) + leaq 128(%rsi),%rsi + + subq $8,%rdx + jnc L$ctr32_loop8 + + addq $8,%rdx + jz L$ctr32_done + leaq -128(%rcx),%rcx + +L$ctr32_tail: + + + leaq 16(%rcx),%rcx + cmpq $4,%rdx + jb L$ctr32_loop3 + je L$ctr32_loop4 + + + shll $4,%eax + movdqa 96(%rsp),%xmm8 + pxor %xmm9,%xmm9 + + movups 16(%rcx),%xmm0 +.byte 102,15,56,220,209 +.byte 102,15,56,220,217 + leaq 32-16(%rcx,%rax,1),%rcx + negq %rax +.byte 102,15,56,220,225 + addq $16,%rax + movups (%rdi),%xmm10 +.byte 102,15,56,220,233 +.byte 102,15,56,220,241 + movups 16(%rdi),%xmm11 + movups 32(%rdi),%xmm12 +.byte 102,15,56,220,249 +.byte 102,68,15,56,220,193 + + call L$enc_loop8_enter + + movdqu 48(%rdi),%xmm13 + pxor %xmm10,%xmm2 + movdqu 64(%rdi),%xmm10 + pxor %xmm11,%xmm3 + movdqu %xmm2,(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm3,16(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm4,32(%rsi) + pxor %xmm10,%xmm6 + movdqu %xmm5,48(%rsi) + movdqu %xmm6,64(%rsi) + cmpq $6,%rdx + jb L$ctr32_done + + movups 80(%rdi),%xmm11 + xorps %xmm11,%xmm7 + movups %xmm7,80(%rsi) + je L$ctr32_done + + movups 96(%rdi),%xmm12 + xorps %xmm12,%xmm8 + movups %xmm8,96(%rsi) + jmp L$ctr32_done + +.p2align 5 +L$ctr32_loop4: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 +.byte 102,15,56,220,233 + movups (%rcx),%xmm1 + jnz L$ctr32_loop4 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 + movups (%rdi),%xmm10 + movups 16(%rdi),%xmm11 +.byte 102,15,56,221,225 +.byte 102,15,56,221,233 + movups 32(%rdi),%xmm12 + movups 48(%rdi),%xmm13 + + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + pxor %xmm12,%xmm4 + movdqu %xmm4,32(%rsi) + pxor %xmm13,%xmm5 + movdqu %xmm5,48(%rsi) + jmp L$ctr32_done + +.p2align 5 +L$ctr32_loop3: +.byte 102,15,56,220,209 + leaq 16(%rcx),%rcx + decl %eax +.byte 102,15,56,220,217 +.byte 102,15,56,220,225 + movups (%rcx),%xmm1 + jnz L$ctr32_loop3 +.byte 102,15,56,221,209 +.byte 102,15,56,221,217 +.byte 102,15,56,221,225 + + movups (%rdi),%xmm10 + xorps %xmm10,%xmm2 + movups %xmm2,(%rsi) + cmpq $2,%rdx + jb L$ctr32_done + + movups 16(%rdi),%xmm11 + xorps %xmm11,%xmm3 + movups %xmm3,16(%rsi) + je L$ctr32_done + + movups 32(%rdi),%xmm12 + xorps %xmm12,%xmm4 + movups %xmm4,32(%rsi) + +L$ctr32_done: + xorps %xmm0,%xmm0 + xorl %ebp,%ebp + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + pxor %xmm6,%xmm6 + pxor %xmm7,%xmm7 + movaps %xmm0,0(%rsp) + pxor %xmm8,%xmm8 + movaps %xmm0,16(%rsp) + pxor %xmm9,%xmm9 + movaps %xmm0,32(%rsp) + pxor %xmm10,%xmm10 + movaps %xmm0,48(%rsp) + pxor %xmm11,%xmm11 + movaps %xmm0,64(%rsp) + pxor %xmm12,%xmm12 + movaps %xmm0,80(%rsp) + pxor %xmm13,%xmm13 + movaps %xmm0,96(%rsp) + pxor %xmm14,%xmm14 + movaps %xmm0,112(%rsp) + pxor %xmm15,%xmm15 + movq -8(%r11),%rbp + + leaq (%r11),%rsp + +L$ctr32_epilogue: + ret + + +.globl _aesni_set_decrypt_key + +.p2align 4 +_aesni_set_decrypt_key: + +.byte 0x48,0x83,0xEC,0x08 + + call __aesni_set_encrypt_key + shll $4,%esi + testl %eax,%eax + jnz L$dec_key_ret + leaq 16(%rdx,%rsi,1),%rdi + + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 + movups %xmm0,(%rdi) + movups %xmm1,(%rdx) + leaq 16(%rdx),%rdx + leaq -16(%rdi),%rdi + +L$dec_key_inverse: + movups (%rdx),%xmm0 + movups (%rdi),%xmm1 +.byte 102,15,56,219,192 +.byte 102,15,56,219,201 + leaq 16(%rdx),%rdx + leaq 
-16(%rdi),%rdi + movups %xmm0,16(%rdi) + movups %xmm1,-16(%rdx) + cmpq %rdx,%rdi + ja L$dec_key_inverse + + movups (%rdx),%xmm0 +.byte 102,15,56,219,192 + pxor %xmm1,%xmm1 + movups %xmm0,(%rdi) + pxor %xmm0,%xmm0 +L$dec_key_ret: + addq $8,%rsp + + ret + +L$SEH_end_set_decrypt_key: + +.globl _aesni_set_encrypt_key + +.p2align 4 +_aesni_set_encrypt_key: +__aesni_set_encrypt_key: + +.byte 0x48,0x83,0xEC,0x08 + + movq $-1,%rax + testq %rdi,%rdi + jz L$enc_key_ret + testq %rdx,%rdx + jz L$enc_key_ret + + movups (%rdi),%xmm0 + xorps %xmm4,%xmm4 + + + + leaq 16(%rdx),%rax + cmpl $256,%esi + je L$14rounds + cmpl $192,%esi + je L$12rounds + cmpl $128,%esi + jne L$bad_keybits + +L$10rounds: + movl $9,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate(%rip),%xmm5 + movl $8,%r10d + movdqa L$key_rcon1(%rip),%xmm4 + movdqa %xmm0,%xmm2 + movdqu %xmm0,(%rdx) + jmp L$oop_key128 + +.p2align 4 +L$oop_key128: + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + leaq 16(%rax),%rax + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,-16(%rax) + movdqa %xmm0,%xmm2 + + decl %r10d + jnz L$oop_key128 + + movdqa L$key_rcon1b(%rip),%xmm4 + + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + pslld $1,%xmm4 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + movdqa %xmm0,%xmm2 + pshufb %xmm5,%xmm0 +.byte 102,15,56,221,196 + + movdqa %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm2,%xmm3 + pslldq $4,%xmm2 + pxor %xmm3,%xmm2 + + pxor %xmm2,%xmm0 + movdqu %xmm0,16(%rax) + + movl %esi,96(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$12rounds: + movq 16(%rdi),%xmm2 + movl $11,%esi + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate192(%rip),%xmm5 + movdqa L$key_rcon1(%rip),%xmm4 + movl $8,%r10d + movdqu %xmm0,(%rdx) + jmp L$oop_key192 + +.p2align 4 +L$oop_key192: + movq %xmm2,0(%rax) + movdqa %xmm2,%xmm1 + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + pslld $1,%xmm4 + leaq 24(%rax),%rax + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + + pshufd $0xff,%xmm0,%xmm3 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + + pxor %xmm2,%xmm0 + pxor %xmm3,%xmm2 + movdqu %xmm0,-16(%rax) + + decl %r10d + jnz L$oop_key192 + + movl %esi,32(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$14rounds: + movups 16(%rdi),%xmm2 + movl $13,%esi + leaq 16(%rax),%rax + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + movdqa L$key_rotate(%rip),%xmm5 + movdqa L$key_rcon1(%rip),%xmm4 + movl $7,%r10d + movdqu %xmm0,0(%rdx) + movdqa %xmm2,%xmm1 + movdqu %xmm2,16(%rdx) + jmp L$oop_key256 + +.p2align 4 +L$oop_key256: + pshufb %xmm5,%xmm2 +.byte 102,15,56,221,212 + + movdqa %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm0,%xmm3 + pslldq $4,%xmm0 + pxor %xmm3,%xmm0 + pslld $1,%xmm4 + + pxor %xmm2,%xmm0 + movdqu %xmm0,(%rax) + + decl %r10d + jz L$done_key256 + + pshufd $0xff,%xmm0,%xmm2 + pxor %xmm3,%xmm3 +.byte 102,15,56,221,211 + + movdqa %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm1,%xmm3 + pslldq $4,%xmm1 + pxor %xmm3,%xmm1 + + pxor %xmm1,%xmm2 + movdqu %xmm2,16(%rax) + leaq 32(%rax),%rax + movdqa 
%xmm2,%xmm1 + + jmp L$oop_key256 + +L$done_key256: + movl %esi,16(%rax) + xorl %eax,%eax + jmp L$enc_key_ret + +.p2align 4 +L$bad_keybits: + movq $-2,%rax +L$enc_key_ret: + pxor %xmm0,%xmm0 + pxor %xmm1,%xmm1 + pxor %xmm2,%xmm2 + pxor %xmm3,%xmm3 + pxor %xmm4,%xmm4 + pxor %xmm5,%xmm5 + addq $8,%rsp + + ret + +L$SEH_end_set_encrypt_key: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.p2align 6 +L$bswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +L$increment32: +.long 6,6,6,0 +L$increment64: +.long 1,0,0,0 +L$xts_magic: +.long 0x87,0,1,0 +L$increment1: +.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +L$key_rotate: +.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +L$key_rotate192: +.long 0x04070605,0x04070605,0x04070605,0x04070605 +L$key_rcon1: +.long 1,1,1,1 +L$key_rcon1b: +.long 0x1b,0x1b,0x1b,0x1b + +.p2align 6 diff --git a/crypto/aesgcm/aesni_x64_nasm.asm b/crypto/aesgcm/aesni_x64_nasm.asm new file mode 100644 index 0000000..f464cf9 --- /dev/null +++ b/crypto/aesgcm/aesni_x64_nasm.asm @@ -0,0 +1,1723 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +section .text code align=64 + +global aesni_encrypt + +ALIGN 16 +aesni_encrypt: + movups xmm2,XMMWORD[rcx] + mov eax,DWORD[240+r8] + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[16+r8] + lea r8,[32+r8] + xorps xmm2,xmm0 +$L$oop_enc1_1: +DB 102,15,56,220,209 + dec eax + movups xmm1,XMMWORD[r8] + lea r8,[16+r8] + jnz NEAR $L$oop_enc1_1 +DB 102,15,56,221,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + movups XMMWORD[rdx],xmm2 + pxor xmm2,xmm2 + ret + + +global aesni_decrypt + +ALIGN 16 +aesni_decrypt: + movups xmm2,XMMWORD[rcx] + mov eax,DWORD[240+r8] + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[16+r8] + lea r8,[32+r8] + xorps xmm2,xmm0 +$L$oop_dec1_2: +DB 102,15,56,222,209 + dec eax + movups xmm1,XMMWORD[r8] + lea r8,[16+r8] + jnz NEAR $L$oop_dec1_2 +DB 102,15,56,223,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + movups XMMWORD[rdx],xmm2 + pxor xmm2,xmm2 + ret + + +ALIGN 16 +_aesni_encrypt2: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$enc_loop2: +DB 102,15,56,220,209 +DB 102,15,56,220,217 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop2 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,221,208 +DB 102,15,56,221,216 + ret + + +ALIGN 16 +_aesni_decrypt2: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$dec_loop2: +DB 102,15,56,222,209 +DB 102,15,56,222,217 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop2 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,223,208 +DB 102,15,56,223,216 + ret + + +ALIGN 16 +_aesni_encrypt3: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$enc_loop3: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 + movups 
xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop3 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 + ret + + +ALIGN 16 +_aesni_decrypt3: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax + add rax,16 + +$L$dec_loop3: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop3 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 + ret + + +ALIGN 16 +_aesni_encrypt4: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + xorps xmm5,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax +DB 0x0f,0x1f,0x00 + add rax,16 + +$L$enc_loop4: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop4 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 + ret + + +ALIGN 16 +_aesni_decrypt4: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + xorps xmm4,xmm0 + xorps xmm5,xmm0 + movups xmm0,XMMWORD[32+rcx] + lea rcx,[32+rax*1+rcx] + neg rax +DB 0x0f,0x1f,0x00 + add rax,16 + +$L$dec_loop4: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop4 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 + ret + + +ALIGN 16 +_aesni_encrypt6: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + pxor xmm3,xmm0 + pxor xmm4,xmm0 +DB 102,15,56,220,209 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,220,217 + pxor xmm5,xmm0 + pxor xmm6,xmm0 +DB 102,15,56,220,225 + pxor xmm7,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$enc_loop6_enter +ALIGN 16 +$L$enc_loop6: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +$L$enc_loop6_enter: +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop6 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 +DB 102,15,56,221,240 +DB 102,15,56,221,248 + ret + + +ALIGN 16 +_aesni_decrypt6: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + pxor xmm3,xmm0 
+ pxor xmm4,xmm0 +DB 102,15,56,222,209 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,222,217 + pxor xmm5,xmm0 + pxor xmm6,xmm0 +DB 102,15,56,222,225 + pxor xmm7,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$dec_loop6_enter +ALIGN 16 +$L$dec_loop6: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +$L$dec_loop6_enter: +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 +DB 102,15,56,222,240 +DB 102,15,56,222,248 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop6 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 +DB 102,15,56,223,240 +DB 102,15,56,223,248 + ret + + +ALIGN 16 +_aesni_encrypt8: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + pxor xmm4,xmm0 + pxor xmm5,xmm0 + pxor xmm6,xmm0 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,220,209 + pxor xmm7,xmm0 + pxor xmm8,xmm0 +DB 102,15,56,220,217 + pxor xmm9,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$enc_loop8_inner +ALIGN 16 +$L$enc_loop8: +DB 102,15,56,220,209 +DB 102,15,56,220,217 +$L$enc_loop8_inner: +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 +$L$enc_loop8_enter: + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$enc_loop8 + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 +DB 102,15,56,221,208 +DB 102,15,56,221,216 +DB 102,15,56,221,224 +DB 102,15,56,221,232 +DB 102,15,56,221,240 +DB 102,15,56,221,248 +DB 102,68,15,56,221,192 +DB 102,68,15,56,221,200 + ret + + +ALIGN 16 +_aesni_decrypt8: + movups xmm0,XMMWORD[rcx] + shl eax,4 + movups xmm1,XMMWORD[16+rcx] + xorps xmm2,xmm0 + xorps xmm3,xmm0 + pxor xmm4,xmm0 + pxor xmm5,xmm0 + pxor xmm6,xmm0 + lea rcx,[32+rax*1+rcx] + neg rax +DB 102,15,56,222,209 + pxor xmm7,xmm0 + pxor xmm8,xmm0 +DB 102,15,56,222,217 + pxor xmm9,xmm0 + movups xmm0,XMMWORD[rax*1+rcx] + add rax,16 + jmp NEAR $L$dec_loop8_inner +ALIGN 16 +$L$dec_loop8: +DB 102,15,56,222,209 +DB 102,15,56,222,217 +$L$dec_loop8_inner: +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,68,15,56,222,193 +DB 102,68,15,56,222,201 +$L$dec_loop8_enter: + movups xmm1,XMMWORD[rax*1+rcx] + add rax,32 +DB 102,15,56,222,208 +DB 102,15,56,222,216 +DB 102,15,56,222,224 +DB 102,15,56,222,232 +DB 102,15,56,222,240 +DB 102,15,56,222,248 +DB 102,68,15,56,222,192 +DB 102,68,15,56,222,200 + movups xmm0,XMMWORD[((-16))+rax*1+rcx] + jnz NEAR $L$dec_loop8 + +DB 102,15,56,222,209 +DB 102,15,56,222,217 +DB 102,15,56,222,225 +DB 102,15,56,222,233 +DB 102,15,56,222,241 +DB 102,15,56,222,249 +DB 102,68,15,56,222,193 +DB 102,68,15,56,222,201 +DB 102,15,56,223,208 +DB 102,15,56,223,216 +DB 102,15,56,223,224 +DB 102,15,56,223,232 +DB 102,15,56,223,240 +DB 102,15,56,223,248 +DB 102,68,15,56,223,192 +DB 102,68,15,56,223,200 + ret 
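The bulk routine that follows implements CTR mode with a 32-bit big-endian counter, the variant AES-GCM needs: only bytes 12..15 of the 16-byte counter block are incremented, wrapping modulo 2^32, while the upper 96 bits stay fixed. A plain-C model of what the routine computes (not of how the eight-block AES-NI pipeline computes it) is sketched below; aes_encrypt_block() is a hypothetical single-block primitive standing in for the keyed rounds, and the parameter order mirrors the usual OpenSSL convention for this entry point, assumed rather than taken from this source:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical single-block AES primitive (key schedule + rounds). */
    void aes_encrypt_block(const uint8_t in[16], uint8_t out[16], const void *key);

    /* Reference model of 32-bit-counter CTR mode: bytes 12..15 of ivec
     * form a big-endian counter incremented once per 16-byte block; the
     * upper 96 bits never change, and the counter wraps modulo 2^32. */
    static void ctr32_encrypt_ref(const uint8_t *in, uint8_t *out, size_t blocks,
                                  const void *key, const uint8_t ivec[16])
    {
        uint8_t ctr[16], ks[16];
        memcpy(ctr, ivec, 16);
        uint32_t c = ((uint32_t)ctr[12] << 24) | ((uint32_t)ctr[13] << 16) |
                     ((uint32_t)ctr[14] << 8) | (uint32_t)ctr[15];
        while (blocks--) {
            aes_encrypt_block(ctr, ks, key);          /* keystream block */
            for (int i = 0; i < 16; i++)
                out[i] = in[i] ^ ks[i];               /* XOR into output */
            in += 16;
            out += 16;
            c++;                                      /* 32-bit wrap only */
            ctr[12] = (uint8_t)(c >> 24);
            ctr[13] = (uint8_t)(c >> 16);
            ctr[14] = (uint8_t)(c >> 8);
            ctr[15] = (uint8_t)c;
        }
    }
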
+ +global aesni_ctr32_encrypt_blocks + +ALIGN 16 +aesni_ctr32_encrypt_blocks: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_aesni_ctr32_encrypt_blocks: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + + cmp rdx,1 + jne NEAR $L$ctr32_bulk + + + + movups xmm2,XMMWORD[r8] + movups xmm3,XMMWORD[rdi] + mov edx,DWORD[240+rcx] + movups xmm0,XMMWORD[rcx] + movups xmm1,XMMWORD[16+rcx] + lea rcx,[32+rcx] + xorps xmm2,xmm0 +$L$oop_enc1_3: +DB 102,15,56,220,209 + dec edx + movups xmm1,XMMWORD[rcx] + lea rcx,[16+rcx] + jnz NEAR $L$oop_enc1_3 +DB 102,15,56,221,209 + pxor xmm0,xmm0 + pxor xmm1,xmm1 + xorps xmm2,xmm3 + pxor xmm3,xmm3 + movups XMMWORD[rsi],xmm2 + xorps xmm2,xmm2 + jmp NEAR $L$ctr32_epilogue + +ALIGN 16 +$L$ctr32_bulk: + lea r11,[rsp] + + push rbp + + sub rsp,288 + and rsp,-16 + movaps XMMWORD[(-168)+r11],xmm6 + movaps XMMWORD[(-152)+r11],xmm7 + movaps XMMWORD[(-136)+r11],xmm8 + movaps XMMWORD[(-120)+r11],xmm9 + movaps XMMWORD[(-104)+r11],xmm10 + movaps XMMWORD[(-88)+r11],xmm11 + movaps XMMWORD[(-72)+r11],xmm12 + movaps XMMWORD[(-56)+r11],xmm13 + movaps XMMWORD[(-40)+r11],xmm14 + movaps XMMWORD[(-24)+r11],xmm15 +$L$ctr32_body: + + + + + movdqu xmm2,XMMWORD[r8] + movdqu xmm0,XMMWORD[rcx] + mov r8d,DWORD[12+r8] + pxor xmm2,xmm0 + mov ebp,DWORD[12+rcx] + movdqa XMMWORD[rsp],xmm2 + bswap r8d + movdqa xmm3,xmm2 + movdqa xmm4,xmm2 + movdqa xmm5,xmm2 + movdqa XMMWORD[64+rsp],xmm2 + movdqa XMMWORD[80+rsp],xmm2 + movdqa XMMWORD[96+rsp],xmm2 + mov r10,rdx + movdqa XMMWORD[112+rsp],xmm2 + + lea rax,[1+r8] + lea rdx,[2+r8] + bswap eax + bswap edx + xor eax,ebp + xor edx,ebp +DB 102,15,58,34,216,3 + lea rax,[3+r8] + movdqa XMMWORD[16+rsp],xmm3 +DB 102,15,58,34,226,3 + bswap eax + mov rdx,r10 + lea r10,[4+r8] + movdqa XMMWORD[32+rsp],xmm4 + xor eax,ebp + bswap r10d +DB 102,15,58,34,232,3 + xor r10d,ebp + movdqa XMMWORD[48+rsp],xmm5 + lea r9,[5+r8] + mov DWORD[((64+12))+rsp],r10d + bswap r9d + lea r10,[6+r8] + mov eax,DWORD[240+rcx] + xor r9d,ebp + bswap r10d + mov DWORD[((80+12))+rsp],r9d + xor r10d,ebp + lea r9,[7+r8] + mov DWORD[((96+12))+rsp],r10d + bswap r9d +; leaq OPENSSL_ia32cap_P(%rip),%r10 +; mov 4(%r10),%r10d + xor r9d,ebp +; and $71303168,%r10d + mov DWORD[((112+12))+rsp],r9d + + movups xmm1,XMMWORD[16+rcx] + + movdqa xmm6,XMMWORD[64+rsp] + movdqa xmm7,XMMWORD[80+rsp] + + cmp rdx,8 + jb NEAR $L$ctr32_tail + + sub rdx,6 +; cmp $4194304,%r10d +; je .Lctr32_6x + + lea rcx,[128+rcx] + sub rdx,2 + jmp NEAR $L$ctr32_loop8 + +;.align 16 +;.Lctr32_6x: +; shl $4,%eax +; mov $48,%r10d +; bswap %ebp +; lea 32(%rcx,%eax),%rcx +; sub %rax,%r10 +; jmp .Lctr32_loop6 + +ALIGN 16 +$L$ctr32_loop6: + add r8d,6 + movups xmm0,XMMWORD[((-48))+r10*1+rcx] +DB 102,15,56,220,209 + mov eax,r8d + xor eax,ebp +DB 102,15,56,220,217 +DB 0x0f,0x38,0xf1,0x44,0x24,12 + lea eax,[1+r8] +DB 102,15,56,220,225 + xor eax,ebp +DB 0x0f,0x38,0xf1,0x44,0x24,28 +DB 102,15,56,220,233 + lea eax,[2+r8] + xor eax,ebp +DB 102,15,56,220,241 +DB 0x0f,0x38,0xf1,0x44,0x24,44 + lea eax,[3+r8] +DB 102,15,56,220,249 + movups xmm1,XMMWORD[((-32))+r10*1+rcx] + xor eax,ebp + +DB 102,15,56,220,208 +DB 0x0f,0x38,0xf1,0x44,0x24,60 + lea eax,[4+r8] +DB 102,15,56,220,216 + xor eax,ebp +DB 0x0f,0x38,0xf1,0x44,0x24,76 +DB 102,15,56,220,224 + lea eax,[5+r8] + xor eax,ebp +DB 102,15,56,220,232 +DB 0x0f,0x38,0xf1,0x44,0x24,92 + mov rax,r10 +DB 102,15,56,220,240 +DB 102,15,56,220,248 + movups xmm0,XMMWORD[((-16))+r10*1+rcx] + + call $L$enc_loop6 + + movdqu xmm8,XMMWORD[rdi] + movdqu 
xmm9,XMMWORD[16+rdi] + movdqu xmm10,XMMWORD[32+rdi] + movdqu xmm11,XMMWORD[48+rdi] + movdqu xmm12,XMMWORD[64+rdi] + movdqu xmm13,XMMWORD[80+rdi] + lea rdi,[96+rdi] + movups xmm1,XMMWORD[((-64))+r10*1+rcx] + pxor xmm8,xmm2 + movaps xmm2,XMMWORD[rsp] + pxor xmm9,xmm3 + movaps xmm3,XMMWORD[16+rsp] + pxor xmm10,xmm4 + movaps xmm4,XMMWORD[32+rsp] + pxor xmm11,xmm5 + movaps xmm5,XMMWORD[48+rsp] + pxor xmm12,xmm6 + movaps xmm6,XMMWORD[64+rsp] + pxor xmm13,xmm7 + movaps xmm7,XMMWORD[80+rsp] + movdqu XMMWORD[rsi],xmm8 + movdqu XMMWORD[16+rsi],xmm9 + movdqu XMMWORD[32+rsi],xmm10 + movdqu XMMWORD[48+rsi],xmm11 + movdqu XMMWORD[64+rsi],xmm12 + movdqu XMMWORD[80+rsi],xmm13 + lea rsi,[96+rsi] + + sub rdx,6 + jnc NEAR $L$ctr32_loop6 + + add rdx,6 + jz NEAR $L$ctr32_done + + lea eax,[((-48))+r10] + lea rcx,[((-80))+r10*1+rcx] + neg eax + shr eax,4 + jmp NEAR $L$ctr32_tail + +ALIGN 32 +$L$ctr32_loop8: + add r8d,8 + movdqa xmm8,XMMWORD[96+rsp] +DB 102,15,56,220,209 + mov r9d,r8d + movdqa xmm9,XMMWORD[112+rsp] +DB 102,15,56,220,217 + bswap r9d + movups xmm0,XMMWORD[((32-128))+rcx] +DB 102,15,56,220,225 + xor r9d,ebp + nop +DB 102,15,56,220,233 + mov DWORD[((0+12))+rsp],r9d + lea r9,[1+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((48-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((16+12))+rsp],r9d + lea r9,[2+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((64-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((32+12))+rsp],r9d + lea r9,[3+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((80-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((48+12))+rsp],r9d + lea r9,[4+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((96-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((64+12))+rsp],r9d + lea r9,[5+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((112-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,224 +DB 102,15,56,220,232 + mov DWORD[((80+12))+rsp],r9d + lea r9,[6+r8] +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((128-128))+rcx] + bswap r9d +DB 102,15,56,220,209 +DB 102,15,56,220,217 + xor r9d,ebp +DB 0x66,0x90 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + mov DWORD[((96+12))+rsp],r9d + lea r9,[7+r8] +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((144-128))+rcx] + bswap r9d +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 + xor r9d,ebp + movdqu xmm10,XMMWORD[rdi] +DB 102,15,56,220,232 + mov DWORD[((112+12))+rsp],r9d + cmp eax,11 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((160-128))+rcx] + + jb NEAR $L$ctr32_enc_done + +DB 102,15,56,220,209 +DB 
102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((176-128))+rcx] + +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((192-128))+rcx] + je NEAR $L$ctr32_enc_done + +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movups xmm1,XMMWORD[((208-128))+rcx] + +DB 102,15,56,220,208 +DB 102,15,56,220,216 +DB 102,15,56,220,224 +DB 102,15,56,220,232 +DB 102,15,56,220,240 +DB 102,15,56,220,248 +DB 102,68,15,56,220,192 +DB 102,68,15,56,220,200 + movups xmm0,XMMWORD[((224-128))+rcx] + jmp NEAR $L$ctr32_enc_done + +ALIGN 16 +$L$ctr32_enc_done: + movdqu xmm11,XMMWORD[16+rdi] + pxor xmm10,xmm0 + movdqu xmm12,XMMWORD[32+rdi] + pxor xmm11,xmm0 + movdqu xmm13,XMMWORD[48+rdi] + pxor xmm12,xmm0 + movdqu xmm14,XMMWORD[64+rdi] + pxor xmm13,xmm0 + movdqu xmm15,XMMWORD[80+rdi] + pxor xmm14,xmm0 + pxor xmm15,xmm0 +DB 102,15,56,220,209 +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 +DB 102,15,56,220,241 +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 +DB 102,68,15,56,220,201 + movdqu xmm1,XMMWORD[96+rdi] + lea rdi,[128+rdi] + +DB 102,65,15,56,221,210 + pxor xmm1,xmm0 + movdqu xmm10,XMMWORD[((112-128))+rdi] +DB 102,65,15,56,221,219 + pxor xmm10,xmm0 + movdqa xmm11,XMMWORD[rsp] +DB 102,65,15,56,221,228 +DB 102,65,15,56,221,237 + movdqa xmm12,XMMWORD[16+rsp] + movdqa xmm13,XMMWORD[32+rsp] +DB 102,65,15,56,221,246 +DB 102,65,15,56,221,255 + movdqa xmm14,XMMWORD[48+rsp] + movdqa xmm15,XMMWORD[64+rsp] +DB 102,68,15,56,221,193 + movdqa xmm0,XMMWORD[80+rsp] + movups xmm1,XMMWORD[((16-128))+rcx] +DB 102,69,15,56,221,202 + + movups XMMWORD[rsi],xmm2 + movdqa xmm2,xmm11 + movups XMMWORD[16+rsi],xmm3 + movdqa xmm3,xmm12 + movups XMMWORD[32+rsi],xmm4 + movdqa xmm4,xmm13 + movups XMMWORD[48+rsi],xmm5 + movdqa xmm5,xmm14 + movups XMMWORD[64+rsi],xmm6 + movdqa xmm6,xmm15 + movups XMMWORD[80+rsi],xmm7 + movdqa xmm7,xmm0 + movups XMMWORD[96+rsi],xmm8 + movups XMMWORD[112+rsi],xmm9 + lea rsi,[128+rsi] + + sub rdx,8 + jnc NEAR $L$ctr32_loop8 + + add rdx,8 + jz NEAR $L$ctr32_done + lea rcx,[((-128))+rcx] + +$L$ctr32_tail: + + + lea rcx,[16+rcx] + cmp rdx,4 + jb NEAR $L$ctr32_loop3 + je NEAR $L$ctr32_loop4 + + + shl eax,4 + movdqa xmm8,XMMWORD[96+rsp] + pxor xmm9,xmm9 + + movups xmm0,XMMWORD[16+rcx] +DB 102,15,56,220,209 +DB 102,15,56,220,217 + lea rcx,[((32-16))+rax*1+rcx] + neg rax +DB 102,15,56,220,225 + add rax,16 + movups xmm10,XMMWORD[rdi] +DB 102,15,56,220,233 +DB 102,15,56,220,241 + movups xmm11,XMMWORD[16+rdi] + movups xmm12,XMMWORD[32+rdi] +DB 102,15,56,220,249 +DB 102,68,15,56,220,193 + + call $L$enc_loop8_enter + + movdqu xmm13,XMMWORD[48+rdi] + pxor xmm2,xmm10 + movdqu xmm10,XMMWORD[64+rdi] + pxor xmm3,xmm11 + movdqu XMMWORD[rsi],xmm2 + pxor xmm4,xmm12 + movdqu XMMWORD[16+rsi],xmm3 + pxor xmm5,xmm13 + movdqu XMMWORD[32+rsi],xmm4 + pxor xmm6,xmm10 + movdqu XMMWORD[48+rsi],xmm5 + movdqu XMMWORD[64+rsi],xmm6 + cmp rdx,6 + jb NEAR $L$ctr32_done + + movups xmm11,XMMWORD[80+rdi] + xorps xmm7,xmm11 + movups XMMWORD[80+rsi],xmm7 + je NEAR $L$ctr32_done + + movups xmm12,XMMWORD[96+rdi] + xorps xmm8,xmm12 + movups XMMWORD[96+rsi],xmm8 + jmp NEAR $L$ctr32_done + +ALIGN 32 +$L$ctr32_loop4: +DB 
102,15,56,220,209 + lea rcx,[16+rcx] + dec eax +DB 102,15,56,220,217 +DB 102,15,56,220,225 +DB 102,15,56,220,233 + movups xmm1,XMMWORD[rcx] + jnz NEAR $L$ctr32_loop4 +DB 102,15,56,221,209 +DB 102,15,56,221,217 + movups xmm10,XMMWORD[rdi] + movups xmm11,XMMWORD[16+rdi] +DB 102,15,56,221,225 +DB 102,15,56,221,233 + movups xmm12,XMMWORD[32+rdi] + movups xmm13,XMMWORD[48+rdi] + + xorps xmm2,xmm10 + movups XMMWORD[rsi],xmm2 + xorps xmm3,xmm11 + movups XMMWORD[16+rsi],xmm3 + pxor xmm4,xmm12 + movdqu XMMWORD[32+rsi],xmm4 + pxor xmm5,xmm13 + movdqu XMMWORD[48+rsi],xmm5 + jmp NEAR $L$ctr32_done + +ALIGN 32 +$L$ctr32_loop3: +DB 102,15,56,220,209 + lea rcx,[16+rcx] + dec eax +DB 102,15,56,220,217 +DB 102,15,56,220,225 + movups xmm1,XMMWORD[rcx] + jnz NEAR $L$ctr32_loop3 +DB 102,15,56,221,209 +DB 102,15,56,221,217 +DB 102,15,56,221,225 + + movups xmm10,XMMWORD[rdi] + xorps xmm2,xmm10 + movups XMMWORD[rsi],xmm2 + cmp rdx,2 + jb NEAR $L$ctr32_done + + movups xmm11,XMMWORD[16+rdi] + xorps xmm3,xmm11 + movups XMMWORD[16+rsi],xmm3 + je NEAR $L$ctr32_done + + movups xmm12,XMMWORD[32+rdi] + xorps xmm4,xmm12 + movups XMMWORD[32+rsi],xmm4 + +$L$ctr32_done: + xorps xmm0,xmm0 + xor ebp,ebp + pxor xmm1,xmm1 + pxor xmm2,xmm2 + pxor xmm3,xmm3 + pxor xmm4,xmm4 + pxor xmm5,xmm5 + movaps xmm6,XMMWORD[((-168))+r11] + movaps XMMWORD[(-168)+r11],xmm0 + movaps xmm7,XMMWORD[((-152))+r11] + movaps XMMWORD[(-152)+r11],xmm0 + movaps xmm8,XMMWORD[((-136))+r11] + movaps XMMWORD[(-136)+r11],xmm0 + movaps xmm9,XMMWORD[((-120))+r11] + movaps XMMWORD[(-120)+r11],xmm0 + movaps xmm10,XMMWORD[((-104))+r11] + movaps XMMWORD[(-104)+r11],xmm0 + movaps xmm11,XMMWORD[((-88))+r11] + movaps XMMWORD[(-88)+r11],xmm0 + movaps xmm12,XMMWORD[((-72))+r11] + movaps XMMWORD[(-72)+r11],xmm0 + movaps xmm13,XMMWORD[((-56))+r11] + movaps XMMWORD[(-56)+r11],xmm0 + movaps xmm14,XMMWORD[((-40))+r11] + movaps XMMWORD[(-40)+r11],xmm0 + movaps xmm15,XMMWORD[((-24))+r11] + movaps XMMWORD[(-24)+r11],xmm0 + movaps XMMWORD[rsp],xmm0 + movaps XMMWORD[16+rsp],xmm0 + movaps XMMWORD[32+rsp],xmm0 + movaps XMMWORD[48+rsp],xmm0 + movaps XMMWORD[64+rsp],xmm0 + movaps XMMWORD[80+rsp],xmm0 + movaps XMMWORD[96+rsp],xmm0 + movaps XMMWORD[112+rsp],xmm0 + mov rbp,QWORD[((-8))+r11] + + lea rsp,[r11] + +$L$ctr32_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret + +$L$SEH_end_aesni_ctr32_encrypt_blocks: +global aesni_set_decrypt_key + +ALIGN 16 +aesni_set_decrypt_key: + +DB 0x48,0x83,0xEC,0x08 + + call __aesni_set_encrypt_key + shl edx,4 + test eax,eax + jnz NEAR $L$dec_key_ret + lea rcx,[16+rdx*1+r8] + + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[rcx] + movups XMMWORD[rcx],xmm0 + movups XMMWORD[r8],xmm1 + lea r8,[16+r8] + lea rcx,[((-16))+rcx] + +$L$dec_key_inverse: + movups xmm0,XMMWORD[r8] + movups xmm1,XMMWORD[rcx] +DB 102,15,56,219,192 +DB 102,15,56,219,201 + lea r8,[16+r8] + lea rcx,[((-16))+rcx] + movups XMMWORD[16+rcx],xmm0 + movups XMMWORD[(-16)+r8],xmm1 + cmp rcx,r8 + ja NEAR $L$dec_key_inverse + + movups xmm0,XMMWORD[r8] +DB 102,15,56,219,192 + pxor xmm1,xmm1 + movups XMMWORD[rcx],xmm0 + pxor xmm0,xmm0 +$L$dec_key_ret: + add rsp,8 + + ret + +$L$SEH_end_set_decrypt_key: + +global aesni_set_encrypt_key + +ALIGN 16 +aesni_set_encrypt_key: +__aesni_set_encrypt_key: + +DB 0x48,0x83,0xEC,0x08 + + mov rax,-1 + test rcx,rcx + jz NEAR $L$enc_key_ret + test r8,r8 + jz NEAR $L$enc_key_ret + + movups xmm0,XMMWORD[rcx] + xorps xmm4,xmm4 +; leaq OPENSSL_ia32cap_P(%rip),%r10 +; movl 4(%r10),%r10d +; and $268437504,%r10d + lea rax,[16+r8] + cmp edx,256 
+ je NEAR $L$14rounds + cmp edx,192 + je NEAR $L$12rounds + cmp edx,128 + jne NEAR $L$bad_keybits + +$L$10rounds: + mov edx,9 +; cmp $268435456,%r10d +; je .L10rounds_alt +; jmp .L10rounds_alt +; movups %xmm0,(%r8) +; .byte 102,15,58,223,200,1 +; call .Lkey_expansion_128_cold +; .byte 102,15,58,223,200,2 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,4 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,8 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,16 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,32 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,64 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,128 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,27 +; call .Lkey_expansion_128 +; .byte 102,15,58,223,200,54 +; call .Lkey_expansion_128 +; movups %xmm0,(%rax) +; mov %edx,80(%rax) +; xor %eax,%eax +; jmp .Lenc_key_ret + +;.align 16 +;.L10rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate] + mov r10d,8 + movdqa xmm4,XMMWORD[$L$key_rcon1] + movdqa xmm2,xmm0 + movdqu XMMWORD[r8],xmm0 + jmp NEAR $L$oop_key128 + +ALIGN 16 +$L$oop_key128: + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + pslld xmm4,1 + lea rax,[16+rax] + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[(-16)+rax],xmm0 + movdqa xmm2,xmm0 + + dec r10d + jnz NEAR $L$oop_key128 + + movdqa xmm4,XMMWORD[$L$key_rcon1b] + + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + pslld xmm4,1 + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[rax],xmm0 + + movdqa xmm2,xmm0 + pshufb xmm0,xmm5 +DB 102,15,56,221,196 + + movdqa xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm3,xmm2 + pslldq xmm2,4 + pxor xmm2,xmm3 + + pxor xmm0,xmm2 + movdqu XMMWORD[16+rax],xmm0 + + mov DWORD[96+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$12rounds: + movq xmm2,QWORD[16+rcx] + mov edx,11 +; cmp $268435456,%r10d +; je .L12rounds_alt + +; movups %xmm0,(%r8) +; .byte 102,15,58,223,202,1 +; call .Lkey_expansion_192a_cold +; .byte 102,15,58,223,202,2 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,4 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,8 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,16 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,32 +; call .Lkey_expansion_192b +; .byte 102,15,58,223,202,64 +; call .Lkey_expansion_192a +; .byte 102,15,58,223,202,128 +; call .Lkey_expansion_192b +; movups %xmm0,(%rax) +; mov %edx,48(%rax) +; xor %rax, %rax +; jmp .Lenc_key_ret + +;.align 16 +;.L12rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate192] + movdqa xmm4,XMMWORD[$L$key_rcon1] + mov r10d,8 + movdqu XMMWORD[r8],xmm0 + jmp NEAR $L$oop_key192 + +ALIGN 16 +$L$oop_key192: + movq QWORD[rax],xmm2 + movdqa xmm1,xmm2 + pshufb xmm2,xmm5 +DB 102,15,56,221,212 + pslld xmm4,1 + lea rax,[24+rax] + + movdqa xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm0,xmm3 + + pshufd xmm3,xmm0,0xff + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + + pxor xmm0,xmm2 + pxor xmm2,xmm3 + movdqu XMMWORD[(-16)+rax],xmm0 + + dec r10d + jnz NEAR $L$oop_key192 + + mov DWORD[32+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$14rounds: + movups xmm2,XMMWORD[16+rcx] + mov edx,13 + lea rax,[16+rax] +; cmp $268435456,%r10d +; je .L14rounds_alt +; +; movups %xmm0,(%r8) +; movups %xmm2,16(%r8) +; .byte 102,15,58,223,202,1 +; call 
.Lkey_expansion_256a_cold +; .byte 102,15,58,223,200,1 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,2 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,2 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,4 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,4 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,8 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,8 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,16 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,16 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,32 +; call .Lkey_expansion_256a +; .byte 102,15,58,223,200,32 +; call .Lkey_expansion_256b +; .byte 102,15,58,223,202,64 +; call .Lkey_expansion_256a +; movups %xmm0,(%rax) +; mov %edx,16(%rax) +; xor %rax,%rax +; jmp .Lenc_key_ret + +;.align 16 +;.L14rounds_alt: + movdqa xmm5,XMMWORD[$L$key_rotate] + movdqa xmm4,XMMWORD[$L$key_rcon1] + mov r10d,7 + movdqu XMMWORD[r8],xmm0 + movdqa xmm1,xmm2 + movdqu XMMWORD[16+r8],xmm2 + jmp NEAR $L$oop_key256 + +ALIGN 16 +$L$oop_key256: + pshufb xmm2,xmm5 +DB 102,15,56,221,212 + + movdqa xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm3,xmm0 + pslldq xmm0,4 + pxor xmm0,xmm3 + pslld xmm4,1 + + pxor xmm0,xmm2 + movdqu XMMWORD[rax],xmm0 + + dec r10d + jz NEAR $L$done_key256 + + pshufd xmm2,xmm0,0xff + pxor xmm3,xmm3 +DB 102,15,56,221,211 + + movdqa xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm3,xmm1 + pslldq xmm1,4 + pxor xmm1,xmm3 + + pxor xmm2,xmm1 + movdqu XMMWORD[16+rax],xmm2 + lea rax,[32+rax] + movdqa xmm1,xmm2 + + jmp NEAR $L$oop_key256 + +$L$done_key256: + mov DWORD[16+rax],edx + xor eax,eax + jmp NEAR $L$enc_key_ret + +ALIGN 16 +$L$bad_keybits: + mov rax,-2 +$L$enc_key_ret: + pxor xmm0,xmm0 + pxor xmm1,xmm1 + pxor xmm2,xmm2 + pxor xmm3,xmm3 + pxor xmm4,xmm4 + pxor xmm5,xmm5 + add rsp,8 + + ret + +$L$SEH_end_set_encrypt_key: + +;.align 16 +;.Lkey_expansion_128: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_128_cold: +; shufps $0b00010000,%xmm0,%xmm4 +; xorps %xmm4, %xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; xorps %xmm4, %xmm0 +; shufps $0b11111111,%xmm1,%xmm1 +; xorps %xmm1,%xmm0 +; ret + +;.align 16 +;.Lkey_expansion_192a: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_192a_cold: +; movaps %xmm2, %xmm5 +;.Lkey_expansion_192b_warm: +; shufps $0b00010000,%xmm0,%xmm4 +; movdqa %xmm2,%xmm3 +; xorps %xmm4,%xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; pslldq $4,%xmm3 +; xorps %xmm4,%xmm0 +; pshufd $0b01010101,%xmm1,%xmm1 +; pxor %xmm3,%xmm2 +; pxor %xmm1,%xmm0 +; pshufd $0b11111111,%xmm0,%xmm3 +; pxor %xmm3,%xmm2 +; ret +; +;.align 16 +;.Lkey_expansion_192b: +; movaps %xmm0,%xmm3 +; shufps $0b01000100,%xmm0,%xmm5 +; movups %xmm5,(%rax) +; shufps $0b01001110,%xmm2,%xmm3 +; movups %xmm3,16(%rax) +; lea 32(%rax),%rax +; jmp .Lkey_expansion_192b_warm +; +;.align 16 +;.Lkey_expansion_256a: +; movups %xmm2,(%rax) +; lea 16(%rax),%rax +;.Lkey_expansion_256a_cold: +; shufps $0b00010000,%xmm0,%xmm4 +; xorps %xmm4,%xmm0 +; shufps $0b10001100,%xmm0,%xmm4 +; xorps %xmm4,%xmm0 +; shufps $0b11111111,%xmm1,%xmm1 +; xorps %xmm1,%xmm0 +; ret +; +;.align 16 +;.Lkey_expansion_256b: +; movups %xmm0,(%rax) +; lea 16(%rax),%rax +; +; shufps $0b00010000,%xmm2,%xmm4 +; xorps %xmm4,%xmm2 +; shufps $0b10001100,%xmm2,%xmm4 +; xorps %xmm4,%xmm2 +; shufps $0b10101010,%xmm1,%xmm1 +; xorps %xmm1,%xmm2 +; ret + + +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$increment32: + DD 6,6,6,0 +$L$increment64: + DD 1,0,0,0 
+$L$xts_magic: + DD 0x87,0,1,0 +$L$increment1: +DB 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +$L$key_rotate: + DD 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d +$L$key_rotate192: + DD 0x04070605,0x04070605,0x04070605,0x04070605 +$L$key_rcon1: + DD 1,1,1,1 +$L$key_rcon1b: + DD 0x1b,0x1b,0x1b,0x1b + +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +ctr_xts_se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[208+r8] + + lea rsi,[((-168))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + + mov rbp,QWORD[((-8))+rax] + mov QWORD[160+r8],rbp + jmp NEAR $L$common_seh_tail + + +ALIGN 16 +cbc_se_handler: +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + +section .pdata rdata align=4 +ALIGN 4 + + + + + + + + + + + + + DD $L$SEH_begin_aesni_ctr32_encrypt_blocks wrt ..imagebase + DD $L$SEH_end_aesni_ctr32_encrypt_blocks wrt ..imagebase + DD $L$SEH_info_ctr32 wrt ..imagebase + + + + + + + + + + + + + + + + + + + + + DD aesni_set_decrypt_key wrt ..imagebase + DD $L$SEH_end_set_decrypt_key wrt ..imagebase + DD $L$SEH_info_key wrt ..imagebase + + DD aesni_set_encrypt_key wrt ..imagebase + DD $L$SEH_end_set_encrypt_key wrt ..imagebase + DD $L$SEH_info_key wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +;.LSEH_info_ecb: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lecb_enc_body,.Lecb_enc_ret +;.LSEH_info_ccm64_enc: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lccm64_enc_body,.Lccm64_enc_ret +;.LSEH_info_ccm64_dec: +; .byte 9,0,0,0 +; .rva ecb_ccm64_se_handler +; .rva .Lccm64_dec_body,.Lccm64_dec_ret +$L$SEH_info_ctr32: +DB 9,0,0,0 + DD ctr_xts_se_handler wrt ..imagebase + DD $L$ctr32_body wrt ..imagebase,$L$ctr32_epilogue wrt ..imagebase +;.LSEH_info_xts_enc: +; .byte 9,0,0,0 +; .rva ctr_xts_se_handler +; .rva .Lxts_enc_body,.Lxts_enc_epilogue +;.LSEH_info_xts_dec: +; .byte 9,0,0,0 +; .rva ctr_xts_se_handler +; .rva .Lxts_dec_body,.Lxts_dec_epilogue +;.LSEH_info_ocb_enc: +; .byte 9,0,0,0 +; .rva ocb_se_handler +; .rva .Locb_enc_body,.Locb_enc_epilogue +; .rva .Locb_enc_pop +; .long 0 +;.LSEH_info_ocb_dec: +; .byte 9,0,0,0 +; .rva ocb_se_handler +; .rva .Locb_dec_body,.Locb_dec_epilogue +; .rva .Locb_dec_pop +; .long 0 +;.LSEH_info_cbc: +; .byte 9,0,0,0 +; .rva cbc_se_handler +$L$SEH_info_key: +DB 0x01,0x04,0x01,0x00 +DB 0x04,0x02,0x00,0x00 diff --git a/crypto/aesgcm/ghash-x86.pl b/crypto/aesgcm/ghash-x86.pl new file mode 100644 index 0000000..02edf03 --- /dev/null +++ b/crypto/aesgcm/ghash-x86.pl @@ -0,0 +1,1176 @@ +#! /usr/bin/env perl +# Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved. 
+# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# March, May, June 2010 +# +# The module implements "4-bit" GCM GHASH function and underlying +# single multiplication operation in GF(2^128). "4-bit" means that it +# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two +# code paths: vanilla x86 and vanilla SSE. Former will be executed on +# 486 and Pentium, latter on all others. SSE GHASH features so called +# "528B" variant of "4-bit" method utilizing additional 256+16 bytes +# of per-key storage [+512 bytes shared table]. Performance results +# are for streamed GHASH subroutine and are expressed in cycles per +# processed byte, less is better: +# +# gcc 2.95.3(*) SSE assembler x86 assembler +# +# Pentium 105/111(**) - 50 +# PIII 68 /75 12.2 24 +# P4 125/125 17.8 84(***) +# Opteron 66 /70 10.1 30 +# Core2 54 /67 8.4 18 +# Atom 105/105 16.8 53 +# VIA Nano 69 /71 13.0 27 +# +# (*) gcc 3.4.x was observed to generate few percent slower code, +# which is one of reasons why 2.95.3 results were chosen, +# another reason is lack of 3.4.x results for older CPUs; +# comparison with SSE results is not completely fair, because C +# results are for vanilla "256B" implementation, while +# assembler results are for "528B";-) +# (**) second number is result for code compiled with -fPIC flag, +# which is actually more relevant, because assembler code is +# position-independent; +# (***) see comment in non-MMX routine for further details; +# +# To summarize, it's >2-5 times faster than gcc-generated code. To +# anchor it to something else SHA1 assembler processes one byte in +# ~7 cycles on contemporary x86 cores. As for choice of MMX/SSE +# in particular, see comment at the end of the file... + +# May 2010 +# +# Add PCLMULQDQ version performing at 2.10 cycles per processed byte. +# The question is how close is it to theoretical limit? The pclmulqdq +# instruction latency appears to be 14 cycles and there can't be more +# than 2 of them executing at any given time. This means that single +# Karatsuba multiplication would take 28 cycles *plus* few cycles for +# pre- and post-processing. Then multiplication has to be followed by +# modulo-reduction. Given that aggregated reduction method [see +# "Carry-less Multiplication and Its Usage for Computing the GCM Mode" +# white paper by Intel] allows you to perform reduction only once in +# a while we can assume that asymptotic performance can be estimated +# as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction +# and Naggr is the aggregation factor. +# +# Before we proceed to this implementation let's have closer look at +# the best-performing code suggested by Intel in their white paper. +# By tracing inter-register dependencies Tmod is estimated as ~19 +# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per +# processed byte. 
As implied, this is quite an optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte out of a 16KB buffer. Note that
+# the result even accounts for pre-computing the powers of the hash
+# key H, but their portion is negligible at a 16KB buffer size.
+#
+# Moving on to the implementation in question. Tmod is estimated as
+# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
+# 2.16. How is it possible that measured performance is better than
+# the optimistic theoretical estimate? There is one thing Intel failed
+# to recognize. By serializing GHASH with CTR in the same subroutine,
+# the former's performance is limited by the (Tmul + Tmod/Naggr)
+# equation above. But if the GHASH procedure is detached, the
+# modulo-reduction can be interleaved with Naggr-1 multiplications at
+# the instruction level and, under ideal conditions, even disappear
+# from the equation. So the optimistic theoretical estimate for this
+# implementation is ... 28/16=1.75, and not 2.16. Well, that's probably
+# way too optimistic, at least for such a small Naggr. I'd argue that
+# (28+Tproc/Naggr)/16, where Tproc is the time required for Karatsuba
+# pre- and post-processing, is a more realistic estimate. In this case
+# it gives ... 1.91 cycles. Or in other words, depending on how well we
+# can interleave reduction and one of the two multiplications, the
+# performance should be between 1.91 and 2.16. As already mentioned,
+# this implementation processes one byte out of an 8KB buffer in 2.10
+# cycles, while the x86_64 counterpart does so in 2.02. x86_64
+# performance is better because the larger register bank allows
+# reduction and multiplication to be interleaved better.
+#
+# Does it make sense to increase Naggr? To start with, it's virtually
+# impossible in 32-bit mode, because of the limited register bank
+# capacity. Otherwise, the improvement has to be weighed against slower
+# setup, as well as increased code size and complexity. As even the
+# optimistic estimate doesn't promise a 30% performance improvement,
+# there are currently no plans to increase Naggr.
+#
+# Special thanks to David Woodhouse for providing access to a
+# Westmere-based system on behalf of Intel Open Source Technology Centre.
+
+# January 2010
+#
+# Tweaked to optimize transitions between integer and FP operations
+# on the same XMM register. The PCLMULQDQ subroutine was measured to
+# process one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on
+# Westmere. The minor regression on Westmere is outweighed by ~15%
+# improvement on Sandy Bridge. Strangely enough, an attempt to modify
+# the 64-bit code in a similar manner resulted in almost 20%
+# degradation on Sandy Bridge, where the original 64-bit code
+# processes one byte in 1.95 cycles.
+
+#####################################################################
+# For reference, AMD Bulldozer processes one byte in 1.98 cycles in
+# 32-bit mode and 1.89 in 64-bit.
+
+# February 2013
+#
+# Overhaul: aggregate Karatsuba post-processing, improve ILP in
+# reduction_alg9. Resulting performance is 1.96 cycles per byte on
+# Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.
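Plugging the comment's own figures back into its formula makes the three estimates concrete (Tmul = 28 cycles for a single Karatsuba multiplication, Tmod ~ 13, Tproc ~ 5, Naggr = 2, 16 bytes per block):

    serialized with CTR:   (28 + 13/2)/16 = 34.5/16 ~ 2.16 cycles/byte
    ideal interleaving:     28/16                   = 1.75 cycles/byte
    with Karatsuba cost:   (28 + 5/2)/16  = 30.5/16 ~ 1.91 cycles/byte
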
+ +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../../perlasm"); +require "x86asm.pl"; + +$output=pop; +open STDOUT,">$output"; + +&asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386"); + +$sse2=0; +for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } + +($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx"); +$inp = "edi"; +$Htbl = "esi"; + +$unroll = 0; # Affects x86 loop. Folded loop performs ~7% worse + # than unrolled, which has to be weighted against + # 2.5x x86-specific code size reduction. + +sub x86_loop { + my $off = shift; + my $rem = "eax"; + + &mov ($Zhh,&DWP(4,$Htbl,$Zll)); + &mov ($Zhl,&DWP(0,$Htbl,$Zll)); + &mov ($Zlh,&DWP(12,$Htbl,$Zll)); + &mov ($Zll,&DWP(8,$Htbl,$Zll)); + &xor ($rem,$rem); # avoid partial register stalls on PIII + + # shrd practically kills P4, 2.5x deterioration, but P4 has + # MMX code-path to execute. shrd runs tad faster [than twice + # the shifts, move's and or's] on pre-MMX Pentium (as well as + # PIII and Core2), *but* minimizes code size, spares register + # and thus allows to fold the loop... + if (!$unroll) { + my $cnt = $inp; + &mov ($cnt,15); + &jmp (&label("x86_loop")); + &set_label("x86_loop",16); + for($i=1;$i<=2;$i++) { + &mov (&LB($rem),&LB($Zll)); + &shrd ($Zll,$Zlh,4); + &and (&LB($rem),0xf); + &shrd ($Zlh,$Zhl,4); + &shrd ($Zhl,$Zhh,4); + &shr ($Zhh,4); + &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); + + &mov (&LB($rem),&BP($off,"esp",$cnt)); + if ($i&1) { + &and (&LB($rem),0xf0); + } else { + &shl (&LB($rem),4); + } + + &xor ($Zll,&DWP(8,$Htbl,$rem)); + &xor ($Zlh,&DWP(12,$Htbl,$rem)); + &xor ($Zhl,&DWP(0,$Htbl,$rem)); + &xor ($Zhh,&DWP(4,$Htbl,$rem)); + + if ($i&1) { + &dec ($cnt); + &js (&label("x86_break")); + } else { + &jmp (&label("x86_loop")); + } + } + &set_label("x86_break",16); + } else { + for($i=1;$i<32;$i++) { + &comment($i); + &mov (&LB($rem),&LB($Zll)); + &shrd ($Zll,$Zlh,4); + &and (&LB($rem),0xf); + &shrd ($Zlh,$Zhl,4); + &shrd ($Zhl,$Zhh,4); + &shr ($Zhh,4); + &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); + + if ($i&1) { + &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); + &and (&LB($rem),0xf0); + } else { + &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); + &shl (&LB($rem),4); + } + + &xor ($Zll,&DWP(8,$Htbl,$rem)); + &xor ($Zlh,&DWP(12,$Htbl,$rem)); + &xor ($Zhl,&DWP(0,$Htbl,$rem)); + &xor ($Zhh,&DWP(4,$Htbl,$rem)); + } + } + &bswap ($Zll); + &bswap ($Zlh); + &bswap ($Zhl); + if (!$x86only) { + &bswap ($Zhh); + } else { + &mov ("eax",$Zhh); + &bswap ("eax"); + &mov ($Zhh,"eax"); + } +} + +if ($unroll) { + &function_begin_B("_x86_gmult_4bit_inner"); + &x86_loop(4); + &ret (); + &function_end_B("_x86_gmult_4bit_inner"); +} + +sub deposit_rem_4bit { + my $bias = shift; + + &mov (&DWP($bias+0, "esp"),0x0000<<16); + &mov (&DWP($bias+4, "esp"),0x1C20<<16); + &mov (&DWP($bias+8, "esp"),0x3840<<16); + &mov (&DWP($bias+12,"esp"),0x2460<<16); + &mov (&DWP($bias+16,"esp"),0x7080<<16); + &mov (&DWP($bias+20,"esp"),0x6CA0<<16); + &mov (&DWP($bias+24,"esp"),0x48C0<<16); + &mov (&DWP($bias+28,"esp"),0x54E0<<16); + &mov (&DWP($bias+32,"esp"),0xE100<<16); + &mov (&DWP($bias+36,"esp"),0xFD20<<16); + &mov (&DWP($bias+40,"esp"),0xD940<<16); + &mov (&DWP($bias+44,"esp"),0xC560<<16); + &mov (&DWP($bias+48,"esp"),0x9180<<16); + &mov (&DWP($bias+52,"esp"),0x8DA0<<16); + &mov (&DWP($bias+56,"esp"),0xA9C0<<16); + &mov (&DWP($bias+60,"esp"),0xB5E0<<16); +} + +if (!$x86only) {{{ + +&static_label("rem_4bit"); + +if (!$sse2) {{ # pure-MMX "May" version... + + # This code was removed since SSE2 is required for BoringSSL. 
The + # outer structure of the code was retained to minimize future merge + # conflicts. + +}} else {{ # "June" MMX version... + # ... has slower "April" gcm_gmult_4bit_mmx with folded + # loop. This is done to conserve code size... +$S=16; # shift factor for rem_4bit + +sub mmx_loop() { +# MMX version performs 2.8 times better on P4 (see comment in non-MMX +# routine for further details), 40% better on Opteron and Core2, 50% +# better on PIII... In other words effort is considered to be well +# spent... + my $inp = shift; + my $rem_4bit = shift; + my $cnt = $Zhh; + my $nhi = $Zhl; + my $nlo = $Zlh; + my $rem = $Zll; + + my ($Zlo,$Zhi) = ("mm0","mm1"); + my $tmp = "mm2"; + + &xor ($nlo,$nlo); # avoid partial register stalls on PIII + &mov ($nhi,$Zll); + &mov (&LB($nlo),&LB($nhi)); + &mov ($cnt,14); + &shl (&LB($nlo),4); + &and ($nhi,0xf0); + &movq ($Zlo,&QWP(8,$Htbl,$nlo)); + &movq ($Zhi,&QWP(0,$Htbl,$nlo)); + &movd ($rem,$Zlo); + &jmp (&label("mmx_loop")); + + &set_label("mmx_loop",16); + &psrlq ($Zlo,4); + &and ($rem,0xf); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); + &mov (&LB($nlo),&BP(0,$inp,$cnt)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &dec ($cnt); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); + &mov ($nhi,$nlo); + &pxor ($Zlo,$tmp); + &js (&label("mmx_break")); + + &shl (&LB($nlo),4); + &and ($rem,0xf); + &psrlq ($Zlo,4); + &and ($nhi,0xf0); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); + &pxor ($Zlo,$tmp); + &jmp (&label("mmx_loop")); + + &set_label("mmx_break",16); + &shl (&LB($nlo),4); + &and ($rem,0xf); + &psrlq ($Zlo,4); + &and ($nhi,0xf0); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); + &pxor ($Zlo,$tmp); + + &psrlq ($Zlo,4); + &and ($rem,0xf); + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); + &psllq ($tmp,60); + &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); + &movd ($rem,$Zlo); + &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); + &pxor ($Zlo,$tmp); + + &psrlq ($Zlo,32); # lower part of Zlo is already there + &movd ($Zhl,$Zhi); + &psrlq ($Zhi,32); + &movd ($Zlh,$Zlo); + &movd ($Zhh,$Zhi); + + &bswap ($Zll); + &bswap ($Zhl); + &bswap ($Zlh); + &bswap ($Zhh); +} + +&function_begin("gcm_gmult_4bit_mmx"); + &mov ($inp,&wparam(0)); # load Xi + &mov ($Htbl,&wparam(1)); # load Htable + + &call (&label("pic_point")); + &set_label("pic_point"); + &blindpop("eax"); + &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); + + &movz ($Zll,&BP(15,$inp)); + + &mmx_loop($inp,"eax"); + + &emms (); + &mov (&DWP(12,$inp),$Zll); + &mov (&DWP(4,$inp),$Zhl); + &mov (&DWP(8,$inp),$Zlh); + &mov (&DWP(0,$inp),$Zhh); +&function_end("gcm_gmult_4bit_mmx"); + +###################################################################### +# Below subroutine is "528B" variant of "4-bit" GCM GHASH function +# (see gcm128.c for details). It provides further 20-40% performance +# improvement over above mentioned "May" version. 
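As a fixed point of reference for the table-driven routines above (and the PCLMULQDQ path below), here is a minimal bit-at-a-time sketch of the GF(2^128) multiply that GHASH is built on, with GCM's reflected bit order and the 0xE1 reduction constant; the 4-bit code computes the same product a nibble at a time from its 256-byte per-key table. This is an illustrative reference, not part of this source:

    #include <stdint.h>
    #include <string.h>

    /* Z = X * Y in GCM's GF(2^128). Bits are numbered MSB-first within
     * bytes, per the GCM spec; reduction uses x^128 + x^7 + x^2 + x + 1,
     * which appears here as the byte constant 0xE1. */
    static void gf128_mul_ref(uint8_t Z[16], const uint8_t X[16], const uint8_t Y[16])
    {
        uint8_t V[16], R[16] = {0};
        memcpy(V, Y, 16);
        for (int i = 0; i < 128; i++) {
            if ((X[i >> 3] >> (7 - (i & 7))) & 1)     /* bit i of X, MSB-first */
                for (int j = 0; j < 16; j++)
                    R[j] ^= V[j];
            int carry = V[15] & 1;                    /* bit falling off ... */
            for (int j = 15; j > 0; j--)              /* ... during V >>= 1  */
                V[j] = (uint8_t)((V[j] >> 1) | (V[j - 1] << 7));
            V[0] >>= 1;
            if (carry)
                V[0] ^= 0xE1;                         /* modular reduction */
        }
        memcpy(Z, R, 16);
    }
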
+ +&static_label("rem_8bit"); + +&function_begin("gcm_ghash_4bit_mmx"); +{ my ($Zlo,$Zhi) = ("mm7","mm6"); + my $rem_8bit = "esi"; + my $Htbl = "ebx"; + + # parameter block + &mov ("eax",&wparam(0)); # Xi + &mov ("ebx",&wparam(1)); # Htable + &mov ("ecx",&wparam(2)); # inp + &mov ("edx",&wparam(3)); # len + &mov ("ebp","esp"); # original %esp + &call (&label("pic_point")); + &set_label ("pic_point"); + &blindpop ($rem_8bit); + &lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit)); + + &sub ("esp",512+16+16); # allocate stack frame... + &and ("esp",-64); # ...and align it + &sub ("esp",16); # place for (u8)(H[]<<4) + + &add ("edx","ecx"); # pointer to the end of input + &mov (&DWP(528+16+0,"esp"),"eax"); # save Xi + &mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len + &mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp + + { my @lo = ("mm0","mm1","mm2"); + my @hi = ("mm3","mm4","mm5"); + my @tmp = ("mm6","mm7"); + my ($off1,$off2,$i) = (0,0,); + + &add ($Htbl,128); # optimize for size + &lea ("edi",&DWP(16+128,"esp")); + &lea ("ebp",&DWP(16+256+128,"esp")); + + # decompose Htable (low and high parts are kept separately), + # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack... + for ($i=0;$i<18;$i++) { + + &mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16); + &movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16); + &psllq ($tmp[1],60) if ($i>1); + &movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16); + &por ($lo[2],$tmp[1]) if ($i>1); + &movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17); + &psrlq ($lo[1],4) if ($i>0 && $i<17); + &movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17); + &movq ($tmp[0],$hi[1]) if ($i>0 && $i<17); + &movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1); + &psrlq ($hi[1],4) if ($i>0 && $i<17); + &movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1); + &shl ("edx",4) if ($i<16); + &mov (&BP($i,"esp"),&LB("edx")) if ($i<16); + + unshift (@lo,pop(@lo)); # "rotate" registers + unshift (@hi,pop(@hi)); + unshift (@tmp,pop(@tmp)); + $off1 += 8 if ($i>0); + $off2 += 8 if ($i>1); + } + } + + &movq ($Zhi,&QWP(0,"eax")); + &mov ("ebx",&DWP(8,"eax")); + &mov ("edx",&DWP(12,"eax")); # load Xi + +&set_label("outer",16); + { my $nlo = "eax"; + my $dat = "edx"; + my @nhi = ("edi","ebp"); + my @rem = ("ebx","ecx"); + my @red = ("mm0","mm1","mm2"); + my $tmp = "mm3"; + + &xor ($dat,&DWP(12,"ecx")); # merge input data + &xor ("ebx",&DWP(8,"ecx")); + &pxor ($Zhi,&QWP(0,"ecx")); + &lea ("ecx",&DWP(16,"ecx")); # inp+=16 + #&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi + &mov (&DWP(528+8,"esp"),"ebx"); + &movq (&QWP(528+0,"esp"),$Zhi); + &mov (&DWP(528+16+4,"esp"),"ecx"); # save inp + + &xor ($nlo,$nlo); + &rol ($dat,8); + &mov (&LB($nlo),&LB($dat)); + &mov ($nhi[1],$nlo); + &and (&LB($nlo),0x0f); + &shr ($nhi[1],4); + &pxor ($red[0],$red[0]); + &rol ($dat,8); # next byte + &pxor ($red[1],$red[1]); + &pxor ($red[2],$red[2]); + + # Just like in "May" version modulo-schedule for critical path in + # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor' + # is scheduled so late that rem_8bit[] has to be shifted *right* + # by 16, which is why last argument to pinsrw is 2, which + # corresponds to <<32=<<48>>16... 
+ for ($j=11,$i=0;$i<15;$i++) { + + if ($i>0) { + &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] + &rol ($dat,8); # next byte + &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); + &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) + } else { + &movq ($Zlo,&QWP(16,"esp",$nlo,8)); + &movq ($Zhi,&QWP(16+128,"esp",$nlo,8)); + } + + &mov (&LB($nlo),&LB($dat)); + &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0); + + &movd ($rem[0],$Zlo); + &movz ($rem[1],&LB($rem[1])) if ($i>0); + &psrlq ($Zlo,8); # Z>>=8 + + &movq ($tmp,$Zhi); + &mov ($nhi[0],$nlo); + &psrlq ($Zhi,8); + + &pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4 + &and (&LB($nlo),0x0f); + &psllq ($tmp,56); + + &pxor ($Zhi,$red[1]) if ($i>1); + &shr ($nhi[0],4); + &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0); + + unshift (@red,pop(@red)); # "rotate" registers + unshift (@rem,pop(@rem)); + unshift (@nhi,pop(@nhi)); + } + + &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] + &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); + &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); + &movz ($rem[1],&LB($rem[1])); + + &pxor ($red[2],$red[2]); # clear 2nd word + &psllq ($red[1],4); + + &movd ($rem[0],$Zlo); + &psrlq ($Zlo,4); # Z>>=4 + + &movq ($tmp,$Zhi); + &psrlq ($Zhi,4); + &shl ($rem[0],4); # rem<<4 + + &pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi] + &psllq ($tmp,60); + &movz ($rem[0],&LB($rem[0])); + + &pxor ($Zlo,$tmp); + &pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8)); + + &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2); + &pxor ($Zhi,$red[1]); + + &movd ($dat,$Zlo); + &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48 + + &psllq ($red[0],12); # correct by <<16>>4 + &pxor ($Zhi,$red[0]); + &psrlq ($Zlo,32); + &pxor ($Zhi,$red[2]); + + &mov ("ecx",&DWP(528+16+4,"esp")); # restore inp + &movd ("ebx",$Zlo); + &movq ($tmp,$Zhi); # 01234567 + &psllw ($Zhi,8); # 1.3.5.7. + &psrlw ($tmp,8); # .0.2.4.6 + &por ($Zhi,$tmp); # 10325476 + &bswap ($dat); + &pshufw ($Zhi,$Zhi,0b00011011); # 76543210 + &bswap ("ebx"); + + &cmp ("ecx",&DWP(528+16+8,"esp")); # are we done? + &jne (&label("outer")); + } + + &mov ("eax",&DWP(528+16+0,"esp")); # restore Xi + &mov (&DWP(12,"eax"),"edx"); + &mov (&DWP(8,"eax"),"ebx"); + &movq (&QWP(0,"eax"),$Zhi); + + &mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp + &emms (); +} +&function_end("gcm_ghash_4bit_mmx"); +}} + +if ($sse2) {{ +###################################################################### +# PCLMULQDQ version. + +$Xip="eax"; +$Htbl="edx"; +$const="ecx"; +$inp="esi"; +$len="ebx"; + +($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2"; +($T1,$T2,$T3)=("xmm3","xmm4","xmm5"); +($Xn,$Xhn)=("xmm6","xmm7"); + +&static_label("bswap"); + +sub clmul64x64_T2 { # minimal "register" pressure +my ($Xhi,$Xi,$Hkey,$HK)=@_; + + &movdqa ($Xhi,$Xi); # + &pshufd ($T1,$Xi,0b01001110); + &pshufd ($T2,$Hkey,0b01001110) if (!defined($HK)); + &pxor ($T1,$Xi); # + &pxor ($T2,$Hkey) if (!defined($HK)); + $HK=$T2 if (!defined($HK)); + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T1,$HK,0x00); ####### + &xorps ($T1,$Xi); # + &xorps ($T1,$Xhi); # + + &movdqa ($T2,$T1); # + &psrldq ($T1,8); + &pslldq ($T2,8); # + &pxor ($Xhi,$T1); + &pxor ($Xi,$T2); # +} + +sub clmul64x64_T3 { +# Even though this subroutine offers visually better ILP, it +# was empirically found to be a tad slower than above version. 
+# At least in gcm_ghash_clmul context. But it's just as well, +# because loop modulo-scheduling is possible only thanks to +# minimized "register" pressure... +my ($Xhi,$Xi,$Hkey)=@_; + + &movdqa ($T1,$Xi); # + &movdqa ($Xhi,$Xi); + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pshufd ($T2,$T1,0b01001110); # + &pshufd ($T3,$Hkey,0b01001110); + &pxor ($T2,$T1); # + &pxor ($T3,$Hkey); + &pclmulqdq ($T2,$T3,0x00); ####### + &pxor ($T2,$Xi); # + &pxor ($T2,$Xhi); # + + &movdqa ($T3,$T2); # + &psrldq ($T2,8); + &pslldq ($T3,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T3); # +} + +if (1) { # Algorithm 9 with <<1 twist. + # Reduction is shorter and uses only two + # temporary registers, which makes it better + # candidate for interleaving with 64x64 + # multiplication. Pre-modulo-scheduled loop + # was found to be ~20% faster than Algorithm 5 + # below. Algorithm 9 was therefore chosen for + # further optimization... + +sub reduction_alg9 { # 17/11 times faster than Intel version +my ($Xhi,$Xi) = @_; + + # 1st phase + &movdqa ($T2,$Xi); # + &movdqa ($T1,$Xi); + &psllq ($Xi,5); + &pxor ($T1,$Xi); # + &psllq ($Xi,1); + &pxor ($Xi,$T1); # + &psllq ($Xi,57); # + &movdqa ($T1,$Xi); # + &pslldq ($Xi,8); + &psrldq ($T1,8); # + &pxor ($Xi,$T2); + &pxor ($Xhi,$T1); # + + # 2nd phase + &movdqa ($T2,$Xi); + &psrlq ($Xi,1); + &pxor ($Xhi,$T2); # + &pxor ($T2,$Xi); + &psrlq ($Xi,5); + &pxor ($Xi,$T2); # + &psrlq ($Xi,1); # + &pxor ($Xi,$Xhi) # +} + +&function_begin_B("gcm_init_clmul"); + &mov ($Htbl,&wparam(0)); + &mov ($Xip,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Hkey,&QWP(0,$Xip)); + &pshufd ($Hkey,$Hkey,0b01001110);# dword swap + + # <<1 twist + &pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword + &movdqa ($T1,$Hkey); + &psllq ($Hkey,1); + &pxor ($T3,$T3); # + &psrlq ($T1,63); + &pcmpgtd ($T3,$T2); # broadcast carry bit + &pslldq ($T1,8); + &por ($Hkey,$T1); # H<<=1 + + # magic reduction + &pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial + &pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial + + # calculate H^2 + &movdqa ($Xi,$Hkey); + &clmul64x64_T2 ($Xhi,$Xi,$Hkey); + &reduction_alg9 ($Xhi,$Xi); + + &pshufd ($T1,$Hkey,0b01001110); + &pshufd ($T2,$Xi,0b01001110); + &pxor ($T1,$Hkey); # Karatsuba pre-processing + &movdqu (&QWP(0,$Htbl),$Hkey); # save H + &pxor ($T2,$Xi); # Karatsuba pre-processing + &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 + &palignr ($T2,$T1,8); # low part is H.lo^H.hi + &movdqu (&QWP(32,$Htbl),$T2); # save Karatsuba "salt" + + &ret (); +&function_end_B("gcm_init_clmul"); + +&function_begin_B("gcm_gmult_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($T3,&QWP(0,$const)); + &movups ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + &movups ($T2,&QWP(32,$Htbl)); + + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2); + &reduction_alg9 ($Xhi,$Xi); + + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); + + &ret (); +&function_end_B("gcm_gmult_clmul"); + +&function_begin("gcm_ghash_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + &mov ($inp,&wparam(2)); + &mov ($len,&wparam(3)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa 
($T3,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + + &sub ($len,0x10); + &jz (&label("odd_tail")); + + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &movdqu ($T3,&QWP(32,$Htbl)); + &pxor ($Xi,$T1); # Ii+Xi + + &pshufd ($T1,$Xn,0b01001110); # H*Ii+1 + &movdqa ($Xhn,$Xn); + &pxor ($T1,$Xn); # + &lea ($inp,&DWP(32,$inp)); # i+=2 + + &pclmulqdq ($Xn,$Hkey,0x00); ####### + &pclmulqdq ($Xhn,$Hkey,0x11); ####### + &pclmulqdq ($T1,$T3,0x00); ####### + &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 + &nop (); + + &sub ($len,0x20); + &jbe (&label("even_tail")); + &jmp (&label("mod_loop")); + +&set_label("mod_loop",32); + &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) + &movdqa ($Xhi,$Xi); + &pxor ($T2,$Xi); # + &nop (); + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T2,$T3,0x10); ####### + &movups ($Hkey,&QWP(0,$Htbl)); # load H + + &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &movdqa ($T3,&QWP(0,$const)); + &xorps ($Xhi,$Xhn); + &movdqu ($Xhn,&QWP(0,$inp)); # Ii + &pxor ($T1,$Xi); # aggregated Karatsuba post-processing + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pxor ($T1,$Xhi); # + + &pshufb ($Xhn,$T3); + &pxor ($T2,$T1); # + + &movdqa ($T1,$T2); # + &psrldq ($T2,8); + &pslldq ($T1,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T1); # + &pshufb ($Xn,$T3); + &pxor ($Xhi,$Xhn); # "Ii+Xi", consume early + + &movdqa ($Xhn,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1 + &movdqa ($T2,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase + &movdqa ($T1,$Xi); + &psllq ($Xi,5); + &pxor ($T1,$Xi); # + &psllq ($Xi,1); + &pxor ($Xi,$T1); # + &pclmulqdq ($Xn,$Hkey,0x00); ####### + &movups ($T3,&QWP(32,$Htbl)); + &psllq ($Xi,57); # + &movdqa ($T1,$Xi); # + &pslldq ($Xi,8); + &psrldq ($T1,8); # + &pxor ($Xi,$T2); + &pxor ($Xhi,$T1); # + &pshufd ($T1,$Xhn,0b01001110); + &movdqa ($T2,$Xi); # 2nd phase + &psrlq ($Xi,1); + &pxor ($T1,$Xhn); + &pxor ($Xhi,$T2); # + &pclmulqdq ($Xhn,$Hkey,0x11); ####### + &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 + &pxor ($T2,$Xi); + &psrlq ($Xi,5); + &pxor ($Xi,$T2); # + &psrlq ($Xi,1); # + &pxor ($Xi,$Xhi) # + &pclmulqdq ($T1,$T3,0x00); ####### + + &lea ($inp,&DWP(32,$inp)); + &sub ($len,0x20); + &ja (&label("mod_loop")); + +&set_label("even_tail"); + &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) + &movdqa ($Xhi,$Xi); + &pxor ($T2,$Xi); # + + &pclmulqdq ($Xi,$Hkey,0x00); ####### + &pclmulqdq ($Xhi,$Hkey,0x11); ####### + &pclmulqdq ($T2,$T3,0x10); ####### + &movdqa ($T3,&QWP(0,$const)); + + &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &xorps ($Xhi,$Xhn); + &pxor ($T1,$Xi); # aggregated Karatsuba post-processing + &pxor ($T1,$Xhi); # + + &pxor ($T2,$T1); # + + &movdqa ($T1,$T2); # + &psrldq ($T2,8); + &pslldq ($T1,8); # + &pxor ($Xhi,$T2); + &pxor ($Xi,$T1); # + + &reduction_alg9 ($Xhi,$Xi); + + &test ($len,$len); + &jnz (&label("done")); + + &movups ($Hkey,&QWP(0,$Htbl)); # load H +&set_label("odd_tail"); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &pshufb ($T1,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) + &reduction_alg9 ($Xhi,$Xi); + +&set_label("done"); + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); +&function_end("gcm_ghash_clmul"); + +} else { # Algorithm 5. Kept for reference purposes. 
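Algorithm 9 above and Algorithm 5 below differ only in how they fold the 256-bit carry-less product back into 128 bits; the operation both implement is multiplication in GF(2^128) modulo x^128+x^7+x^2+x+1. As a baseline, a bit-serial C sketch following the NIST SP 800-38D definition (illustrative only; the 0xE1 reduction constant, shifted left one bit to match the "<<1 twist", is the 0x1c2_polynomial constant used above):

    #include <stdint.h>

    /* Z = X * Y in GF(2^128), bit-reflected GCM convention. All values
     * are two big-endian 64-bit words, [0] = high half. */
    static void gf128_mul(uint64_t Z[2],
                          const uint64_t X[2], const uint64_t Y[2]) {
        uint64_t Vh = Y[0], Vl = Y[1];  /* V runs through Y, Y*x, Y*x^2, ... */
        uint64_t Zh = 0, Zl = 0;

        for (int i = 0; i < 128; i++) {
            /* bit i of X, counting from the MSB of X[0] */
            if ((i < 64 ? X[0] >> (63 - i) : X[1] >> (127 - i)) & 1) {
                Zh ^= Vh; Zl ^= Vl;
            }
            uint64_t carry = Vl & 1;                 /* x^127 coefficient */
            Vl = (Vl >> 1) | (Vh << 63);             /* V *= x */
            Vh >>= 1;
            if (carry) Vh ^= 0xe100000000000000ULL;  /* R = 0xE1 || 0^120 */
        }
        Z[0] = Zh; Z[1] = Zl;
    }

Everything in these files is an optimization of this loop: the table methods trade the per-bit reduction for per-nibble or per-byte remainder tables, and the PCLMULQDQ paths compute the whole product first and reduce it in two shift-and-XOR phases.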
+ +sub reduction_alg5 { # 19/16 times faster than Intel version +my ($Xhi,$Xi)=@_; + + # <<1 + &movdqa ($T1,$Xi); # + &movdqa ($T2,$Xhi); + &pslld ($Xi,1); + &pslld ($Xhi,1); # + &psrld ($T1,31); + &psrld ($T2,31); # + &movdqa ($T3,$T1); + &pslldq ($T1,4); + &psrldq ($T3,12); # + &pslldq ($T2,4); + &por ($Xhi,$T3); # + &por ($Xi,$T1); + &por ($Xhi,$T2); # + + # 1st phase + &movdqa ($T1,$Xi); + &movdqa ($T2,$Xi); + &movdqa ($T3,$Xi); # + &pslld ($T1,31); + &pslld ($T2,30); + &pslld ($Xi,25); # + &pxor ($T1,$T2); + &pxor ($T1,$Xi); # + &movdqa ($T2,$T1); # + &pslldq ($T1,12); + &psrldq ($T2,4); # + &pxor ($T3,$T1); + + # 2nd phase + &pxor ($Xhi,$T3); # + &movdqa ($Xi,$T3); + &movdqa ($T1,$T3); + &psrld ($Xi,1); # + &psrld ($T1,2); + &psrld ($T3,7); # + &pxor ($Xi,$T1); + &pxor ($Xhi,$T2); + &pxor ($Xi,$T3); # + &pxor ($Xi,$Xhi); # +} + +&function_begin_B("gcm_init_clmul"); + &mov ($Htbl,&wparam(0)); + &mov ($Xip,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Hkey,&QWP(0,$Xip)); + &pshufd ($Hkey,$Hkey,0b01001110);# dword swap + + # calculate H^2 + &movdqa ($Xi,$Hkey); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); + &reduction_alg5 ($Xhi,$Xi); + + &movdqu (&QWP(0,$Htbl),$Hkey); # save H + &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 + + &ret (); +&function_end_B("gcm_init_clmul"); + +&function_begin_B("gcm_gmult_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($Xn,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$Xn); + + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); + &reduction_alg5 ($Xhi,$Xi); + + &pshufb ($Xi,$Xn); + &movdqu (&QWP(0,$Xip),$Xi); + + &ret (); +&function_end_B("gcm_gmult_clmul"); + +&function_begin("gcm_ghash_clmul"); + &mov ($Xip,&wparam(0)); + &mov ($Htbl,&wparam(1)); + &mov ($inp,&wparam(2)); + &mov ($len,&wparam(3)); + + &call (&label("pic")); +&set_label("pic"); + &blindpop ($const); + &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); + + &movdqu ($Xi,&QWP(0,$Xip)); + &movdqa ($T3,&QWP(0,$const)); + &movdqu ($Hkey,&QWP(0,$Htbl)); + &pshufb ($Xi,$T3); + + &sub ($len,0x10); + &jz (&label("odd_tail")); + + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 + &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 + + &sub ($len,0x20); + &lea ($inp,&DWP(32,$inp)); # i+=2 + &jbe (&label("even_tail")); + +&set_label("mod_loop"); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) + &movdqu ($Hkey,&QWP(0,$Htbl)); # load H + + &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &pxor ($Xhi,$Xhn); + + &reduction_alg5 ($Xhi,$Xi); + + ####### + &movdqa ($T3,&QWP(0,$const)); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 + &pshufb ($T1,$T3); + &pshufb ($Xn,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 + &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 + + &sub ($len,0x20); + &lea ($inp,&DWP(32,$inp)); + &ja (&label("mod_loop")); + +&set_label("even_tail"); + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) + + &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) + &pxor ($Xhi,$Xhn); + + &reduction_alg5 ($Xhi,$Xi); + + 
&movdqa ($T3,&QWP(0,$const)); + &test ($len,$len); + &jnz (&label("done")); + + &movdqu ($Hkey,&QWP(0,$Htbl)); # load H +&set_label("odd_tail"); + &movdqu ($T1,&QWP(0,$inp)); # Ii + &pshufb ($T1,$T3); + &pxor ($Xi,$T1); # Ii+Xi + + &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) + &reduction_alg5 ($Xhi,$Xi); + + &movdqa ($T3,&QWP(0,$const)); +&set_label("done"); + &pshufb ($Xi,$T3); + &movdqu (&QWP(0,$Xip),$Xi); +&function_end("gcm_ghash_clmul"); + +} + +&set_label("bswap",64); + &data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0); + &data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial +&set_label("rem_8bit",64); + &data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E); + &data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E); + &data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E); + &data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E); + &data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E); + &data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E); + &data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E); + &data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E); + &data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE); + &data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE); + &data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE); + &data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE); + &data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E); + &data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E); + &data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE); + &data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE); + &data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E); + &data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E); + &data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E); + &data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E); + &data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E); + &data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E); + &data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E); + &data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E); + &data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE); + &data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE); + &data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE); + &data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE); + &data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E); + &data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E); + &data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE); + &data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE); +}} # $sse2 + +&set_label("rem_4bit",64); + &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S); + &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S); + &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S); + &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S); +}}} # !$x86only + +&asciz("GHASH for x86, CRYPTOGAMS by "); +&asm_finish(); + +close STDOUT; + +# A question was risen about choice of vanilla MMX. Or rather why wasn't +# SSE2 chosen instead? 
In addition to the fact that MMX runs on legacy +# CPUs such as PIII, "4-bit" MMX version was observed to provide better +# performance than *corresponding* SSE2 one even on contemporary CPUs. +# SSE2 results were provided by Peter-Michael Hager. He maintains SSE2 +# implementation featuring full range of lookup-table sizes, but with +# per-invocation lookup table setup. Latter means that table size is +# chosen depending on how much data is to be hashed in every given call, +# more data - larger table. Best reported result for Core2 is ~4 cycles +# per processed byte out of 64KB block. This number accounts even for +# 64KB table setup overhead. As discussed in gcm128.c we choose to be +# more conservative in respect to lookup table sizes, but how do the +# results compare? Minimalistic "256B" MMX version delivers ~11 cycles +# on same platform. As also discussed in gcm128.c, next in line "8-bit +# Shoup's" or "4KB" method should deliver twice the performance of +# "256B" one, in other words not worse than ~6 cycles per byte. It +# should be also be noted that in SSE2 case improvement can be "super- +# linear," i.e. more than twice, mostly because >>8 maps to single +# instruction on SSE2 register. This is unlike "4-bit" case when >>4 +# maps to same amount of instructions in both MMX and SSE2 cases. +# Bottom line is that switch to SSE2 is considered to be justifiable +# only in case we choose to implement "8-bit" method... diff --git a/crypto/aesgcm/ghash-x86_64.pl b/crypto/aesgcm/ghash-x86_64.pl new file mode 100644 index 0000000..ad94168 --- /dev/null +++ b/crypto/aesgcm/ghash-x86_64.pl @@ -0,0 +1,1766 @@ +#! /usr/bin/env perl +# Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# March, June 2010 +# +# The module implements "4-bit" GCM GHASH function and underlying +# single multiplication operation in GF(2^128). "4-bit" means that +# it uses 256 bytes per-key table [+128 bytes shared table]. GHASH +# function features so called "528B" variant utilizing additional +# 256+16 bytes of per-key storage [+512 bytes shared table]. +# Performance results are for this streamed GHASH subroutine and are +# expressed in cycles per processed byte, less is better: +# +# gcc 3.4.x(*) assembler +# +# P4 28.6 14.0 +100% +# Opteron 19.3 7.7 +150% +# Core2 17.8 8.1(**) +120% +# Atom 31.6 16.8 +88% +# VIA Nano 21.8 10.1 +115% +# +# (*) comparison is not completely fair, because C results are +# for vanilla "256B" implementation, while assembler results +# are for "528B";-) +# (**) it's mystery [to me] why Core2 result is not same as for +# Opteron; + +# May 2010 +# +# Add PCLMULQDQ version performing at 2.02 cycles per processed byte. +# See ghash-x86.pl for background information and details about coding +# techniques. 
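The kernel of every PCLMULQDQ path is the same three-multiplication Karatsuba product that clmul64x64_T2 emits as assembly. A hedged C intrinsics sketch of just that 128x128 -> 256-bit step, reduction omitted (clmul128 is an illustrative name; compile with -mpclmul -msse2):

    #include <emmintrin.h>
    #include <wmmintrin.h>

    /* Carry-less a*b -> (hi:lo), three PCLMULQDQs instead of four:
     * a*b = ah*bh*x^128 ^ ((ah^al)*(bh^bl) ^ ah*bh ^ al*bl)*x^64 ^ al*bl */
    static void clmul128(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
        __m128i t1 = _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4e)); /* al^ah */
        __m128i t2 = _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4e)); /* bl^bh */
        __m128i l  = _mm_clmulepi64_si128(a, b, 0x00);   /* al*bl */
        __m128i h  = _mm_clmulepi64_si128(a, b, 0x11);   /* ah*bh */
        __m128i m  = _mm_clmulepi64_si128(t1, t2, 0x00); /* middle term */
        m   = _mm_xor_si128(m, _mm_xor_si128(l, h));     /* Karatsuba fixup */
        *lo = _mm_xor_si128(l, _mm_slli_si128(m, 8));    /* fold m*x^64 in */
        *hi = _mm_xor_si128(h, _mm_srli_si128(m, 8));
    }

Note that 0x4e is the same 0b01001110 pshufd immediate used throughout the file to swap 64-bit halves, and the "aggregated Karatsuba post-processing" seen in the loops below simply defers the fixup XORs across several such products.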
+# +# Special thanks to David Woodhouse for +# providing access to a Westmere-based system on behalf of Intel +# Open Source Technology Centre. + +# December 2012 +# +# Overhaul: aggregate Karatsuba post-processing, improve ILP in +# reduction_alg9, increase reduction aggregate factor to 4x. As for +# the latter. ghash-x86.pl discusses that it makes lesser sense to +# increase aggregate factor. Then why increase here? Critical path +# consists of 3 independent pclmulqdq instructions, Karatsuba post- +# processing and reduction. "On top" of this we lay down aggregated +# multiplication operations, triplets of independent pclmulqdq's. As +# issue rate for pclmulqdq is limited, it makes lesser sense to +# aggregate more multiplications than it takes to perform remaining +# non-multiplication operations. 2x is near-optimal coefficient for +# contemporary Intel CPUs (therefore modest improvement coefficient), +# but not for Bulldozer. Latter is because logical SIMD operations +# are twice as slow in comparison to Intel, so that critical path is +# longer. A CPU with higher pclmulqdq issue rate would also benefit +# from higher aggregate factor... +# +# Westmere 1.78(+13%) +# Sandy Bridge 1.80(+8%) +# Ivy Bridge 1.80(+7%) +# Haswell 0.55(+93%) (if system doesn't support AVX) +# Broadwell 0.45(+110%)(if system doesn't support AVX) +# Skylake 0.44(+110%)(if system doesn't support AVX) +# Bulldozer 1.49(+27%) +# Silvermont 2.88(+13%) +# Knights L 2.12(-) (if system doesn't support AVX) +# Goldmont 1.08(+24%) + +# March 2013 +# +# ... 8x aggregate factor AVX code path is using reduction algorithm +# suggested by Shay Gueron[1]. Even though contemporary AVX-capable +# CPUs such as Sandy and Ivy Bridge can execute it, the code performs +# sub-optimally in comparison to above mentioned version. But thanks +# to Ilya Albrekht and Max Locktyukhin of Intel Corp. we knew that +# it performs in 0.41 cycles per byte on Haswell processor, in +# 0.29 on Broadwell, and in 0.36 on Skylake. +# +# Knights Landing achieves 1.09 cpb. +# +# [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +# See the notes about |$avx| in aesni-gcm-x86_64.pl; otherwise tags will be +# computed incorrectly. +# +# In upstream, this is controlled by shelling out to the compiler to check +# versions, but BoringSSL is intended to be used with pre-generated perlasm +# output, so this isn't useful anyway. 
+$avx = 1; + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +$do4xaggr=1; + +# common register layout +$nlo="%rax"; +$nhi="%rbx"; +$Zlo="%r8"; +$Zhi="%r9"; +$tmp="%r10"; +$rem_4bit = "%r11"; + +$Xi="%rdi"; +$Htbl="%rsi"; + +# per-function register layout +$cnt="%rcx"; +$rem="%rdx"; + +sub LB() { my $r=shift; $r =~ s/%[er]([a-d])x/%\1l/ or + $r =~ s/%[er]([sd]i)/%\1l/ or + $r =~ s/%[er](bp)/%\1l/ or + $r =~ s/%(r[0-9]+)[d]?/%\1b/; $r; } + +sub AUTOLOAD() # thunk [simplified] 32-bit style perlasm +{ my $opcode = $AUTOLOAD; $opcode =~ s/.*:://; + my $arg = pop; + $arg = "\$$arg" if ($arg*1 eq $arg); + $code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n"; +} + +{ my $N; + sub loop() { + my $inp = shift; + + $N++; +$code.=<<___; + xor $nlo,$nlo + xor $nhi,$nhi + mov `&LB("$Zlo")`,`&LB("$nlo")` + mov `&LB("$Zlo")`,`&LB("$nhi")` + shl \$4,`&LB("$nlo")` + mov \$14,$cnt + mov 8($Htbl,$nlo),$Zlo + mov ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + mov $Zlo,$rem + jmp .Loop$N + +.align 16 +.Loop$N: + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + mov ($inp,$cnt),`&LB("$nlo")` + shr \$4,$Zhi + xor 8($Htbl,$nhi),$Zlo + shl \$60,$tmp + xor ($Htbl,$nhi),$Zhi + mov `&LB("$nlo")`,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + shl \$4,`&LB("$nlo")` + xor $tmp,$Zlo + dec $cnt + js .Lbreak$N + + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nlo),$Zlo + shl \$60,$tmp + xor ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + xor $tmp,$Zlo + jmp .Loop$N + +.align 16 +.Lbreak$N: + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nlo),$Zlo + shl \$60,$tmp + xor ($Htbl,$nlo),$Zhi + and \$0xf0,`&LB("$nhi")` + xor ($rem_4bit,$rem,8),$Zhi + mov $Zlo,$rem + xor $tmp,$Zlo + + shr \$4,$Zlo + and \$0xf,$rem + mov $Zhi,$tmp + shr \$4,$Zhi + xor 8($Htbl,$nhi),$Zlo + shl \$60,$tmp + xor ($Htbl,$nhi),$Zhi + xor $tmp,$Zlo + xor ($rem_4bit,$rem,8),$Zhi + + bswap $Zlo + bswap $Zhi +___ +}} + +$code=<<___; +.text +#.extern OPENSSL_ia32cap_P + +.globl gcm_gmult_4bit +.type gcm_gmult_4bit,\@function,2 +.align 16 +gcm_gmult_4bit: + push %rbx + push %rbp # %rbp and others are pushed exclusively in + push %r12 # order to reuse Win64 exception handler... 
+ push %r13 + push %r14 + push %r15 + sub \$280,%rsp +.Lgmult_prologue: + + movzb 15($Xi),$Zlo + lea .Lrem_4bit(%rip),$rem_4bit +___ + &loop ($Xi); +$code.=<<___; + mov $Zlo,8($Xi) + mov $Zhi,($Xi) + + lea 280+48(%rsp),%rsi + mov -8(%rsi),%rbx + lea (%rsi),%rsp +.Lgmult_epilogue: + ret +.size gcm_gmult_4bit,.-gcm_gmult_4bit +___ + +# per-function register layout +$inp="%rdx"; +$len="%rcx"; +$rem_8bit=$rem_4bit; + +$code.=<<___; +.globl gcm_ghash_4bit +.type gcm_ghash_4bit,\@function,4 +.align 16 +gcm_ghash_4bit: + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + sub \$280,%rsp +.Lghash_prologue: + mov $inp,%r14 # reassign couple of args + mov $len,%r15 +___ +{ my $inp="%r14"; + my $dat="%edx"; + my $len="%r15"; + my @nhi=("%ebx","%ecx"); + my @rem=("%r12","%r13"); + my $Hshr4="%rbp"; + + &sub ($Htbl,-128); # size optimization + &lea ($Hshr4,"16+128(%rsp)"); + { my @lo =($nlo,$nhi); + my @hi =($Zlo,$Zhi); + + &xor ($dat,$dat); + for ($i=0,$j=-2;$i<18;$i++,$j++) { + &mov ("$j(%rsp)",&LB($dat)) if ($i>1); + &or ($lo[0],$tmp) if ($i>1); + &mov (&LB($dat),&LB($lo[1])) if ($i>0 && $i<17); + &shr ($lo[1],4) if ($i>0 && $i<17); + &mov ($tmp,$hi[1]) if ($i>0 && $i<17); + &shr ($hi[1],4) if ($i>0 && $i<17); + &mov ("8*$j($Hshr4)",$hi[0]) if ($i>1); + &mov ($hi[0],"16*$i+0-128($Htbl)") if ($i<16); + &shl (&LB($dat),4) if ($i>0 && $i<17); + &mov ("8*$j-128($Hshr4)",$lo[0]) if ($i>1); + &mov ($lo[0],"16*$i+8-128($Htbl)") if ($i<16); + &shl ($tmp,60) if ($i>0 && $i<17); + + push (@lo,shift(@lo)); + push (@hi,shift(@hi)); + } + } + &add ($Htbl,-128); + &mov ($Zlo,"8($Xi)"); + &mov ($Zhi,"0($Xi)"); + &add ($len,$inp); # pointer to the end of data + &lea ($rem_8bit,".Lrem_8bit(%rip)"); + &jmp (".Louter_loop"); + +$code.=".align 16\n.Louter_loop:\n"; + &xor ($Zhi,"($inp)"); + &mov ("%rdx","8($inp)"); + &lea ($inp,"16($inp)"); + &xor ("%rdx",$Zlo); + &mov ("($Xi)",$Zhi); + &mov ("8($Xi)","%rdx"); + &shr ("%rdx",32); + + &xor ($nlo,$nlo); + &rol ($dat,8); + &mov (&LB($nlo),&LB($dat)); + &movz ($nhi[0],&LB($dat)); + &shl (&LB($nlo),4); + &shr ($nhi[0],4); + + for ($j=11,$i=0;$i<15;$i++) { + &rol ($dat,8); + &xor ($Zlo,"8($Htbl,$nlo)") if ($i>0); + &xor ($Zhi,"($Htbl,$nlo)") if ($i>0); + &mov ($Zlo,"8($Htbl,$nlo)") if ($i==0); + &mov ($Zhi,"($Htbl,$nlo)") if ($i==0); + + &mov (&LB($nlo),&LB($dat)); + &xor ($Zlo,$tmp) if ($i>0); + &movzw ($rem[1],"($rem_8bit,$rem[1],2)") if ($i>0); + + &movz ($nhi[1],&LB($dat)); + &shl (&LB($nlo),4); + &movzb ($rem[0],"(%rsp,$nhi[0])"); + + &shr ($nhi[1],4) if ($i<14); + &and ($nhi[1],0xf0) if ($i==14); + &shl ($rem[1],48) if ($i>0); + &xor ($rem[0],$Zlo); + + &mov ($tmp,$Zhi); + &xor ($Zhi,$rem[1]) if ($i>0); + &shr ($Zlo,8); + + &movz ($rem[0],&LB($rem[0])); + &mov ($dat,"$j($Xi)") if (--$j%4==0); + &shr ($Zhi,8); + + &xor ($Zlo,"-128($Hshr4,$nhi[0],8)"); + &shl ($tmp,56); + &xor ($Zhi,"($Hshr4,$nhi[0],8)"); + + unshift (@nhi,pop(@nhi)); # "rotate" registers + unshift (@rem,pop(@rem)); + } + &movzw ($rem[1],"($rem_8bit,$rem[1],2)"); + &xor ($Zlo,"8($Htbl,$nlo)"); + &xor ($Zhi,"($Htbl,$nlo)"); + + &shl ($rem[1],48); + &xor ($Zlo,$tmp); + + &xor ($Zhi,$rem[1]); + &movz ($rem[0],&LB($Zlo)); + &shr ($Zlo,4); + + &mov ($tmp,$Zhi); + &shl (&LB($rem[0]),4); + &shr ($Zhi,4); + + &xor ($Zlo,"8($Htbl,$nhi[0])"); + &movzw ($rem[0],"($rem_8bit,$rem[0],2)"); + &shl ($tmp,60); + + &xor ($Zhi,"($Htbl,$nhi[0])"); + &xor ($Zlo,$tmp); + &shl ($rem[0],48); + + &bswap ($Zlo); + &xor ($Zhi,$rem[0]); + + &bswap ($Zhi); + &cmp ($inp,$len); + &jb (".Louter_loop"); +} 
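The byte-wide remainder table this loop indexes (.Lrem_8bit below, applied via the <<48 after the movzw lookup) is, like rem_4bit, linear over GF(2) in its index, so the 256 constants in the data section can be cross-checked or regenerated mechanically (an illustrative sketch, same conventions as the C fragments above):

    #include <stdint.h>

    /* rem_8bit[r] = XOR over each set bit i of r of (0x1C2 << i); entry r
     * folds the eight bits shifted out of Z back into the top 16 bits of
     * Z.hi. The hand-written table in the assembly avoids runtime init. */
    static void init_rem_8bit(uint16_t rem_8bit[256]) {
        for (int r = 0; r < 256; r++) {
            uint32_t v = 0;
            for (int i = 0; i < 8; i++)
                if (r & (1 << i)) v ^= (uint32_t)0x1C2 << i;
            rem_8bit[r] = (uint16_t)v;
        }
    }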
+$code.=<<___; + mov $Zlo,8($Xi) + mov $Zhi,($Xi) + + lea 280+48(%rsp),%rsi + mov -48(%rsi),%r15 + mov -40(%rsi),%r14 + mov -32(%rsi),%r13 + mov -24(%rsi),%r12 + mov -16(%rsi),%rbp + mov -8(%rsi),%rbx + lea 0(%rsi),%rsp +.Lghash_epilogue: + ret +.size gcm_ghash_4bit,.-gcm_ghash_4bit +___ + +###################################################################### +# PCLMULQDQ version. + +@_4args=$win64? ("%rcx","%rdx","%r8", "%r9") : # Win64 order + ("%rdi","%rsi","%rdx","%rcx"); # Unix order + +($Xi,$Xhi)=("%xmm0","%xmm1"); $Hkey="%xmm2"; +($T1,$T2,$T3)=("%xmm3","%xmm4","%xmm5"); + +sub clmul64x64_T2 { # minimal register pressure +my ($Xhi,$Xi,$Hkey,$HK)=@_; + +if (!defined($HK)) { $HK = $T2; +$code.=<<___; + movdqa $Xi,$Xhi # + pshufd \$0b01001110,$Xi,$T1 + pshufd \$0b01001110,$Hkey,$T2 + pxor $Xi,$T1 # + pxor $Hkey,$T2 +___ +} else { +$code.=<<___; + movdqa $Xi,$Xhi # + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 # +___ +} +$code.=<<___; + pclmulqdq \$0x00,$Hkey,$Xi ####### + pclmulqdq \$0x11,$Hkey,$Xhi ####### + pclmulqdq \$0x00,$HK,$T1 ####### + pxor $Xi,$T1 # + pxor $Xhi,$T1 # + + movdqa $T1,$T2 # + psrldq \$8,$T1 + pslldq \$8,$T2 # + pxor $T1,$Xhi + pxor $T2,$Xi # +___ +} + +sub reduction_alg9 { # 17/11 times faster than Intel version +my ($Xhi,$Xi) = @_; + +$code.=<<___; + # 1st phase + movdqa $Xi,$T2 # + movdqa $Xi,$T1 + psllq \$5,$Xi + pxor $Xi,$T1 # + psllq \$1,$Xi + pxor $T1,$Xi # + psllq \$57,$Xi # + movdqa $Xi,$T1 # + pslldq \$8,$Xi + psrldq \$8,$T1 # + pxor $T2,$Xi + pxor $T1,$Xhi # + + # 2nd phase + movdqa $Xi,$T2 + psrlq \$1,$Xi + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + pxor $T2,$Xi # + psrlq \$1,$Xi # + pxor $Xhi,$Xi # +___ +} + +{ my ($Htbl,$Xip)=@_4args; + my $HK="%xmm6"; + +$code.=<<___; +.globl gcm_init_clmul +.type gcm_init_clmul,\@abi-omnipotent +.align 16 +gcm_init_clmul: +.L_init_clmul: +___ +$code.=<<___ if ($win64); +.LSEH_begin_gcm_init_clmul: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x83,0xec,0x18 #sub $0x18,%rsp + .byte 0x0f,0x29,0x34,0x24 #movaps %xmm6,(%rsp) +___ +$code.=<<___; + movdqu ($Xip),$Hkey + pshufd \$0b01001110,$Hkey,$Hkey # dword swap + + # <<1 twist + pshufd \$0b11111111,$Hkey,$T2 # broadcast uppermost dword + movdqa $Hkey,$T1 + psllq \$1,$Hkey + pxor $T3,$T3 # + psrlq \$63,$T1 + pcmpgtd $T2,$T3 # broadcast carry bit + pslldq \$8,$T1 + por $T1,$Hkey # H<<=1 + + # magic reduction + pand .L0x1c2_polynomial(%rip),$T3 + pxor $T3,$Hkey # if(carry) H^=0x1c2_polynomial + + # calculate H^2 + pshufd \$0b01001110,$Hkey,$HK + movdqa $Hkey,$Xi + pxor $Hkey,$HK +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + pshufd \$0b01001110,$Hkey,$T1 + pshufd \$0b01001110,$Xi,$T2 + pxor $Hkey,$T1 # Karatsuba pre-processing + movdqu $Hkey,0x00($Htbl) # save H + pxor $Xi,$T2 # Karatsuba pre-processing + movdqu $Xi,0x10($Htbl) # save H^2 + palignr \$8,$T1,$T2 # low part is H.lo^H.hi... + movdqu $T2,0x20($Htbl) # save Karatsuba "salt" +___ +if ($do4xaggr) { + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H^3 + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + movdqa $Xi,$T3 +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H^4 + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + pshufd \$0b01001110,$T3,$T1 + pshufd \$0b01001110,$Xi,$T2 + pxor $T3,$T1 # Karatsuba pre-processing + movdqu $T3,0x30($Htbl) # save H^3 + pxor $Xi,$T2 # Karatsuba pre-processing + movdqu $Xi,0x40($Htbl) # save H^4 + palignr \$8,$T1,$T2 # low part is H^3.lo^H^3.hi... 
+ movdqu $T2,0x50($Htbl) # save Karatsuba "salt" +___ +} +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + lea 0x18(%rsp),%rsp +.LSEH_end_gcm_init_clmul: +___ +$code.=<<___; + ret +.size gcm_init_clmul,.-gcm_init_clmul +___ +} + +{ my ($Xip,$Htbl)=@_4args; + +$code.=<<___; +.globl gcm_gmult_clmul +.type gcm_gmult_clmul,\@abi-omnipotent +.align 16 +gcm_gmult_clmul: +.L_gmult_clmul: + movdqu ($Xip),$Xi + movdqa .Lbswap_mask(%rip),$T3 + movdqu ($Htbl),$Hkey + movdqu 0x20($Htbl),$T2 + pshufb $T3,$Xi +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2); +$code.=<<___ if (0 || (&reduction_alg9($Xhi,$Xi)&&0)); + # experimental alternative. special thing about is that there + # no dependency between the two multiplications... + mov \$`0xE1<<1`,%eax + mov \$0xA040608020C0E000,%r10 # ((7..0)·0xE0)&0xff + mov \$0x07,%r11d + movq %rax,$T1 + movq %r10,$T2 + movq %r11,$T3 # borrow $T3 + pand $Xi,$T3 + pshufb $T3,$T2 # ($Xi&7)·0xE0 + movq %rax,$T3 + pclmulqdq \$0x00,$Xi,$T1 # ·(0xE1<<1) + pxor $Xi,$T2 + pslldq \$15,$T2 + paddd $T2,$T2 # <<(64+56+1) + pxor $T2,$Xi + pclmulqdq \$0x01,$T3,$Xi + movdqa .Lbswap_mask(%rip),$T3 # reload $T3 + psrldq \$1,$T1 + pxor $T1,$Xhi + pslldq \$7,$Xi + pxor $Xhi,$Xi +___ +$code.=<<___; + pshufb $T3,$Xi + movdqu $Xi,($Xip) + ret +.size gcm_gmult_clmul,.-gcm_gmult_clmul +___ +} + +{ my ($Xip,$Htbl,$inp,$len)=@_4args; + my ($Xln,$Xmn,$Xhn,$Hkey2,$HK) = map("%xmm$_",(3..7)); + my ($T1,$T2,$T3)=map("%xmm$_",(8..10)); + +$code.=<<___; +.globl gcm_ghash_clmul +.type gcm_ghash_clmul,\@abi-omnipotent +.align 32 +gcm_ghash_clmul: +.L_ghash_clmul: +___ +$code.=<<___ if ($win64); + lea -0x88(%rsp),%rax +.LSEH_begin_gcm_ghash_clmul: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x8d,0x60,0xe0 #lea -0x20(%rax),%rsp + .byte 0x0f,0x29,0x70,0xe0 #movaps %xmm6,-0x20(%rax) + .byte 0x0f,0x29,0x78,0xf0 #movaps %xmm7,-0x10(%rax) + .byte 0x44,0x0f,0x29,0x00 #movaps %xmm8,0(%rax) + .byte 0x44,0x0f,0x29,0x48,0x10 #movaps %xmm9,0x10(%rax) + .byte 0x44,0x0f,0x29,0x50,0x20 #movaps %xmm10,0x20(%rax) + .byte 0x44,0x0f,0x29,0x58,0x30 #movaps %xmm11,0x30(%rax) + .byte 0x44,0x0f,0x29,0x60,0x40 #movaps %xmm12,0x40(%rax) + .byte 0x44,0x0f,0x29,0x68,0x50 #movaps %xmm13,0x50(%rax) + .byte 0x44,0x0f,0x29,0x70,0x60 #movaps %xmm14,0x60(%rax) + .byte 0x44,0x0f,0x29,0x78,0x70 #movaps %xmm15,0x70(%rax) +___ +$code.=<<___; + movdqa .Lbswap_mask(%rip),$T3 + + movdqu ($Xip),$Xi + movdqu ($Htbl),$Hkey + movdqu 0x20($Htbl),$HK + pshufb $T3,$Xi + + sub \$0x10,$len + jz .Lodd_tail + + movdqu 0x10($Htbl),$Hkey2 +___ +if ($do4xaggr) { +my ($Xl,$Xm,$Xh,$Hkey3,$Hkey4)=map("%xmm$_",(11..15)); + +$code.=<<___; +# leaq OPENSSL_ia32cap_P(%rip),%rax +# mov 4(%rax),%eax + cmp \$0x30,$len + jb .Lskip4x + +# and \$`1<<26|1<<22`,%eax # isolate MOVBE+XSAVE +# cmp \$`1<<22`,%eax # check for MOVBE without XSAVE +# je .Lskip4x + + sub \$0x30,$len + mov \$0xA040608020C0E000,%rax # ((7..0)·0xE0)&0xff + movdqu 0x30($Htbl),$Hkey3 + movdqu 0x40($Htbl),$Hkey4 + + ####### + # Xi+4 =[(H*Ii+3) + (H^2*Ii+2) + (H^3*Ii+1) + H^4*(Ii+Xi)] mod P + # + movdqu 0x30($inp),$Xln + movdqu 0x20($inp),$Xl + pshufb $T3,$Xln + pshufb $T3,$Xl + movdqa $Xln,$Xhn + pshufd \$0b01001110,$Xln,$Xmn + pxor $Xln,$Xmn + pclmulqdq \$0x00,$Hkey,$Xln + pclmulqdq \$0x11,$Hkey,$Xhn + pclmulqdq \$0x00,$HK,$Xmn + + movdqa $Xl,$Xh + pshufd \$0b01001110,$Xl,$Xm + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey2,$Xl + pclmulqdq \$0x11,$Hkey2,$Xh + pclmulqdq \$0x10,$HK,$Xm + xorps $Xl,$Xln + xorps $Xh,$Xhn + movups 0x50($Htbl),$HK + xorps $Xm,$Xmn + + movdqu 0x10($inp),$Xl + 
movdqu 0($inp),$T1 + pshufb $T3,$Xl + pshufb $T3,$T1 + movdqa $Xl,$Xh + pshufd \$0b01001110,$Xl,$Xm + pxor $T1,$Xi + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey3,$Xl + movdqa $Xi,$Xhi + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 + pclmulqdq \$0x11,$Hkey3,$Xh + pclmulqdq \$0x00,$HK,$Xm + xorps $Xl,$Xln + xorps $Xh,$Xhn + + lea 0x40($inp),$inp + sub \$0x40,$len + jc .Ltail4x + + jmp .Lmod4_loop +.align 32 +.Lmod4_loop: + pclmulqdq \$0x00,$Hkey4,$Xi + xorps $Xm,$Xmn + movdqu 0x30($inp),$Xl + pshufb $T3,$Xl + pclmulqdq \$0x11,$Hkey4,$Xhi + xorps $Xln,$Xi + movdqu 0x20($inp),$Xln + movdqa $Xl,$Xh + pclmulqdq \$0x10,$HK,$T1 + pshufd \$0b01001110,$Xl,$Xm + xorps $Xhn,$Xhi + pxor $Xl,$Xm + pshufb $T3,$Xln + movups 0x20($Htbl),$HK + xorps $Xmn,$T1 + pclmulqdq \$0x00,$Hkey,$Xl + pshufd \$0b01001110,$Xln,$Xmn + + pxor $Xi,$T1 # aggregated Karatsuba post-processing + movdqa $Xln,$Xhn + pxor $Xhi,$T1 # + pxor $Xln,$Xmn + movdqa $T1,$T2 # + pclmulqdq \$0x11,$Hkey,$Xh + pslldq \$8,$T1 + psrldq \$8,$T2 # + pxor $T1,$Xi + movdqa .L7_mask(%rip),$T1 + pxor $T2,$Xhi # + movq %rax,$T2 + + pand $Xi,$T1 # 1st phase + pshufb $T1,$T2 # + pxor $Xi,$T2 # + pclmulqdq \$0x00,$HK,$Xm + psllq \$57,$T2 # + movdqa $T2,$T1 # + pslldq \$8,$T2 + pclmulqdq \$0x00,$Hkey2,$Xln + psrldq \$8,$T1 # + pxor $T2,$Xi + pxor $T1,$Xhi # + movdqu 0($inp),$T1 + + movdqa $Xi,$T2 # 2nd phase + psrlq \$1,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhn + xorps $Xl,$Xln + movdqu 0x10($inp),$Xl + pshufb $T3,$Xl + pclmulqdq \$0x10,$HK,$Xmn + xorps $Xh,$Xhn + movups 0x50($Htbl),$HK + pshufb $T3,$T1 + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + + movdqa $Xl,$Xh + pxor $Xm,$Xmn + pshufd \$0b01001110,$Xl,$Xm + pxor $T2,$Xi # + pxor $T1,$Xhi + pxor $Xl,$Xm + pclmulqdq \$0x00,$Hkey3,$Xl + psrlq \$1,$Xi # + pxor $Xhi,$Xi # + movdqa $Xi,$Xhi + pclmulqdq \$0x11,$Hkey3,$Xh + xorps $Xl,$Xln + pshufd \$0b01001110,$Xi,$T1 + pxor $Xi,$T1 + + pclmulqdq \$0x00,$HK,$Xm + xorps $Xh,$Xhn + + lea 0x40($inp),$inp + sub \$0x40,$len + jnc .Lmod4_loop + +.Ltail4x: + pclmulqdq \$0x00,$Hkey4,$Xi + pclmulqdq \$0x11,$Hkey4,$Xhi + pclmulqdq \$0x10,$HK,$T1 + xorps $Xm,$Xmn + xorps $Xln,$Xi + xorps $Xhn,$Xhi + pxor $Xi,$Xhi # aggregated Karatsuba post-processing + pxor $Xmn,$T1 + + pxor $Xhi,$T1 # + pxor $Xi,$Xhi + + movdqa $T1,$T2 # + psrldq \$8,$T1 + pslldq \$8,$T2 # + pxor $T1,$Xhi + pxor $T2,$Xi # +___ + &reduction_alg9($Xhi,$Xi); +$code.=<<___; + add \$0x40,$len + jz .Ldone + movdqu 0x20($Htbl),$HK + sub \$0x10,$len + jz .Lodd_tail +.Lskip4x: +___ +} +$code.=<<___; + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # + movdqu ($inp),$T1 # Ii + movdqu 16($inp),$Xln # Ii+1 + pshufb $T3,$T1 + pshufb $T3,$Xln + pxor $T1,$Xi # Ii+Xi + + movdqa $Xln,$Xhn + pshufd \$0b01001110,$Xln,$Xmn + pxor $Xln,$Xmn + pclmulqdq \$0x00,$Hkey,$Xln + pclmulqdq \$0x11,$Hkey,$Xhn + pclmulqdq \$0x00,$HK,$Xmn + + lea 32($inp),$inp # i+=2 + nop + sub \$0x20,$len + jbe .Leven_tail + nop + jmp .Lmod_loop + +.align 32 +.Lmod_loop: + movdqa $Xi,$Xhi + movdqa $Xmn,$T1 + pshufd \$0b01001110,$Xi,$Xmn # + pxor $Xi,$Xmn # + + pclmulqdq \$0x00,$Hkey2,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhi + pclmulqdq \$0x10,$HK,$Xmn + + pxor $Xln,$Xi # (H*Ii+1) + H^2*(Ii+Xi) + pxor $Xhn,$Xhi + movdqu ($inp),$T2 # Ii + pxor $Xi,$T1 # aggregated Karatsuba post-processing + pshufb $T3,$T2 + movdqu 16($inp),$Xln # Ii+1 + + pxor $Xhi,$T1 + pxor $T2,$Xhi # "Ii+Xi", consume early + pxor $T1,$Xmn + pshufb $T3,$Xln + movdqa $Xmn,$T1 # + psrldq \$8,$T1 + pslldq \$8,$Xmn # + pxor $T1,$Xhi 
+ pxor $Xmn,$Xi # + + movdqa $Xln,$Xhn # + + movdqa $Xi,$T2 # 1st phase + movdqa $Xi,$T1 + psllq \$5,$Xi + pxor $Xi,$T1 # + pclmulqdq \$0x00,$Hkey,$Xln ####### + psllq \$1,$Xi + pxor $T1,$Xi # + psllq \$57,$Xi # + movdqa $Xi,$T1 # + pslldq \$8,$Xi + psrldq \$8,$T1 # + pxor $T2,$Xi + pshufd \$0b01001110,$Xhn,$Xmn + pxor $T1,$Xhi # + pxor $Xhn,$Xmn # + + movdqa $Xi,$T2 # 2nd phase + psrlq \$1,$Xi + pclmulqdq \$0x11,$Hkey,$Xhn ####### + pxor $T2,$Xhi # + pxor $Xi,$T2 + psrlq \$5,$Xi + pxor $T2,$Xi # + lea 32($inp),$inp + psrlq \$1,$Xi # + pclmulqdq \$0x00,$HK,$Xmn ####### + pxor $Xhi,$Xi # + + sub \$0x20,$len + ja .Lmod_loop + +.Leven_tail: + movdqa $Xi,$Xhi + movdqa $Xmn,$T1 + pshufd \$0b01001110,$Xi,$Xmn # + pxor $Xi,$Xmn # + + pclmulqdq \$0x00,$Hkey2,$Xi + pclmulqdq \$0x11,$Hkey2,$Xhi + pclmulqdq \$0x10,$HK,$Xmn + + pxor $Xln,$Xi # (H*Ii+1) + H^2*(Ii+Xi) + pxor $Xhn,$Xhi + pxor $Xi,$T1 + pxor $Xhi,$T1 + pxor $T1,$Xmn + movdqa $Xmn,$T1 # + psrldq \$8,$T1 + pslldq \$8,$Xmn # + pxor $T1,$Xhi + pxor $Xmn,$Xi # +___ + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; + test $len,$len + jnz .Ldone + +.Lodd_tail: + movdqu ($inp),$T1 # Ii + pshufb $T3,$T1 + pxor $T1,$Xi # Ii+Xi +___ + &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$HK); # H*(Ii+Xi) + &reduction_alg9 ($Xhi,$Xi); +$code.=<<___; +.Ldone: + pshufb $T3,$Xi + movdqu $Xi,($Xip) +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps 0x10(%rsp),%xmm7 + movaps 0x20(%rsp),%xmm8 + movaps 0x30(%rsp),%xmm9 + movaps 0x40(%rsp),%xmm10 + movaps 0x50(%rsp),%xmm11 + movaps 0x60(%rsp),%xmm12 + movaps 0x70(%rsp),%xmm13 + movaps 0x80(%rsp),%xmm14 + movaps 0x90(%rsp),%xmm15 + lea 0xa8(%rsp),%rsp +.LSEH_end_gcm_ghash_clmul: +___ +$code.=<<___; + ret +.size gcm_ghash_clmul,.-gcm_ghash_clmul +___ +} + +$code.=<<___; +.globl gcm_init_avx +.type gcm_init_avx,\@abi-omnipotent +.align 32 +gcm_init_avx: +___ +if ($avx) { +my ($Htbl,$Xip)=@_4args; +my $HK="%xmm6"; + +$code.=<<___ if ($win64); +.LSEH_begin_gcm_init_avx: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x83,0xec,0x18 #sub $0x18,%rsp + .byte 0x0f,0x29,0x34,0x24 #movaps %xmm6,(%rsp) +___ +$code.=<<___; + vzeroupper + + vmovdqu ($Xip),$Hkey + vpshufd \$0b01001110,$Hkey,$Hkey # dword swap + + # <<1 twist + vpshufd \$0b11111111,$Hkey,$T2 # broadcast uppermost dword + vpsrlq \$63,$Hkey,$T1 + vpsllq \$1,$Hkey,$Hkey + vpxor $T3,$T3,$T3 # + vpcmpgtd $T2,$T3,$T3 # broadcast carry bit + vpslldq \$8,$T1,$T1 + vpor $T1,$Hkey,$Hkey # H<<=1 + + # magic reduction + vpand .L0x1c2_polynomial(%rip),$T3,$T3 + vpxor $T3,$Hkey,$Hkey # if(carry) H^=0x1c2_polynomial + + vpunpckhqdq $Hkey,$Hkey,$HK + vmovdqa $Hkey,$Xi + vpxor $Hkey,$HK,$HK + mov \$4,%r10 # up to H^8 + jmp .Linit_start_avx +___ + +sub clmul64x64_avx { +my ($Xhi,$Xi,$Hkey,$HK)=@_; + +if (!defined($HK)) { $HK = $T2; +$code.=<<___; + vpunpckhqdq $Xi,$Xi,$T1 + vpunpckhqdq $Hkey,$Hkey,$T2 + vpxor $Xi,$T1,$T1 # + vpxor $Hkey,$T2,$T2 +___ +} else { +$code.=<<___; + vpunpckhqdq $Xi,$Xi,$T1 + vpxor $Xi,$T1,$T1 # +___ +} +$code.=<<___; + vpclmulqdq \$0x11,$Hkey,$Xi,$Xhi ####### + vpclmulqdq \$0x00,$Hkey,$Xi,$Xi ####### + vpclmulqdq \$0x00,$HK,$T1,$T1 ####### + vpxor $Xi,$Xhi,$T2 # + vpxor $T2,$T1,$T1 # + + vpslldq \$8,$T1,$T2 # + vpsrldq \$8,$T1,$T1 + vpxor $T2,$Xi,$Xi # + vpxor $T1,$Xhi,$Xhi +___ +} + +sub reduction_avx { +my ($Xhi,$Xi) = @_; + +$code.=<<___; + vpsllq \$57,$Xi,$T1 # 1st phase + vpsllq \$62,$Xi,$T2 + vpxor $T1,$T2,$T2 # + vpsllq \$63,$Xi,$T1 + vpxor $T1,$T2,$T2 # + vpslldq \$8,$T2,$T1 # + vpsrldq \$8,$T2,$T2 + vpxor $T1,$Xi,$Xi # + vpxor 
$T2,$Xhi,$Xhi + + vpsrlq \$1,$Xi,$T2 # 2nd phase + vpxor $Xi,$Xhi,$Xhi + vpxor $T2,$Xi,$Xi # + vpsrlq \$5,$T2,$T2 + vpxor $T2,$Xi,$Xi # + vpsrlq \$1,$Xi,$Xi # + vpxor $Xhi,$Xi,$Xi # +___ +} + +$code.=<<___; +.align 32 +.Linit_loop_avx: + vpalignr \$8,$T1,$T2,$T3 # low part is H.lo^H.hi... + vmovdqu $T3,-0x10($Htbl) # save Karatsuba "salt" +___ + &clmul64x64_avx ($Xhi,$Xi,$Hkey,$HK); # calculate H^3,5,7 + &reduction_avx ($Xhi,$Xi); +$code.=<<___; +.Linit_start_avx: + vmovdqa $Xi,$T3 +___ + &clmul64x64_avx ($Xhi,$Xi,$Hkey,$HK); # calculate H^2,4,6,8 + &reduction_avx ($Xhi,$Xi); +$code.=<<___; + vpshufd \$0b01001110,$T3,$T1 + vpshufd \$0b01001110,$Xi,$T2 + vpxor $T3,$T1,$T1 # Karatsuba pre-processing + vmovdqu $T3,0x00($Htbl) # save H^1,3,5,7 + vpxor $Xi,$T2,$T2 # Karatsuba pre-processing + vmovdqu $Xi,0x10($Htbl) # save H^2,4,6,8 + lea 0x30($Htbl),$Htbl + sub \$1,%r10 + jnz .Linit_loop_avx + + vpalignr \$8,$T2,$T1,$T3 # last "salt" is flipped + vmovdqu $T3,-0x10($Htbl) + + vzeroupper +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + lea 0x18(%rsp),%rsp +.LSEH_end_gcm_init_avx: +___ +$code.=<<___; + ret +.size gcm_init_avx,.-gcm_init_avx +___ +} else { +$code.=<<___; + jmp .L_init_clmul +.size gcm_init_avx,.-gcm_init_avx +___ +} + +$code.=<<___; +.globl gcm_gmult_avx +.type gcm_gmult_avx,\@abi-omnipotent +.align 32 +gcm_gmult_avx: + jmp .L_gmult_clmul +.size gcm_gmult_avx,.-gcm_gmult_avx +___ + +$code.=<<___; +.globl gcm_ghash_avx +.type gcm_ghash_avx,\@abi-omnipotent +.align 32 +gcm_ghash_avx: +___ +if ($avx) { +my ($Xip,$Htbl,$inp,$len)=@_4args; +my ($Xlo,$Xhi,$Xmi, + $Zlo,$Zhi,$Zmi, + $Hkey,$HK,$T1,$T2, + $Xi,$Xo,$Tred,$bswap,$Ii,$Ij) = map("%xmm$_",(0..15)); + +$code.=<<___ if ($win64); + lea -0x88(%rsp),%rax +.LSEH_begin_gcm_ghash_avx: + # I can't trust assembler to use specific encoding:-( + .byte 0x48,0x8d,0x60,0xe0 #lea -0x20(%rax),%rsp + .byte 0x0f,0x29,0x70,0xe0 #movaps %xmm6,-0x20(%rax) + .byte 0x0f,0x29,0x78,0xf0 #movaps %xmm7,-0x10(%rax) + .byte 0x44,0x0f,0x29,0x00 #movaps %xmm8,0(%rax) + .byte 0x44,0x0f,0x29,0x48,0x10 #movaps %xmm9,0x10(%rax) + .byte 0x44,0x0f,0x29,0x50,0x20 #movaps %xmm10,0x20(%rax) + .byte 0x44,0x0f,0x29,0x58,0x30 #movaps %xmm11,0x30(%rax) + .byte 0x44,0x0f,0x29,0x60,0x40 #movaps %xmm12,0x40(%rax) + .byte 0x44,0x0f,0x29,0x68,0x50 #movaps %xmm13,0x50(%rax) + .byte 0x44,0x0f,0x29,0x70,0x60 #movaps %xmm14,0x60(%rax) + .byte 0x44,0x0f,0x29,0x78,0x70 #movaps %xmm15,0x70(%rax) +___ +$code.=<<___; + vzeroupper + + vmovdqu ($Xip),$Xi # load $Xi + lea .L0x1c2_polynomial(%rip),%r10 + lea 0x40($Htbl),$Htbl # size optimization + vmovdqu .Lbswap_mask(%rip),$bswap + vpshufb $bswap,$Xi,$Xi + cmp \$0x80,$len + jb .Lshort_avx + sub \$0x80,$len + + vmovdqu 0x70($inp),$Ii # I[7] + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vpshufb $bswap,$Ii,$Ii + vmovdqu 0x20-0x40($Htbl),$HK + + vpunpckhqdq $Ii,$Ii,$T2 + vmovdqu 0x60($inp),$Ij # I[6] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Ii,$T2,$T2 + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpunpckhqdq $Ij,$Ij,$T1 + vmovdqu 0x50($inp),$Ii # I[5] + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpxor $Ii,$T2,$T2 + vmovdqu 0x40($inp),$Ij # I[4] + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0x50-0x40($Htbl),$HK + + vpshufb $bswap,$Ij,$Ij + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor 
$Xhi,$Zhi,$Zhi + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vmovdqu 0x30($inp),$Ii # I[3] + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Zhi,$Xhi,$Xhi + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpxor $Zmi,$Xmi,$Xmi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0x80-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu 0x20($inp),$Ij # I[2] + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Xhi,$Zhi,$Zhi + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpxor $Xmi,$Zmi,$Zmi + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x00,$HK,$T2,$Xmi + vpxor $Ij,$T1,$T1 + + vmovdqu 0x10($inp),$Ii # I[1] + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Zhi,$Xhi,$Xhi + vpshufb $bswap,$Ii,$Ii + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpxor $Zmi,$Xmi,$Xmi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK,$T1,$Zmi + vmovdqu 0xb0-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu ($inp),$Ij # I[0] + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Xhi,$Zhi,$Zhi + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0xa0-0x40($Htbl),$Hkey # $Hkey^8 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x10,$HK,$T2,$Xmi + + lea 0x80($inp),$inp + cmp \$0x80,$len + jb .Ltail_avx + + vpxor $Xi,$Ij,$Ij # accumulate $Xi + sub \$0x80,$len + jmp .Loop8x_avx + +.align 32 +.Loop8x_avx: + vpunpckhqdq $Ij,$Ij,$T1 + vmovdqu 0x70($inp),$Ii # I[7] + vpxor $Xlo,$Zlo,$Zlo + vpxor $Ij,$T1,$T1 + vpclmulqdq \$0x00,$Hkey,$Ij,$Xi + vpshufb $bswap,$Ii,$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xo + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vpunpckhqdq $Ii,$Ii,$T2 + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Tred + vmovdqu 0x20-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + + vmovdqu 0x60($inp),$Ij # I[6] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpxor $Zlo,$Xi,$Xi # collect result + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vxorps $Zhi,$Xo,$Xo + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpunpckhqdq $Ij,$Ij,$T1 + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vpxor $Zmi,$Tred,$Tred + vxorps $Ij,$T1,$T1 + + vmovdqu 0x50($inp),$Ii # I[5] + vpxor $Xi,$Tred,$Tred # aggregated Karatsuba post-processing + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpxor $Xo,$Tred,$Tred + vpslldq \$8,$Tred,$T2 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vpsrldq \$8,$Tred,$Tred + vpxor $T2, $Xi, $Xi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpshufb $bswap,$Ii,$Ii + vxorps $Tred,$Xo, $Xo + vpxor $Xhi,$Zhi,$Zhi + vpunpckhqdq $Ii,$Ii,$T2 + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0x50-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu 0x40($inp),$Ij # I[4] + vpalignr \$8,$Xi,$Xi,$Tred # 1st phase + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Zhi,$Xhi,$Xhi + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vxorps $Ij,$T1,$T1 + vpxor $Zmi,$Xmi,$Xmi + + vmovdqu 0x30($inp),$Ii # I[3] + vpclmulqdq \$0x10,(%r10),$Xi,$Xi + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpshufb $bswap,$Ii,$Ii + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpunpckhqdq $Ii,$Ii,$T2 + vpxor 
$Xhi,$Zhi,$Zhi + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0x80-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu 0x20($inp),$Ij # I[2] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpxor $Zlo,$Xlo,$Xlo + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Zhi,$Xhi,$Xhi + vpclmulqdq \$0x00,$HK, $T2,$Xmi + vpxor $Ij,$T1,$T1 + vpxor $Zmi,$Xmi,$Xmi + vxorps $Tred,$Xi,$Xi + + vmovdqu 0x10($inp),$Ii # I[1] + vpalignr \$8,$Xi,$Xi,$Tred # 2nd phase + vpclmulqdq \$0x00,$Hkey,$Ij,$Zlo + vpshufb $bswap,$Ii,$Ii + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x11,$Hkey,$Ij,$Zhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpclmulqdq \$0x10,(%r10),$Xi,$Xi + vxorps $Xo,$Tred,$Tred + vpunpckhqdq $Ii,$Ii,$T2 + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x10,$HK, $T1,$Zmi + vmovdqu 0xb0-0x40($Htbl),$HK + vpxor $Ii,$T2,$T2 + vpxor $Xmi,$Zmi,$Zmi + + vmovdqu ($inp),$Ij # I[0] + vpclmulqdq \$0x00,$Hkey,$Ii,$Xlo + vpshufb $bswap,$Ij,$Ij + vpclmulqdq \$0x11,$Hkey,$Ii,$Xhi + vmovdqu 0xa0-0x40($Htbl),$Hkey # $Hkey^8 + vpxor $Tred,$Ij,$Ij + vpclmulqdq \$0x10,$HK, $T2,$Xmi + vpxor $Xi,$Ij,$Ij # accumulate $Xi + + lea 0x80($inp),$inp + sub \$0x80,$len + jnc .Loop8x_avx + + add \$0x80,$len + jmp .Ltail_no_xor_avx + +.align 32 +.Lshort_avx: + vmovdqu -0x10($inp,$len),$Ii # very last word + lea ($inp,$len),$inp + vmovdqu 0x00-0x40($Htbl),$Hkey # $Hkey^1 + vmovdqu 0x20-0x40($Htbl),$HK + vpshufb $bswap,$Ii,$Ij + + vmovdqa $Xlo,$Zlo # subtle way to zero $Zlo, + vmovdqa $Xhi,$Zhi # $Zhi and + vmovdqa $Xmi,$Zmi # $Zmi + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x20($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x10-0x40($Htbl),$Hkey # $Hkey^2 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x30($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x30-0x40($Htbl),$Hkey # $Hkey^3 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovdqu 0x50-0x40($Htbl),$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x40($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x40-0x40($Htbl),$Hkey # $Hkey^4 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x50($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x60-0x40($Htbl),$Hkey # $Hkey^5 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovdqu 0x80-0x40($Htbl),$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x60($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x70-0x40($Htbl),$Hkey # $Hkey^6 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vpsrldq \$8,$HK,$HK + sub \$0x10,$len + jz .Ltail_avx + + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq 
\$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vmovdqu -0x70($inp),$Ii + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vmovdqu 0x90-0x40($Htbl),$Hkey # $Hkey^7 + vpshufb $bswap,$Ii,$Ij + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + vmovq 0xb8-0x40($Htbl),$HK + sub \$0x10,$len + jmp .Ltail_avx + +.align 32 +.Ltail_avx: + vpxor $Xi,$Ij,$Ij # accumulate $Xi +.Ltail_no_xor_avx: + vpunpckhqdq $Ij,$Ij,$T1 + vpxor $Xlo,$Zlo,$Zlo + vpclmulqdq \$0x00,$Hkey,$Ij,$Xlo + vpxor $Ij,$T1,$T1 + vpxor $Xhi,$Zhi,$Zhi + vpclmulqdq \$0x11,$Hkey,$Ij,$Xhi + vpxor $Xmi,$Zmi,$Zmi + vpclmulqdq \$0x00,$HK,$T1,$Xmi + + vmovdqu (%r10),$Tred + + vpxor $Xlo,$Zlo,$Xi + vpxor $Xhi,$Zhi,$Xo + vpxor $Xmi,$Zmi,$Zmi + + vpxor $Xi, $Zmi,$Zmi # aggregated Karatsuba post-processing + vpxor $Xo, $Zmi,$Zmi + vpslldq \$8, $Zmi,$T2 + vpsrldq \$8, $Zmi,$Zmi + vpxor $T2, $Xi, $Xi + vpxor $Zmi,$Xo, $Xo + + vpclmulqdq \$0x10,$Tred,$Xi,$T2 # 1st phase + vpalignr \$8,$Xi,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + vpclmulqdq \$0x10,$Tred,$Xi,$T2 # 2nd phase + vpalignr \$8,$Xi,$Xi,$Xi + vpxor $Xo,$Xi,$Xi + vpxor $T2,$Xi,$Xi + + cmp \$0,$len + jne .Lshort_avx + + vpshufb $bswap,$Xi,$Xi + vmovdqu $Xi,($Xip) + vzeroupper +___ +$code.=<<___ if ($win64); + movaps (%rsp),%xmm6 + movaps 0x10(%rsp),%xmm7 + movaps 0x20(%rsp),%xmm8 + movaps 0x30(%rsp),%xmm9 + movaps 0x40(%rsp),%xmm10 + movaps 0x50(%rsp),%xmm11 + movaps 0x60(%rsp),%xmm12 + movaps 0x70(%rsp),%xmm13 + movaps 0x80(%rsp),%xmm14 + movaps 0x90(%rsp),%xmm15 + lea 0xa8(%rsp),%rsp +.LSEH_end_gcm_ghash_avx: +___ +$code.=<<___; + ret +.size gcm_ghash_avx,.-gcm_ghash_avx +___ +} else { +$code.=<<___; + jmp .L_ghash_clmul +.size gcm_ghash_avx,.-gcm_ghash_avx +___ +} + +$code.=<<___; +.align 64 +.Lbswap_mask: + .byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.L0x1c2_polynomial: + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.L7_mask: + .long 7,0,7,0 +.L7_mask_poly: + .long 7,0,`0xE1<<1`,0 +.align 64 +.type .Lrem_4bit,\@object +.Lrem_4bit: + .long 0,`0x0000<<16`,0,`0x1C20<<16`,0,`0x3840<<16`,0,`0x2460<<16` + .long 0,`0x7080<<16`,0,`0x6CA0<<16`,0,`0x48C0<<16`,0,`0x54E0<<16` + .long 0,`0xE100<<16`,0,`0xFD20<<16`,0,`0xD940<<16`,0,`0xC560<<16` + .long 0,`0x9180<<16`,0,`0x8DA0<<16`,0,`0xA9C0<<16`,0,`0xB5E0<<16` +.type .Lrem_8bit,\@object +.Lrem_8bit: + .value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E + .value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E + .value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E + .value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E + .value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E + .value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E + .value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E + .value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E + .value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE + .value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE + .value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE + .value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE + .value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E + .value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E + .value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE + .value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE + .value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E + .value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E + .value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E + .value 
0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E + .value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E + .value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E + .value 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E + .value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E + .value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE + .value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE + .value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE + .value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE + .value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E + .value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E + .value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE + .value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +.asciz "GHASH for x86_64, CRYPTOGAMS by <appro\@openssl.org>" +.align 64 +___ + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +.type se_handler,\@abi-omnipotent +.align 16 +se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lin_prologue + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lin_prologue + + lea 48+280(%rax),%rax # adjust "rsp" + + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + mov -48(%rax),%r15 + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + mov %r15,240($context) # restore context->R15 + +.Lin_prologue: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$`1232/8`,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size se_handler,.-se_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_gcm_gmult_4bit + .rva .LSEH_end_gcm_gmult_4bit + .rva .LSEH_info_gcm_gmult_4bit + + .rva .LSEH_begin_gcm_ghash_4bit + .rva .LSEH_end_gcm_ghash_4bit + .rva .LSEH_info_gcm_ghash_4bit + + .rva .LSEH_begin_gcm_init_clmul + .rva .LSEH_end_gcm_init_clmul + .rva 
.LSEH_info_gcm_init_clmul + + .rva .LSEH_begin_gcm_ghash_clmul + .rva .LSEH_end_gcm_ghash_clmul + .rva .LSEH_info_gcm_ghash_clmul +___ +$code.=<<___ if ($avx); + .rva .LSEH_begin_gcm_init_avx + .rva .LSEH_end_gcm_init_avx + .rva .LSEH_info_gcm_init_clmul + + .rva .LSEH_begin_gcm_ghash_avx + .rva .LSEH_end_gcm_ghash_avx + .rva .LSEH_info_gcm_ghash_clmul +___ +$code.=<<___; +.section .xdata +.align 8 +.LSEH_info_gcm_gmult_4bit: + .byte 9,0,0,0 + .rva se_handler + .rva .Lgmult_prologue,.Lgmult_epilogue # HandlerData +.LSEH_info_gcm_ghash_4bit: + .byte 9,0,0,0 + .rva se_handler + .rva .Lghash_prologue,.Lghash_epilogue # HandlerData +.LSEH_info_gcm_init_clmul: + .byte 0x01,0x08,0x03,0x00 + .byte 0x08,0x68,0x00,0x00 #movaps 0x00(rsp),xmm6 + .byte 0x04,0x22,0x00,0x00 #sub rsp,0x18 +.LSEH_info_gcm_ghash_clmul: + .byte 0x01,0x33,0x16,0x00 + .byte 0x33,0xf8,0x09,0x00 #movaps 0x90(rsp),xmm15 + .byte 0x2e,0xe8,0x08,0x00 #movaps 0x80(rsp),xmm14 + .byte 0x29,0xd8,0x07,0x00 #movaps 0x70(rsp),xmm13 + .byte 0x24,0xc8,0x06,0x00 #movaps 0x60(rsp),xmm12 + .byte 0x1f,0xb8,0x05,0x00 #movaps 0x50(rsp),xmm11 + .byte 0x1a,0xa8,0x04,0x00 #movaps 0x40(rsp),xmm10 + .byte 0x15,0x98,0x03,0x00 #movaps 0x30(rsp),xmm9 + .byte 0x10,0x88,0x02,0x00 #movaps 0x20(rsp),xmm8 + .byte 0x0c,0x78,0x01,0x00 #movaps 0x10(rsp),xmm7 + .byte 0x08,0x68,0x00,0x00 #movaps 0x00(rsp),xmm6 + .byte 0x04,0x01,0x15,0x00 #sub rsp,0xa8 +___ +} + +$code =~ s/\`([^\`]*)\`/eval($1)/gem; + +print $code; + +close STDOUT; diff --git a/crypto/aesgcm/ghash_x64_gas.s b/crypto/aesgcm/ghash_x64_gas.s new file mode 100644 index 0000000..07d5456 --- /dev/null +++ b/crypto/aesgcm/ghash_x64_gas.s @@ -0,0 +1,1795 @@ +.text + + +.globl gcm_gmult_4bit +.type gcm_gmult_4bit,@function +.align 16 +gcm_gmult_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +.Lgmult_prologue: + + movzbq 15(%rdi),%r8 + leaq .Lrem_4bit(%rip),%r11 + xorq %rax,%rax + xorq %rbx,%rbx + movb %r8b,%al + movb %r8b,%bl + shlb $4,%al + movq $14,%rcx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + movq %r8,%rdx + jmp .Loop1 + +.align 16 +.Loop1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + movb (%rdi,%rcx,1),%al + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + movb %al,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + shlb $4,%al + xorq %r10,%r8 + decq %rcx + js .Lbreak1 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + jmp .Loop1 + +.align 16 +.Lbreak1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + xorq %r10,%r8 + xorq (%r11,%rdx,8),%r9 + + bswapq %r8 + bswapq %r9 + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -8(%rsi),%rbx + leaq (%rsi),%rsp +.Lgmult_epilogue: + ret +.size gcm_gmult_4bit,.-gcm_gmult_4bit +.globl gcm_ghash_4bit +.type gcm_ghash_4bit,@function +.align 16 +gcm_ghash_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +.Lghash_prologue: + movq %rdx,%r14 + movq %rcx,%r15 + subq $-128,%rsi + leaq 16+128(%rsp),%rbp + xorl %edx,%edx + movq 0+0-128(%rsi),%r8 + movq 
0+8-128(%rsi),%rax + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq 16+0-128(%rsi),%r9 + shlb $4,%dl + movq 16+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,0(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,0(%rbp) + movq 32+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,0-128(%rbp) + movq 32+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,1(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,8(%rbp) + movq 48+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,8-128(%rbp) + movq 48+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,2(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,16(%rbp) + movq 64+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,16-128(%rbp) + movq 64+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,3(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,24(%rbp) + movq 80+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,24-128(%rbp) + movq 80+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,4(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,32(%rbp) + movq 96+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,32-128(%rbp) + movq 96+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,5(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,40(%rbp) + movq 112+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,40-128(%rbp) + movq 112+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,6(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,48(%rbp) + movq 128+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,48-128(%rbp) + movq 128+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,7(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,56(%rbp) + movq 144+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,56-128(%rbp) + movq 144+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,8(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,64(%rbp) + movq 160+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,64-128(%rbp) + movq 160+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,9(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,72(%rbp) + movq 176+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,72-128(%rbp) + movq 176+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,10(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,80(%rbp) + movq 192+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,80-128(%rbp) + movq 192+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,11(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,88(%rbp) + movq 208+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,88-128(%rbp) + movq 208+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,12(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,96(%rbp) + movq 224+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,96-128(%rbp) + movq 224+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,13(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,104(%rbp) + movq 240+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,104-128(%rbp) + movq 240+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,14(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,112(%rbp) + shlb $4,%dl + movq %rax,112-128(%rbp) + shlq $60,%r10 + movb %dl,15(%rsp) + orq %r10,%rbx + movq %r9,120(%rbp) + movq 
%rbx,120-128(%rbp) + addq $-128,%rsi + movq 8(%rdi),%r8 + movq 0(%rdi),%r9 + addq %r14,%r15 + leaq .Lrem_8bit(%rip),%r11 + jmp .Louter_loop +.align 16 +.Louter_loop: + xorq (%r14),%r9 + movq 8(%r14),%rdx + leaq 16(%r14),%r14 + xorq %r8,%rdx + movq %r9,(%rdi) + movq %rdx,8(%rdi) + shrq $32,%rdx + xorq %rax,%rax + roll $8,%edx + movb %dl,%al + movzbl %dl,%ebx + shlb $4,%al + shrl $4,%ebx + roll $8,%edx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + movb %dl,%al + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + xorq %r8,%r12 + movq %r9,%r10 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 8(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + 
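Every routine in this file accelerates the same GF(2^128) multiply that the textbook bit-serial method spells out; the unrolled loop running here merely consumes eight bits of hash state per table lookup instead of one. A minimal reference sketch in C, following the NIST SP 800-38D description rather than any TunSafe source (gf128_mul_ref is our naming):

    #include <stdint.h>
    #include <string.h>

    /* Reference GF(2^128) multiply in GCM's bit-reflected representation.
     * X and H are 16 big-endian bytes, matching the Xi/Htable arguments of
     * the assembly; the table methods precompute multiples of H so they can
     * take 4 or 8 bits of X per step instead of the 1 bit taken here. */
    static void gf128_mul_ref(uint8_t X[16], const uint8_t H[16])
    {
        uint8_t Z[16] = {0}, V[16];
        memcpy(V, H, 16);
        for (int i = 0; i < 128; i++) {
            if (X[i >> 3] & (0x80 >> (i & 7)))          /* bit i of X */
                for (int j = 0; j < 16; j++)
                    Z[j] ^= V[j];                       /* Z ^= V */
            int carry = V[15] & 1;                      /* V *= x ... */
            for (int j = 15; j > 0; j--)
                V[j] = (uint8_t)((V[j] >> 1) | (V[j - 1] << 7));
            V[0] >>= 1;
            if (carry)
                V[0] ^= 0xe1;       /* ... reduced by the GCM polynomial */
        }
        memcpy(X, Z, 16);
    }
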
shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 0(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + andl $240,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl -4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + movzwq (%r11,%r12,2),%r12 + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + shlq $48,%r12 + xorq %r10,%r8 + xorq %r12,%r9 + movzbq %r8b,%r13 + shrq $4,%r8 + movq %r9,%r10 + shlb $4,%r13b + shrq $4,%r9 + xorq 8(%rsi,%rcx,1),%r8 + movzwq (%r11,%r13,2),%r13 + shlq $60,%r10 + xorq (%rsi,%rcx,1),%r9 + xorq %r10,%r8 + shlq $48,%r13 + bswapq %r8 + xorq %r13,%r9 + bswapq %r9 + cmpq %r15,%r14 + jb .Louter_loop + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -48(%rsi),%r15 + movq -40(%rsi),%r14 + movq -32(%rsi),%r13 + movq -24(%rsi),%r12 + movq -16(%rsi),%rbp + movq -8(%rsi),%rbx + leaq 0(%rsi),%rsp +.Lghash_epilogue: + ret +.size gcm_ghash_4bit,.-gcm_ghash_4bit +.globl gcm_init_clmul +.type gcm_init_clmul,@function +.align 16 +gcm_init_clmul: +.L_init_clmul: + movdqu (%rsi),%xmm2 + pshufd $78,%xmm2,%xmm2 + + + pshufd $255,%xmm2,%xmm4 + movdqa %xmm2,%xmm3 + psllq $1,%xmm2 + pxor %xmm5,%xmm5 + psrlq $63,%xmm3 + pcmpgtd %xmm4,%xmm5 + pslldq $8,%xmm3 + 
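The sequence in flight here (the por and pand just below complete it) is gcm_init_clmul doubling H in the field: shift the 128-bit value left by one, then fold the shifted-out top bit back in with the .L0x1c2_polynomial constant, selected via the pcmpgtd sign trick. A hedged intrinsics sketch of the same step (gcm_shift_h is our illustrative name, not an exported symbol):

    #include <emmintrin.h>          /* SSE2 intrinsics */

    /* H <<= 1 modulo the (bit-reflected) GCM polynomial. The compare
     * builds an all-ones mask exactly when H's top bit is set, mirroring
     * the pshufd $255 / pcmpgtd / pand steps in the assembly above. */
    static __m128i gcm_shift_h(__m128i H)
    {
        const __m128i poly  = _mm_set_epi32((int)0xc2000000, 0, 0, 1);
        __m128i carry = _mm_slli_si128(_mm_srli_epi64(H, 63), 8); /* bit 63 -> 64 */
        __m128i mask  = _mm_cmpgt_epi32(_mm_setzero_si128(),
                                        _mm_shuffle_epi32(H, 0xff));
        H = _mm_or_si128(_mm_slli_epi64(H, 1), carry);
        return _mm_xor_si128(H, _mm_and_si128(mask, poly));
    }
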
por %xmm3,%xmm2 + + + pand .L0x1c2_polynomial(%rip),%xmm5 + pxor %xmm5,%xmm2 + + + pshufd $78,%xmm2,%xmm6 + movdqa %xmm2,%xmm0 + pxor %xmm2,%xmm6 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm2,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm2,%xmm3 + movdqu %xmm2,0(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,16(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,32(%rdi) + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm5 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm5,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm5,%xmm3 + movdqu %xmm5,48(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,64(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,80(%rdi) + ret +.size gcm_init_clmul,.-gcm_init_clmul +.globl gcm_gmult_clmul +.type gcm_gmult_clmul,@function +.align 16 +gcm_gmult_clmul: +.L_gmult_clmul: + movdqu (%rdi),%xmm0 + movdqa .Lbswap_mask(%rip),%xmm5 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm4 + pshufb %xmm5,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,220,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufb %xmm5,%xmm0 + movdqu %xmm0,(%rdi) + ret +.size gcm_gmult_clmul,.-gcm_gmult_clmul +.globl 
gcm_ghash_clmul +.type gcm_ghash_clmul,@function +.align 32 +gcm_ghash_clmul: +.L_ghash_clmul: + movdqa .Lbswap_mask(%rip),%xmm10 + + movdqu (%rdi),%xmm0 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm7 + pshufb %xmm10,%xmm0 + + subq $0x10,%rcx + jz .Lodd_tail + + movdqu 16(%rsi),%xmm6 + + + cmpq $0x30,%rcx + jb .Lskip4x + + + + + + subq $0x30,%rcx + movq $0xA040608020C0E000,%rax + movdqu 48(%rsi),%xmm14 + movdqu 64(%rsi),%xmm15 + + + + + movdqu 48(%rdx),%xmm3 + movdqu 32(%rdx),%xmm11 + pshufb %xmm10,%xmm3 + pshufb %xmm10,%xmm11 + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm11,%xmm12 +.byte 102,68,15,58,68,222,0 +.byte 102,68,15,58,68,238,17 +.byte 102,68,15,58,68,231,16 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + xorps %xmm12,%xmm4 + + movdqu 16(%rdx),%xmm11 + movdqu 0(%rdx),%xmm8 + pshufb %xmm10,%xmm11 + pshufb %xmm10,%xmm8 + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm8,%xmm0 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 +.byte 102,69,15,58,68,238,17 +.byte 102,68,15,58,68,231,0 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jc .Ltail4x + + jmp .Lmod4_loop +.align 32 +.Lmod4_loop: +.byte 102,65,15,58,68,199,0 + xorps %xmm12,%xmm4 + movdqu 48(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,65,15,58,68,207,17 + xorps %xmm3,%xmm0 + movdqu 32(%rdx),%xmm3 + movdqa %xmm11,%xmm13 +.byte 102,68,15,58,68,199,16 + pshufd $78,%xmm11,%xmm12 + xorps %xmm5,%xmm1 + pxor %xmm11,%xmm12 + pshufb %xmm10,%xmm3 + movups 32(%rsi),%xmm7 + xorps %xmm4,%xmm8 +.byte 102,68,15,58,68,218,0 + pshufd $78,%xmm3,%xmm4 + + pxor %xmm0,%xmm8 + movdqa %xmm3,%xmm5 + pxor %xmm1,%xmm8 + pxor %xmm3,%xmm4 + movdqa %xmm8,%xmm9 +.byte 102,68,15,58,68,234,17 + pslldq $8,%xmm8 + psrldq $8,%xmm9 + pxor %xmm8,%xmm0 + movdqa .L7_mask(%rip),%xmm8 + pxor %xmm9,%xmm1 +.byte 102,76,15,110,200 + + pand %xmm0,%xmm8 + pshufb %xmm8,%xmm9 + pxor %xmm0,%xmm9 +.byte 102,68,15,58,68,231,0 + psllq $57,%xmm9 + movdqa %xmm9,%xmm8 + pslldq $8,%xmm9 +.byte 102,15,58,68,222,0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + movdqu 0(%rdx),%xmm8 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,238,17 + xorps %xmm11,%xmm3 + movdqu 16(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,15,58,68,231,16 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + pshufb %xmm10,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + + movdqa %xmm11,%xmm13 + pxor %xmm12,%xmm4 + pshufd $78,%xmm11,%xmm12 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm1 +.byte 102,69,15,58,68,238,17 + xorps %xmm11,%xmm3 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 + +.byte 102,68,15,58,68,231,0 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jnc .Lmod4_loop + +.Ltail4x: +.byte 102,65,15,58,68,199,0 +.byte 102,65,15,58,68,207,17 +.byte 102,68,15,58,68,199,16 + xorps %xmm12,%xmm4 + xorps %xmm3,%xmm0 + xorps %xmm5,%xmm1 + pxor %xmm0,%xmm1 + pxor %xmm4,%xmm8 + + pxor %xmm1,%xmm8 + pxor %xmm0,%xmm1 + + movdqa %xmm8,%xmm9 + psrldq $8,%xmm8 + pslldq $8,%xmm9 + pxor %xmm8,%xmm1 + pxor %xmm9,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + 
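This tail is the reduction that every CLMUL path in the file shares: three PCLMULQDQs arranged as Karatsuba produce a 256-bit product, and the psllq 5/1/57 cascade just below (then the matching psrlq 1/5/1 cascade) folds it back to 128 bits. A sketch of one whole multiply-and-reduce in intrinsics, assuming byte-reflected inputs and b being the pre-shifted H from gcm_init_clmul (gf128_mul_clmul is our illustrative name):

    #include <emmintrin.h>
    #include <wmmintrin.h>          /* _mm_clmulepi64_si128 (PCLMULQDQ) */

    static __m128i gf128_mul_clmul(__m128i a, __m128i b)
    {
        /* Karatsuba: three carry-less multiplies instead of four. */
        __m128i lo  = _mm_clmulepi64_si128(a, b, 0x00);        /* a.lo*b.lo */
        __m128i hi  = _mm_clmulepi64_si128(a, b, 0x11);        /* a.hi*b.hi */
        __m128i mid = _mm_clmulepi64_si128(
            _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4e)),      /* pshufd $78 */
            _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4e)), 0x00);
        mid = _mm_xor_si128(mid, _mm_xor_si128(lo, hi));
        lo  = _mm_xor_si128(lo, _mm_slli_si128(mid, 8));
        hi  = _mm_xor_si128(hi, _mm_srli_si128(mid, 8));

        /* Phase 1: lo * (x^57 + x^62 + x^63), i.e. the psllq 5/1/57 runs. */
        __m128i t = _mm_xor_si128(_mm_slli_epi64(lo, 57),
                    _mm_xor_si128(_mm_slli_epi64(lo, 62),
                                  _mm_slli_epi64(lo, 63)));
        lo = _mm_xor_si128(lo, _mm_slli_si128(t, 8));
        hi = _mm_xor_si128(hi, _mm_srli_si128(t, 8));

        /* Phase 2: the psrlq 1/5/1 runs, then the final accumulation. */
        __m128i r = _mm_xor_si128(_mm_srli_epi64(lo, 1),
                    _mm_xor_si128(_mm_srli_epi64(lo, 2),
                                  _mm_srli_epi64(lo, 7)));
        return _mm_xor_si128(_mm_xor_si128(hi, lo), r);
    }
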
psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + addq $0x40,%rcx + jz .Ldone + movdqu 32(%rsi),%xmm7 + subq $0x10,%rcx + jz .Lodd_tail +.Lskip4x: + + + + + + movdqu (%rdx),%xmm8 + movdqu 16(%rdx),%xmm3 + pshufb %xmm10,%xmm8 + pshufb %xmm10,%xmm3 + pxor %xmm8,%xmm0 + + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + leaq 32(%rdx),%rdx + nop + subq $0x20,%rcx + jbe .Leven_tail + nop + jmp .Lmod_loop + +.align 32 +.Lmod_loop: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + movdqu (%rdx),%xmm9 + pxor %xmm0,%xmm8 + pshufb %xmm10,%xmm9 + movdqu 16(%rdx),%xmm3 + + pxor %xmm1,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm8,%xmm4 + pshufb %xmm10,%xmm3 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm3,%xmm5 + + movdqa %xmm0,%xmm9 + movdqa %xmm0,%xmm8 + psllq $5,%xmm0 + pxor %xmm0,%xmm8 +.byte 102,15,58,68,218,0 + psllq $1,%xmm0 + pxor %xmm8,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm8 + pslldq $8,%xmm0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pshufd $78,%xmm5,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm5,%xmm4 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,234,17 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + pxor %xmm9,%xmm0 + leaq 32(%rdx),%rdx + psrlq $1,%xmm0 +.byte 102,15,58,68,231,0 + pxor %xmm1,%xmm0 + + subq $0x20,%rcx + ja .Lmod_loop + +.Leven_tail: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + pxor %xmm0,%xmm8 + pxor %xmm1,%xmm8 + pxor %xmm8,%xmm4 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + testq %rcx,%rcx + jnz .Ldone + +.Lodd_tail: + movdqu (%rdx),%xmm8 + pshufb %xmm10,%xmm8 + pxor %xmm8,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,223,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 +.Ldone: + pshufb %xmm10,%xmm0 + movdqu %xmm0,(%rdi) + ret +.size gcm_ghash_clmul,.-gcm_ghash_clmul +.globl gcm_init_avx +.type gcm_init_avx,@function +.align 32 +gcm_init_avx: + vzeroupper + + vmovdqu (%rsi),%xmm2 + vpshufd $78,%xmm2,%xmm2 + + + vpshufd $255,%xmm2,%xmm4 + vpsrlq 
$63,%xmm2,%xmm3 + vpsllq $1,%xmm2,%xmm2 + vpxor %xmm5,%xmm5,%xmm5 + vpcmpgtd %xmm4,%xmm5,%xmm5 + vpslldq $8,%xmm3,%xmm3 + vpor %xmm3,%xmm2,%xmm2 + + + vpand .L0x1c2_polynomial(%rip),%xmm5,%xmm5 + vpxor %xmm5,%xmm2,%xmm2 + + vpunpckhqdq %xmm2,%xmm2,%xmm6 + vmovdqa %xmm2,%xmm0 + vpxor %xmm2,%xmm6,%xmm6 + movq $4,%r10 + jmp .Linit_start_avx +.align 32 +.Linit_loop_avx: + vpalignr $8,%xmm3,%xmm4,%xmm5 + vmovdqu %xmm5,-16(%rdi) + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 +.Linit_start_avx: + vmovdqa %xmm0,%xmm5 + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 + vpshufd $78,%xmm5,%xmm3 + vpshufd $78,%xmm0,%xmm4 + vpxor %xmm5,%xmm3,%xmm3 + vmovdqu %xmm5,0(%rdi) + vpxor %xmm0,%xmm4,%xmm4 + vmovdqu %xmm0,16(%rdi) + leaq 48(%rdi),%rdi + subq $1,%r10 + jnz .Linit_loop_avx + + vpalignr $8,%xmm4,%xmm3,%xmm5 + vmovdqu %xmm5,-16(%rdi) + + vzeroupper + ret +.size gcm_init_avx,.-gcm_init_avx +.globl gcm_gmult_avx +.type gcm_gmult_avx,@function +.align 32 +gcm_gmult_avx: + jmp .L_gmult_clmul +.size gcm_gmult_avx,.-gcm_gmult_avx +.globl gcm_ghash_avx +.type gcm_ghash_avx,@function +.align 32 +gcm_ghash_avx: + vzeroupper + + vmovdqu (%rdi),%xmm10 + leaq .L0x1c2_polynomial(%rip),%r10 + leaq 64(%rsi),%rsi + vmovdqu .Lbswap_mask(%rip),%xmm13 + vpshufb %xmm13,%xmm10,%xmm10 + cmpq $0x80,%rcx + jb .Lshort_avx + subq $0x80,%rcx + + vmovdqu 112(%rdx),%xmm14 + vmovdqu 0-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vmovdqu 32-64(%rsi),%xmm7 + + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm14,%xmm9,%xmm9 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 80(%rdx),%xmm14 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 48-64(%rsi),%xmm6 + vpxor %xmm14,%xmm9,%xmm9 + vmovdqu 64(%rdx),%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor 
%xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 48(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 32(%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 16(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu (%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + + leaq 128(%rdx),%rdx + cmpq $0x80,%rcx + jb .Ltail_avx + + vpxor %xmm10,%xmm15,%xmm15 + subq $0x80,%rcx + jmp .Loop8x_avx + +.align 32 +.Loop8x_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 112(%rdx),%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpxor %xmm15,%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm10 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm11 + vmovdqu 0-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm12 + vmovdqu 32-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm3,%xmm10,%xmm10 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vxorps %xmm4,%xmm11,%xmm11 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm5,%xmm12,%xmm12 + vxorps %xmm15,%xmm8,%xmm8 + + vmovdqu 80(%rdx),%xmm14 + vpxor %xmm10,%xmm12,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm11,%xmm12,%xmm12 + vpslldq $8,%xmm12,%xmm9 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vpsrldq $8,%xmm12,%xmm12 + vpxor %xmm9,%xmm10,%xmm10 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vxorps %xmm12,%xmm11,%xmm11 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 64(%rdx),%xmm15 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vxorps %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + + vmovdqu 48(%rdx),%xmm14 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor 
%xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 32(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + vxorps %xmm12,%xmm10,%xmm10 + + vmovdqu 16(%rdx),%xmm14 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vxorps %xmm11,%xmm12,%xmm12 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu (%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm12,%xmm15,%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + vpxor %xmm10,%xmm15,%xmm15 + + leaq 128(%rdx),%rdx + subq $0x80,%rcx + jnc .Loop8x_avx + + addq $0x80,%rcx + jmp .Ltail_no_xor_avx + +.align 32 +.Lshort_avx: + vmovdqu -16(%rdx,%rcx,1),%xmm14 + leaq (%rdx,%rcx,1),%rdx + vmovdqu 0-64(%rsi),%xmm6 + vmovdqu 32-64(%rsi),%xmm7 + vpshufb %xmm13,%xmm14,%xmm15 + + vmovdqa %xmm0,%xmm3 + vmovdqa %xmm1,%xmm4 + vmovdqa %xmm2,%xmm5 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -32(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -48(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 80-64(%rsi),%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -64(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -80(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 96-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 128-64(%rsi),%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -96(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq 
$0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz .Ltail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -112(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 144-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovq 184-64(%rsi),%xmm7 + subq $0x10,%rcx + jmp .Ltail_avx + +.align 32 +.Ltail_avx: + vpxor %xmm10,%xmm15,%xmm15 +.Ltail_no_xor_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + + vmovdqu (%r10),%xmm12 + + vpxor %xmm0,%xmm3,%xmm10 + vpxor %xmm1,%xmm4,%xmm11 + vpxor %xmm2,%xmm5,%xmm5 + + vpxor %xmm10,%xmm5,%xmm5 + vpxor %xmm11,%xmm5,%xmm5 + vpslldq $8,%xmm5,%xmm9 + vpsrldq $8,%xmm5,%xmm5 + vpxor %xmm9,%xmm10,%xmm10 + vpxor %xmm5,%xmm11,%xmm11 + + vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9 + vpalignr $8,%xmm10,%xmm10,%xmm10 + vpxor %xmm9,%xmm10,%xmm10 + + vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9 + vpalignr $8,%xmm10,%xmm10,%xmm10 + vpxor %xmm11,%xmm10,%xmm10 + vpxor %xmm9,%xmm10,%xmm10 + + cmpq $0,%rcx + jne .Lshort_avx + + vpshufb %xmm13,%xmm10,%xmm10 + vmovdqu %xmm10,(%rdi) + vzeroupper + ret +.size gcm_ghash_avx,.-gcm_ghash_avx +.align 64 +.Lbswap_mask: +.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +.L0x1c2_polynomial: +.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +.L7_mask: +.long 7,0,7,0 +.L7_mask_poly: +.long 7,0,450,0 +.align 64 +.type .Lrem_4bit,@object +.Lrem_4bit: +.long 0,0,0,471859200,0,943718400,0,610271232 +.long 0,1887436800,0,1822425088,0,1220542464,0,1423966208 +.long 0,3774873600,0,4246732800,0,3644850176,0,3311403008 +.long 0,2441084928,0,2376073216,0,2847932416,0,3051356160 +.type .Lrem_8bit,@object +.Lrem_8bit: +.value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E +.value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E +.value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E +.value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E +.value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E +.value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E +.value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E +.value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E +.value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE +.value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE +.value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE +.value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE +.value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E +.value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E +.value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE +.value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE +.value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E +.value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E +.value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E +.value 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E +.value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E +.value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E +.value 
0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E +.value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E +.value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE +.value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE +.value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE +.value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE +.value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E +.value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E +.value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE +.value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +.byte 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,108,46,111,114,103,62,0 +.align 64 diff --git a/crypto/aesgcm/ghash_x64_gas_macosx.s b/crypto/aesgcm/ghash_x64_gas_macosx.s new file mode 100644 index 0000000..dfd7cc9 --- /dev/null +++ b/crypto/aesgcm/ghash_x64_gas_macosx.s @@ -0,0 +1,1795 @@ +.text + + +.globl _gcm_gmult_4bit + +.p2align 4 +_gcm_gmult_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +L$gmult_prologue: + + movzbq 15(%rdi),%r8 + leaq L$rem_4bit(%rip),%r11 + xorq %rax,%rax + xorq %rbx,%rbx + movb %r8b,%al + movb %r8b,%bl + shlb $4,%al + movq $14,%rcx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + movq %r8,%rdx + jmp L$oop1 + +.p2align 4 +L$oop1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + movb (%rdi,%rcx,1),%al + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + movb %al,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + shlb $4,%al + xorq %r10,%r8 + decq %rcx + js L$break1 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + jmp L$oop1 + +.p2align 4 +L$break1: + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rax,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rax,1),%r9 + andb $0xf0,%bl + xorq (%r11,%rdx,8),%r9 + movq %r8,%rdx + xorq %r10,%r8 + + shrq $4,%r8 + andq $0xf,%rdx + movq %r9,%r10 + shrq $4,%r9 + xorq 8(%rsi,%rbx,1),%r8 + shlq $60,%r10 + xorq (%rsi,%rbx,1),%r9 + xorq %r10,%r8 + xorq (%r11,%rdx,8),%r9 + + bswapq %r8 + bswapq %r9 + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -8(%rsi),%rbx + leaq (%rsi),%rsp +L$gmult_epilogue: + ret + +.globl _gcm_ghash_4bit + +.p2align 4 +_gcm_ghash_4bit: + pushq %rbx + pushq %rbp + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + subq $280,%rsp +L$ghash_prologue: + movq %rdx,%r14 + movq %rcx,%r15 + subq $-128,%rsi + leaq 16+128(%rsp),%rbp + xorl %edx,%edx + movq 0+0-128(%rsi),%r8 + movq 0+8-128(%rsi),%rax + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq 16+0-128(%rsi),%r9 + shlb $4,%dl + movq 16+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,0(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,0(%rbp) + movq 32+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,0-128(%rbp) + movq 32+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,1(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,8(%rbp) + movq 48+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,8-128(%rbp) + movq 48+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,2(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,16(%rbp) + movq 64+0-128(%rsi),%r8 + shlb $4,%dl 
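The .Lrem_8bit table that just closed (it reappears below as L$rem_8bit; this second file is the same code assembled with Mach-O label and symbol spelling) is GF(2)-linear with basis value 0x01C2, the constant that also names .L0x1c2_polynomial. If one wanted to regenerate the .value rows, a short program along these lines would do it (illustrative only, not part of the build):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t rem[256];
        rem[0] = 0;
        rem[1] = 0x01C2;              /* reduction of one shifted-out bit */
        for (int i = 2; i < 256; i <<= 1)
            rem[i] = (uint16_t)(rem[i >> 1] << 1);
        for (int i = 1; i < 256; i++) /* extend by linearity over GF(2) */
            rem[i] = rem[i & -i] ^ rem[i & (i - 1)];
        for (int i = 0; i < 256; i++)
            printf("0x%04X%c", rem[i], (i % 8 == 7) ? '\n' : ',');
        return 0;
    }
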
+ movq %rax,16-128(%rbp) + movq 64+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,3(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,24(%rbp) + movq 80+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,24-128(%rbp) + movq 80+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,4(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,32(%rbp) + movq 96+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,32-128(%rbp) + movq 96+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,5(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,40(%rbp) + movq 112+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,40-128(%rbp) + movq 112+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,6(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,48(%rbp) + movq 128+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,48-128(%rbp) + movq 128+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,7(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,56(%rbp) + movq 144+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,56-128(%rbp) + movq 144+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,8(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,64(%rbp) + movq 160+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,64-128(%rbp) + movq 160+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,9(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,72(%rbp) + movq 176+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,72-128(%rbp) + movq 176+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,10(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,80(%rbp) + movq 192+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,80-128(%rbp) + movq 192+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,11(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,88(%rbp) + movq 208+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,88-128(%rbp) + movq 208+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,12(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,96(%rbp) + movq 224+0-128(%rsi),%r8 + shlb $4,%dl + movq %rax,96-128(%rbp) + movq 224+8-128(%rsi),%rax + shlq $60,%r10 + movb %dl,13(%rsp) + orq %r10,%rbx + movb %al,%dl + shrq $4,%rax + movq %r8,%r10 + shrq $4,%r8 + movq %r9,104(%rbp) + movq 240+0-128(%rsi),%r9 + shlb $4,%dl + movq %rbx,104-128(%rbp) + movq 240+8-128(%rsi),%rbx + shlq $60,%r10 + movb %dl,14(%rsp) + orq %r10,%rax + movb %bl,%dl + shrq $4,%rbx + movq %r9,%r10 + shrq $4,%r9 + movq %r8,112(%rbp) + shlb $4,%dl + movq %rax,112-128(%rbp) + shlq $60,%r10 + movb %dl,15(%rsp) + orq %r10,%rbx + movq %r9,120(%rbp) + movq %rbx,120-128(%rbp) + addq $-128,%rsi + movq 8(%rdi),%r8 + movq 0(%rdi),%r9 + addq %r14,%r15 + leaq L$rem_8bit(%rip),%r11 + jmp L$outer_loop +.p2align 4 +L$outer_loop: + xorq (%r14),%r9 + movq 8(%r14),%rdx + leaq 16(%r14),%r14 + xorq %r8,%rdx + movq %r9,(%rdi) + movq %rdx,8(%rdi) + shrq $32,%rdx + xorq %rax,%rax + roll $8,%edx + movb %dl,%al + movzbl %dl,%ebx + shlb $4,%al + shrl $4,%ebx + roll $8,%edx + movq 8(%rsi,%rax,1),%r8 + movq (%rsi,%rax,1),%r9 + movb %dl,%al + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + xorq %r8,%r12 + movq %r9,%r10 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb 
%dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 8(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb 
$4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl 0(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + shrl $4,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r12,2),%r12 + movzbl %dl,%ebx + shlb $4,%al + movzbq (%rsp,%rcx,1),%r13 + shrl $4,%ebx + shlq $48,%r12 + xorq %r8,%r13 + movq %r9,%r10 + xorq %r12,%r9 + shrq $8,%r8 + movzbq %r13b,%r13 + shrq $8,%r9 + xorq -128(%rbp,%rcx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rcx,8),%r9 + roll $8,%edx + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + movb %dl,%al + xorq %r10,%r8 + movzwq (%r11,%r13,2),%r13 + movzbl %dl,%ecx + shlb $4,%al + movzbq (%rsp,%rbx,1),%r12 + andl $240,%ecx + shlq $48,%r13 + xorq %r8,%r12 + movq %r9,%r10 + xorq %r13,%r9 + shrq $8,%r8 + movzbq %r12b,%r12 + movl -4(%rdi),%edx + shrq $8,%r9 + xorq -128(%rbp,%rbx,8),%r8 + shlq $56,%r10 + xorq (%rbp,%rbx,8),%r9 + movzwq (%r11,%r12,2),%r12 + xorq 8(%rsi,%rax,1),%r8 + xorq (%rsi,%rax,1),%r9 + shlq $48,%r12 + xorq %r10,%r8 + xorq %r12,%r9 + movzbq %r8b,%r13 + shrq $4,%r8 + movq %r9,%r10 + shlb $4,%r13b + shrq $4,%r9 + xorq 8(%rsi,%rcx,1),%r8 + movzwq (%r11,%r13,2),%r13 + shlq $60,%r10 + xorq (%rsi,%rcx,1),%r9 + xorq %r10,%r8 + shlq $48,%r13 + bswapq %r8 + xorq %r13,%r9 + bswapq %r9 + cmpq %r15,%r14 + jb L$outer_loop + movq %r8,8(%rdi) + movq %r9,(%rdi) + + leaq 280+48(%rsp),%rsi + movq -48(%rsi),%r15 + movq -40(%rsi),%r14 + movq -32(%rsi),%r13 + movq -24(%rsi),%r12 + movq -16(%rsi),%rbp + movq -8(%rsi),%rbx + leaq 0(%rsi),%rsp +L$ghash_epilogue: + ret + +.globl _gcm_init_clmul + +.p2align 4 +_gcm_init_clmul: +L$_init_clmul: + movdqu (%rsi),%xmm2 + pshufd $78,%xmm2,%xmm2 + + + pshufd $255,%xmm2,%xmm4 + movdqa %xmm2,%xmm3 + psllq $1,%xmm2 + pxor %xmm5,%xmm5 + psrlq $63,%xmm3 + pcmpgtd %xmm4,%xmm5 + pslldq $8,%xmm3 + por %xmm3,%xmm2 + + + pand L$0x1c2_polynomial(%rip),%xmm5 + pxor %xmm5,%xmm2 + + + pshufd $78,%xmm2,%xmm6 + movdqa %xmm2,%xmm0 + pxor %xmm2,%xmm6 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm2,%xmm3 + 
pshufd $78,%xmm0,%xmm4 + pxor %xmm2,%xmm3 + movdqu %xmm2,0(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,16(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,32(%rdi) + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm5 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,222,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufd $78,%xmm5,%xmm3 + pshufd $78,%xmm0,%xmm4 + pxor %xmm5,%xmm3 + movdqu %xmm5,48(%rdi) + pxor %xmm0,%xmm4 + movdqu %xmm0,64(%rdi) +.byte 102,15,58,15,227,8 + movdqu %xmm4,80(%rdi) + ret + +.globl _gcm_gmult_clmul + +.p2align 4 +_gcm_gmult_clmul: +L$_gmult_clmul: + movdqu (%rdi),%xmm0 + movdqa L$bswap_mask(%rip),%xmm5 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm4 + pshufb %xmm5,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,220,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + pshufb %xmm5,%xmm0 + movdqu %xmm0,(%rdi) + ret + +.globl _gcm_ghash_clmul + +.p2align 5 +_gcm_ghash_clmul: +L$_ghash_clmul: + movdqa L$bswap_mask(%rip),%xmm10 + + movdqu (%rdi),%xmm0 + movdqu (%rsi),%xmm2 + movdqu 32(%rsi),%xmm7 + pshufb %xmm10,%xmm0 + + subq $0x10,%rcx + jz L$odd_tail + + movdqu 16(%rsi),%xmm6 + + + cmpq $0x30,%rcx + jb L$skip4x + + + + + + subq $0x30,%rcx + movq $0xA040608020C0E000,%rax + movdqu 48(%rsi),%xmm14 + movdqu 64(%rsi),%xmm15 + + + + + movdqu 48(%rdx),%xmm3 + movdqu 32(%rdx),%xmm11 + pshufb %xmm10,%xmm3 + pshufb %xmm10,%xmm11 + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm11,%xmm12 +.byte 102,68,15,58,68,222,0 +.byte 102,68,15,58,68,238,17 +.byte 102,68,15,58,68,231,16 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + xorps %xmm12,%xmm4 + + movdqu 16(%rdx),%xmm11 + movdqu 
0(%rdx),%xmm8 + pshufb %xmm10,%xmm11 + pshufb %xmm10,%xmm8 + movdqa %xmm11,%xmm13 + pshufd $78,%xmm11,%xmm12 + pxor %xmm8,%xmm0 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 +.byte 102,69,15,58,68,238,17 +.byte 102,68,15,58,68,231,0 + xorps %xmm11,%xmm3 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jc L$tail4x + + jmp L$mod4_loop +.p2align 5 +L$mod4_loop: +.byte 102,65,15,58,68,199,0 + xorps %xmm12,%xmm4 + movdqu 48(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,65,15,58,68,207,17 + xorps %xmm3,%xmm0 + movdqu 32(%rdx),%xmm3 + movdqa %xmm11,%xmm13 +.byte 102,68,15,58,68,199,16 + pshufd $78,%xmm11,%xmm12 + xorps %xmm5,%xmm1 + pxor %xmm11,%xmm12 + pshufb %xmm10,%xmm3 + movups 32(%rsi),%xmm7 + xorps %xmm4,%xmm8 +.byte 102,68,15,58,68,218,0 + pshufd $78,%xmm3,%xmm4 + + pxor %xmm0,%xmm8 + movdqa %xmm3,%xmm5 + pxor %xmm1,%xmm8 + pxor %xmm3,%xmm4 + movdqa %xmm8,%xmm9 +.byte 102,68,15,58,68,234,17 + pslldq $8,%xmm8 + psrldq $8,%xmm9 + pxor %xmm8,%xmm0 + movdqa L$7_mask(%rip),%xmm8 + pxor %xmm9,%xmm1 +.byte 102,76,15,110,200 + + pand %xmm0,%xmm8 + pshufb %xmm8,%xmm9 + pxor %xmm0,%xmm9 +.byte 102,68,15,58,68,231,0 + psllq $57,%xmm9 + movdqa %xmm9,%xmm8 + pslldq $8,%xmm9 +.byte 102,15,58,68,222,0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + movdqu 0(%rdx),%xmm8 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,238,17 + xorps %xmm11,%xmm3 + movdqu 16(%rdx),%xmm11 + pshufb %xmm10,%xmm11 +.byte 102,15,58,68,231,16 + xorps %xmm13,%xmm5 + movups 80(%rsi),%xmm7 + pshufb %xmm10,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + + movdqa %xmm11,%xmm13 + pxor %xmm12,%xmm4 + pshufd $78,%xmm11,%xmm12 + pxor %xmm9,%xmm0 + pxor %xmm8,%xmm1 + pxor %xmm11,%xmm12 +.byte 102,69,15,58,68,222,0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + movdqa %xmm0,%xmm1 +.byte 102,69,15,58,68,238,17 + xorps %xmm11,%xmm3 + pshufd $78,%xmm0,%xmm8 + pxor %xmm0,%xmm8 + +.byte 102,68,15,58,68,231,0 + xorps %xmm13,%xmm5 + + leaq 64(%rdx),%rdx + subq $0x40,%rcx + jnc L$mod4_loop + +L$tail4x: +.byte 102,65,15,58,68,199,0 +.byte 102,65,15,58,68,207,17 +.byte 102,68,15,58,68,199,16 + xorps %xmm12,%xmm4 + xorps %xmm3,%xmm0 + xorps %xmm5,%xmm1 + pxor %xmm0,%xmm1 + pxor %xmm4,%xmm8 + + pxor %xmm1,%xmm8 + pxor %xmm0,%xmm1 + + movdqa %xmm8,%xmm9 + psrldq $8,%xmm8 + pslldq $8,%xmm9 + pxor %xmm8,%xmm1 + pxor %xmm9,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + addq $0x40,%rcx + jz L$done + movdqu 32(%rsi),%xmm7 + subq $0x10,%rcx + jz L$odd_tail +L$skip4x: + + + + + + movdqu (%rdx),%xmm8 + movdqu 16(%rdx),%xmm3 + pshufb %xmm10,%xmm8 + pshufb %xmm10,%xmm3 + pxor %xmm8,%xmm0 + + movdqa %xmm3,%xmm5 + pshufd $78,%xmm3,%xmm4 + pxor %xmm3,%xmm4 +.byte 102,15,58,68,218,0 +.byte 102,15,58,68,234,17 +.byte 102,15,58,68,231,0 + + leaq 32(%rdx),%rdx + nop + subq $0x20,%rcx + jbe L$even_tail + nop + jmp L$mod_loop + +.p2align 5 +L$mod_loop: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + movdqu (%rdx),%xmm9 + pxor %xmm0,%xmm8 + pshufb %xmm10,%xmm9 + 
movdqu 16(%rdx),%xmm3 + + pxor %xmm1,%xmm8 + pxor %xmm9,%xmm1 + pxor %xmm8,%xmm4 + pshufb %xmm10,%xmm3 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm3,%xmm5 + + movdqa %xmm0,%xmm9 + movdqa %xmm0,%xmm8 + psllq $5,%xmm0 + pxor %xmm0,%xmm8 +.byte 102,15,58,68,218,0 + psllq $1,%xmm0 + pxor %xmm8,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm8 + pslldq $8,%xmm0 + psrldq $8,%xmm8 + pxor %xmm9,%xmm0 + pshufd $78,%xmm5,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm5,%xmm4 + + movdqa %xmm0,%xmm9 + psrlq $1,%xmm0 +.byte 102,15,58,68,234,17 + pxor %xmm9,%xmm1 + pxor %xmm0,%xmm9 + psrlq $5,%xmm0 + pxor %xmm9,%xmm0 + leaq 32(%rdx),%rdx + psrlq $1,%xmm0 +.byte 102,15,58,68,231,0 + pxor %xmm1,%xmm0 + + subq $0x20,%rcx + ja L$mod_loop + +L$even_tail: + movdqa %xmm0,%xmm1 + movdqa %xmm4,%xmm8 + pshufd $78,%xmm0,%xmm4 + pxor %xmm0,%xmm4 + +.byte 102,15,58,68,198,0 +.byte 102,15,58,68,206,17 +.byte 102,15,58,68,231,16 + + pxor %xmm3,%xmm0 + pxor %xmm5,%xmm1 + pxor %xmm0,%xmm8 + pxor %xmm1,%xmm8 + pxor %xmm8,%xmm4 + movdqa %xmm4,%xmm8 + psrldq $8,%xmm8 + pslldq $8,%xmm4 + pxor %xmm8,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 + testq %rcx,%rcx + jnz L$done + +L$odd_tail: + movdqu (%rdx),%xmm8 + pshufb %xmm10,%xmm8 + pxor %xmm8,%xmm0 + movdqa %xmm0,%xmm1 + pshufd $78,%xmm0,%xmm3 + pxor %xmm0,%xmm3 +.byte 102,15,58,68,194,0 +.byte 102,15,58,68,202,17 +.byte 102,15,58,68,223,0 + pxor %xmm0,%xmm3 + pxor %xmm1,%xmm3 + + movdqa %xmm3,%xmm4 + psrldq $8,%xmm3 + pslldq $8,%xmm4 + pxor %xmm3,%xmm1 + pxor %xmm4,%xmm0 + + movdqa %xmm0,%xmm4 + movdqa %xmm0,%xmm3 + psllq $5,%xmm0 + pxor %xmm0,%xmm3 + psllq $1,%xmm0 + pxor %xmm3,%xmm0 + psllq $57,%xmm0 + movdqa %xmm0,%xmm3 + pslldq $8,%xmm0 + psrldq $8,%xmm3 + pxor %xmm4,%xmm0 + pxor %xmm3,%xmm1 + + + movdqa %xmm0,%xmm4 + psrlq $1,%xmm0 + pxor %xmm4,%xmm1 + pxor %xmm0,%xmm4 + psrlq $5,%xmm0 + pxor %xmm4,%xmm0 + psrlq $1,%xmm0 + pxor %xmm1,%xmm0 +L$done: + pshufb %xmm10,%xmm0 + movdqu %xmm0,(%rdi) + ret + +.globl _gcm_init_avx + +.p2align 5 +_gcm_init_avx: + vzeroupper + + vmovdqu (%rsi),%xmm2 + vpshufd $78,%xmm2,%xmm2 + + + vpshufd $255,%xmm2,%xmm4 + vpsrlq $63,%xmm2,%xmm3 + vpsllq $1,%xmm2,%xmm2 + vpxor %xmm5,%xmm5,%xmm5 + vpcmpgtd %xmm4,%xmm5,%xmm5 + vpslldq $8,%xmm3,%xmm3 + vpor %xmm3,%xmm2,%xmm2 + + + vpand L$0x1c2_polynomial(%rip),%xmm5,%xmm5 + vpxor %xmm5,%xmm2,%xmm2 + + vpunpckhqdq %xmm2,%xmm2,%xmm6 + vmovdqa %xmm2,%xmm0 + vpxor %xmm2,%xmm6,%xmm6 + movq $4,%r10 + jmp L$init_start_avx +.p2align 5 +L$init_loop_avx: + vpalignr $8,%xmm3,%xmm4,%xmm5 + vmovdqu %xmm5,-16(%rdi) + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor 
%xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 +L$init_start_avx: + vmovdqa %xmm0,%xmm5 + vpunpckhqdq %xmm0,%xmm0,%xmm3 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm2,%xmm0,%xmm1 + vpclmulqdq $0x00,%xmm2,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm3,%xmm3 + vpxor %xmm0,%xmm1,%xmm4 + vpxor %xmm4,%xmm3,%xmm3 + + vpslldq $8,%xmm3,%xmm4 + vpsrldq $8,%xmm3,%xmm3 + vpxor %xmm4,%xmm0,%xmm0 + vpxor %xmm3,%xmm1,%xmm1 + vpsllq $57,%xmm0,%xmm3 + vpsllq $62,%xmm0,%xmm4 + vpxor %xmm3,%xmm4,%xmm4 + vpsllq $63,%xmm0,%xmm3 + vpxor %xmm3,%xmm4,%xmm4 + vpslldq $8,%xmm4,%xmm3 + vpsrldq $8,%xmm4,%xmm4 + vpxor %xmm3,%xmm0,%xmm0 + vpxor %xmm4,%xmm1,%xmm1 + + vpsrlq $1,%xmm0,%xmm4 + vpxor %xmm0,%xmm1,%xmm1 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $5,%xmm4,%xmm4 + vpxor %xmm4,%xmm0,%xmm0 + vpsrlq $1,%xmm0,%xmm0 + vpxor %xmm1,%xmm0,%xmm0 + vpshufd $78,%xmm5,%xmm3 + vpshufd $78,%xmm0,%xmm4 + vpxor %xmm5,%xmm3,%xmm3 + vmovdqu %xmm5,0(%rdi) + vpxor %xmm0,%xmm4,%xmm4 + vmovdqu %xmm0,16(%rdi) + leaq 48(%rdi),%rdi + subq $1,%r10 + jnz L$init_loop_avx + + vpalignr $8,%xmm4,%xmm3,%xmm5 + vmovdqu %xmm5,-16(%rdi) + + vzeroupper + ret + +.globl _gcm_gmult_avx + +.p2align 5 +_gcm_gmult_avx: + jmp L$_gmult_clmul + +.globl _gcm_ghash_avx + +.p2align 5 +_gcm_ghash_avx: + vzeroupper + + vmovdqu (%rdi),%xmm10 + leaq L$0x1c2_polynomial(%rip),%r10 + leaq 64(%rsi),%rsi + vmovdqu L$bswap_mask(%rip),%xmm13 + vpshufb %xmm13,%xmm10,%xmm10 + cmpq $0x80,%rcx + jb L$short_avx + subq $0x80,%rcx + + vmovdqu 112(%rdx),%xmm14 + vmovdqu 0-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vmovdqu 32-64(%rsi),%xmm7 + + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm14,%xmm9,%xmm9 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 80(%rdx),%xmm14 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 48-64(%rsi),%xmm6 + vpxor %xmm14,%xmm9,%xmm9 + vmovdqu 64(%rdx),%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 48(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 32(%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + + vmovdqu 16(%rdx),%xmm14 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm4,%xmm1,%xmm1 + vpshufb %xmm13,%xmm14,%xmm14 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + 
vpxor %xmm5,%xmm2,%xmm2 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu (%rdx),%xmm15 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm1,%xmm4,%xmm4 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + + leaq 128(%rdx),%rdx + cmpq $0x80,%rcx + jb L$tail_avx + + vpxor %xmm10,%xmm15,%xmm15 + subq $0x80,%rcx + jmp L$oop8x_avx + +.p2align 5 +L$oop8x_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vmovdqu 112(%rdx),%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpxor %xmm15,%xmm8,%xmm8 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm10 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm11 + vmovdqu 0-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm12 + vmovdqu 32-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + + vmovdqu 96(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpxor %xmm3,%xmm10,%xmm10 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vxorps %xmm4,%xmm11,%xmm11 + vmovdqu 16-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm5,%xmm12,%xmm12 + vxorps %xmm15,%xmm8,%xmm8 + + vmovdqu 80(%rdx),%xmm14 + vpxor %xmm10,%xmm12,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpxor %xmm11,%xmm12,%xmm12 + vpslldq $8,%xmm12,%xmm9 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vpsrldq $8,%xmm12,%xmm12 + vpxor %xmm9,%xmm10,%xmm10 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm14 + vxorps %xmm12,%xmm11,%xmm11 + vpxor %xmm1,%xmm4,%xmm4 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 80-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 64(%rdx),%xmm15 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vxorps %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + + vmovdqu 48(%rdx),%xmm14 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 96-64(%rsi),%xmm6 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 128-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu 32(%rdx),%xmm15 + vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpxor %xmm3,%xmm0,%xmm0 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm4,%xmm1,%xmm1 + vpclmulqdq $0x00,%xmm7,%xmm9,%xmm2 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm5,%xmm2,%xmm2 + vxorps %xmm12,%xmm10,%xmm10 + + vmovdqu 16(%rdx),%xmm14 + vpalignr $8,%xmm10,%xmm10,%xmm12 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm3 + vpshufb %xmm13,%xmm14,%xmm14 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm4 + vmovdqu 144-64(%rsi),%xmm6 + vpclmulqdq $0x10,(%r10),%xmm10,%xmm10 + vxorps %xmm11,%xmm12,%xmm12 + vpunpckhqdq %xmm14,%xmm14,%xmm9 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x10,%xmm7,%xmm8,%xmm5 + vmovdqu 176-64(%rsi),%xmm7 + vpxor %xmm14,%xmm9,%xmm9 + vpxor %xmm2,%xmm5,%xmm5 + + vmovdqu (%rdx),%xmm15 + 
vpclmulqdq $0x00,%xmm6,%xmm14,%xmm0 + vpshufb %xmm13,%xmm15,%xmm15 + vpclmulqdq $0x11,%xmm6,%xmm14,%xmm1 + vmovdqu 160-64(%rsi),%xmm6 + vpxor %xmm12,%xmm15,%xmm15 + vpclmulqdq $0x10,%xmm7,%xmm9,%xmm2 + vpxor %xmm10,%xmm15,%xmm15 + + leaq 128(%rdx),%rdx + subq $0x80,%rcx + jnc L$oop8x_avx + + addq $0x80,%rcx + jmp L$tail_no_xor_avx + +.p2align 5 +L$short_avx: + vmovdqu -16(%rdx,%rcx,1),%xmm14 + leaq (%rdx,%rcx,1),%rdx + vmovdqu 0-64(%rsi),%xmm6 + vmovdqu 32-64(%rsi),%xmm7 + vpshufb %xmm13,%xmm14,%xmm15 + + vmovdqa %xmm0,%xmm3 + vmovdqa %xmm1,%xmm4 + vmovdqa %xmm2,%xmm5 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -32(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 16-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -48(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 48-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 80-64(%rsi),%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -64(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 64-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -80(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 96-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovdqu 128-64(%rsi),%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -96(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 112-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vpsrldq $8,%xmm7,%xmm7 + subq $0x10,%rcx + jz L$tail_avx + + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vmovdqu -112(%rdx),%xmm14 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vmovdqu 144-64(%rsi),%xmm6 + vpshufb %xmm13,%xmm14,%xmm15 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + vmovq 184-64(%rsi),%xmm7 + subq $0x10,%rcx + jmp L$tail_avx + +.p2align 5 +L$tail_avx: + vpxor %xmm10,%xmm15,%xmm15 +L$tail_no_xor_avx: + vpunpckhqdq %xmm15,%xmm15,%xmm8 + vpxor %xmm0,%xmm3,%xmm3 + vpclmulqdq $0x00,%xmm6,%xmm15,%xmm0 + vpxor %xmm15,%xmm8,%xmm8 + vpxor %xmm1,%xmm4,%xmm4 + vpclmulqdq $0x11,%xmm6,%xmm15,%xmm1 + vpxor %xmm2,%xmm5,%xmm5 + vpclmulqdq $0x00,%xmm7,%xmm8,%xmm2 + + vmovdqu (%r10),%xmm12 + + vpxor %xmm0,%xmm3,%xmm10 + vpxor %xmm1,%xmm4,%xmm11 + vpxor %xmm2,%xmm5,%xmm5 + + vpxor %xmm10,%xmm5,%xmm5 + vpxor %xmm11,%xmm5,%xmm5 + vpslldq $8,%xmm5,%xmm9 + vpsrldq 
$8,%xmm5,%xmm5
+ vpxor %xmm9,%xmm10,%xmm10
+ vpxor %xmm5,%xmm11,%xmm11
+
+ vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9
+ vpalignr $8,%xmm10,%xmm10,%xmm10
+ vpxor %xmm9,%xmm10,%xmm10
+
+ vpclmulqdq $0x10,%xmm12,%xmm10,%xmm9
+ vpalignr $8,%xmm10,%xmm10,%xmm10
+ vpxor %xmm11,%xmm10,%xmm10
+ vpxor %xmm9,%xmm10,%xmm10
+
+ cmpq $0,%rcx
+ jne L$short_avx
+
+ vpshufb %xmm13,%xmm10,%xmm10
+ vmovdqu %xmm10,(%rdi)
+ vzeroupper
+ ret
+
+.p2align 6
+L$bswap_mask:
+.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
+L$0x1c2_polynomial:
+.byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2
+L$7_mask:
+.long 7,0,7,0
+L$7_mask_poly:
+.long 7,0,450,0
+.p2align 6
+
+L$rem_4bit:
+.long 0,0,0,471859200,0,943718400,0,610271232
+.long 0,1887436800,0,1822425088,0,1220542464,0,1423966208
+.long 0,3774873600,0,4246732800,0,3644850176,0,3311403008
+.long 0,2441084928,0,2376073216,0,2847932416,0,3051356160
+
+L$rem_8bit:
+.value 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E
+.value 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E
+.value 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E
+.value 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E
+.value 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E
+.value 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E
+.value 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E
+.value 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E
+.value 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE
+.value 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE
+.value 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE
+.value 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE
+.value 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E
+.value 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E
+.value 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE
+.value 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE
+.value 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E
+.value 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E
+.value 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E
+.value 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E
+.value 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E
+.value 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E
+.value 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E
+.value 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E
+.value 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE
+.value 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE
+.value 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE
+.value 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE
+.value 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E
+.value 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E
+.value 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE
+.value 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE
+
+.byte 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,108,46,111,114,103,62,0
+.p2align 6
diff --git a/crypto/aesgcm/ghash_x64_nasm.asm b/crypto/aesgcm/ghash_x64_nasm.asm
new file mode 100644
index 0000000..22a8020
--- /dev/null
+++ b/crypto/aesgcm/ghash_x64_nasm.asm
@@ -0,0 +1,2029 @@
+default rel
+%define XMMWORD
+%define YMMWORD
+%define ZMMWORD
+section .text code align=64
+
+;.extern OPENSSL_ia32cap_P
+
+global gcm_gmult_4bit
+
+ALIGN 16
+gcm_gmult_4bit:
+ mov QWORD[8+rsp],rdi ;WIN64 prologue
+ mov QWORD[16+rsp],rsi
+ mov rax,rsp
+$L$SEH_begin_gcm_gmult_4bit: + mov rdi,rcx + mov rsi,rdx + + + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + sub rsp,280 +$L$gmult_prologue: + + movzx r8,BYTE[15+rdi] + lea r11,[$L$rem_4bit] + xor rax,rax + xor rbx,rbx + mov al,r8b + mov bl,r8b + shl al,4 + mov rcx,14 + mov r8,QWORD[8+rax*1+rsi] + mov r9,QWORD[rax*1+rsi] + and bl,0xf0 + mov rdx,r8 + jmp NEAR $L$oop1 + +ALIGN 16 +$L$oop1: + shr r8,4 + and rdx,0xf + mov r10,r9 + mov al,BYTE[rcx*1+rdi] + shr r9,4 + xor r8,QWORD[8+rbx*1+rsi] + shl r10,60 + xor r9,QWORD[rbx*1+rsi] + mov bl,al + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + shl al,4 + xor r8,r10 + dec rcx + js NEAR $L$break1 + + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rax*1+rsi] + shl r10,60 + xor r9,QWORD[rax*1+rsi] + and bl,0xf0 + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + xor r8,r10 + jmp NEAR $L$oop1 + +ALIGN 16 +$L$break1: + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rax*1+rsi] + shl r10,60 + xor r9,QWORD[rax*1+rsi] + and bl,0xf0 + xor r9,QWORD[rdx*8+r11] + mov rdx,r8 + xor r8,r10 + + shr r8,4 + and rdx,0xf + mov r10,r9 + shr r9,4 + xor r8,QWORD[8+rbx*1+rsi] + shl r10,60 + xor r9,QWORD[rbx*1+rsi] + xor r8,r10 + xor r9,QWORD[rdx*8+r11] + + bswap r8 + bswap r9 + mov QWORD[8+rdi],r8 + mov QWORD[rdi],r9 + + lea rsi,[((280+48))+rsp] + mov rbx,QWORD[((-8))+rsi] + lea rsp,[rsi] +$L$gmult_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret +$L$SEH_end_gcm_gmult_4bit: +global gcm_ghash_4bit + +ALIGN 16 +gcm_ghash_4bit: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_gcm_ghash_4bit: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + sub rsp,280 +$L$ghash_prologue: + mov r14,rdx + mov r15,rcx + sub rsi,-128 + lea rbp,[((16+128))+rsp] + xor edx,edx + mov r8,QWORD[((0+0-128))+rsi] + mov rax,QWORD[((0+8-128))+rsi] + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov r9,QWORD[((16+0-128))+rsi] + shl dl,4 + mov rbx,QWORD[((16+8-128))+rsi] + shl r10,60 + mov BYTE[rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[rbp],r8 + mov r8,QWORD[((32+0-128))+rsi] + shl dl,4 + mov QWORD[((0-128))+rbp],rax + mov rax,QWORD[((32+8-128))+rsi] + shl r10,60 + mov BYTE[1+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[8+rbp],r9 + mov r9,QWORD[((48+0-128))+rsi] + shl dl,4 + mov QWORD[((8-128))+rbp],rbx + mov rbx,QWORD[((48+8-128))+rsi] + shl r10,60 + mov BYTE[2+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[16+rbp],r8 + mov r8,QWORD[((64+0-128))+rsi] + shl dl,4 + mov QWORD[((16-128))+rbp],rax + mov rax,QWORD[((64+8-128))+rsi] + shl r10,60 + mov BYTE[3+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[24+rbp],r9 + mov r9,QWORD[((80+0-128))+rsi] + shl dl,4 + mov QWORD[((24-128))+rbp],rbx + mov rbx,QWORD[((80+8-128))+rsi] + shl r10,60 + mov BYTE[4+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[32+rbp],r8 + mov r8,QWORD[((96+0-128))+rsi] + shl dl,4 + mov QWORD[((32-128))+rbp],rax + mov rax,QWORD[((96+8-128))+rsi] + shl r10,60 + mov BYTE[5+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[40+rbp],r9 + mov r9,QWORD[((112+0-128))+rsi] + shl dl,4 + mov QWORD[((40-128))+rbp],rbx + mov rbx,QWORD[((112+8-128))+rsi] + shl r10,60 + mov BYTE[6+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[48+rbp],r8 + 
mov r8,QWORD[((128+0-128))+rsi] + shl dl,4 + mov QWORD[((48-128))+rbp],rax + mov rax,QWORD[((128+8-128))+rsi] + shl r10,60 + mov BYTE[7+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[56+rbp],r9 + mov r9,QWORD[((144+0-128))+rsi] + shl dl,4 + mov QWORD[((56-128))+rbp],rbx + mov rbx,QWORD[((144+8-128))+rsi] + shl r10,60 + mov BYTE[8+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[64+rbp],r8 + mov r8,QWORD[((160+0-128))+rsi] + shl dl,4 + mov QWORD[((64-128))+rbp],rax + mov rax,QWORD[((160+8-128))+rsi] + shl r10,60 + mov BYTE[9+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[72+rbp],r9 + mov r9,QWORD[((176+0-128))+rsi] + shl dl,4 + mov QWORD[((72-128))+rbp],rbx + mov rbx,QWORD[((176+8-128))+rsi] + shl r10,60 + mov BYTE[10+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[80+rbp],r8 + mov r8,QWORD[((192+0-128))+rsi] + shl dl,4 + mov QWORD[((80-128))+rbp],rax + mov rax,QWORD[((192+8-128))+rsi] + shl r10,60 + mov BYTE[11+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[88+rbp],r9 + mov r9,QWORD[((208+0-128))+rsi] + shl dl,4 + mov QWORD[((88-128))+rbp],rbx + mov rbx,QWORD[((208+8-128))+rsi] + shl r10,60 + mov BYTE[12+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[96+rbp],r8 + mov r8,QWORD[((224+0-128))+rsi] + shl dl,4 + mov QWORD[((96-128))+rbp],rax + mov rax,QWORD[((224+8-128))+rsi] + shl r10,60 + mov BYTE[13+rsp],dl + or rbx,r10 + mov dl,al + shr rax,4 + mov r10,r8 + shr r8,4 + mov QWORD[104+rbp],r9 + mov r9,QWORD[((240+0-128))+rsi] + shl dl,4 + mov QWORD[((104-128))+rbp],rbx + mov rbx,QWORD[((240+8-128))+rsi] + shl r10,60 + mov BYTE[14+rsp],dl + or rax,r10 + mov dl,bl + shr rbx,4 + mov r10,r9 + shr r9,4 + mov QWORD[112+rbp],r8 + shl dl,4 + mov QWORD[((112-128))+rbp],rax + shl r10,60 + mov BYTE[15+rsp],dl + or rbx,r10 + mov QWORD[120+rbp],r9 + mov QWORD[((120-128))+rbp],rbx + add rsi,-128 + mov r8,QWORD[8+rdi] + mov r9,QWORD[rdi] + add r15,r14 + lea r11,[$L$rem_8bit] + jmp NEAR $L$outer_loop +ALIGN 16 +$L$outer_loop: + xor r9,QWORD[r14] + mov rdx,QWORD[8+r14] + lea r14,[16+r14] + xor rdx,r8 + mov QWORD[rdi],r9 + mov QWORD[8+rdi],rdx + shr rdx,32 + xor rax,rax + rol edx,8 + mov al,dl + movzx ebx,dl + shl al,4 + shr ebx,4 + rol edx,8 + mov r8,QWORD[8+rax*1+rsi] + mov r9,QWORD[rax*1+rsi] + mov al,dl + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + xor r12,r8 + mov r10,r9 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[8+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + 
xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[4+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 + mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + shr ecx,4 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r12,WORD[r12*2+r11] + movzx ebx,dl + shl al,4 + movzx r13,BYTE[rcx*1+rsp] + shr ebx,4 + shl r12,48 + xor r13,r8 
+ mov r10,r9 + xor r9,r12 + shr r8,8 + movzx r13,r13b + shr r9,8 + xor r8,QWORD[((-128))+rcx*8+rbp] + shl r10,56 + xor r9,QWORD[rcx*8+rbp] + rol edx,8 + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + mov al,dl + xor r8,r10 + movzx r13,WORD[r13*2+r11] + movzx ecx,dl + shl al,4 + movzx r12,BYTE[rbx*1+rsp] + and ecx,240 + shl r13,48 + xor r12,r8 + mov r10,r9 + xor r9,r13 + shr r8,8 + movzx r12,r12b + mov edx,DWORD[((-4))+rdi] + shr r9,8 + xor r8,QWORD[((-128))+rbx*8+rbp] + shl r10,56 + xor r9,QWORD[rbx*8+rbp] + movzx r12,WORD[r12*2+r11] + xor r8,QWORD[8+rax*1+rsi] + xor r9,QWORD[rax*1+rsi] + shl r12,48 + xor r8,r10 + xor r9,r12 + movzx r13,r8b + shr r8,4 + mov r10,r9 + shl r13b,4 + shr r9,4 + xor r8,QWORD[8+rcx*1+rsi] + movzx r13,WORD[r13*2+r11] + shl r10,60 + xor r9,QWORD[rcx*1+rsi] + xor r8,r10 + shl r13,48 + bswap r8 + xor r9,r13 + bswap r9 + cmp r14,r15 + jb NEAR $L$outer_loop + mov QWORD[8+rdi],r8 + mov QWORD[rdi],r9 + + lea rsi,[((280+48))+rsp] + mov r15,QWORD[((-48))+rsi] + mov r14,QWORD[((-40))+rsi] + mov r13,QWORD[((-32))+rsi] + mov r12,QWORD[((-24))+rsi] + mov rbp,QWORD[((-16))+rsi] + mov rbx,QWORD[((-8))+rsi] + lea rsp,[rsi] +$L$ghash_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + ret +$L$SEH_end_gcm_ghash_4bit: +global gcm_init_clmul + +ALIGN 16 +gcm_init_clmul: +$L$_init_clmul: +$L$SEH_begin_gcm_init_clmul: + +DB 0x48,0x83,0xec,0x18 +DB 0x0f,0x29,0x34,0x24 + movdqu xmm2,XMMWORD[rdx] + pshufd xmm2,xmm2,78 + + + pshufd xmm4,xmm2,255 + movdqa xmm3,xmm2 + psllq xmm2,1 + pxor xmm5,xmm5 + psrlq xmm3,63 + pcmpgtd xmm5,xmm4 + pslldq xmm3,8 + por xmm2,xmm3 + + + pand xmm5,XMMWORD[$L$0x1c2_polynomial] + pxor xmm2,xmm5 + + + pshufd xmm6,xmm2,78 + movdqa xmm0,xmm2 + pxor xmm6,xmm2 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufd xmm3,xmm2,78 + pshufd xmm4,xmm0,78 + pxor xmm3,xmm2 + movdqu XMMWORD[rcx],xmm2 + pxor xmm4,xmm0 + movdqu XMMWORD[16+rcx],xmm0 +DB 102,15,58,15,227,8 + movdqu XMMWORD[32+rcx],xmm4 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + movdqa xmm5,xmm0 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,222,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + 
pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufd xmm3,xmm5,78 + pshufd xmm4,xmm0,78 + pxor xmm3,xmm5 + movdqu XMMWORD[48+rcx],xmm5 + pxor xmm4,xmm0 + movdqu XMMWORD[64+rcx],xmm0 +DB 102,15,58,15,227,8 + movdqu XMMWORD[80+rcx],xmm4 + movaps xmm6,XMMWORD[rsp] + lea rsp,[24+rsp] +$L$SEH_end_gcm_init_clmul: + ret + +global gcm_gmult_clmul + +ALIGN 16 +gcm_gmult_clmul: +$L$_gmult_clmul: + movdqu xmm0,XMMWORD[rcx] + movdqa xmm5,XMMWORD[$L$bswap_mask] + movdqu xmm2,XMMWORD[rdx] + movdqu xmm4,XMMWORD[32+rdx] + pshufb xmm0,xmm5 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,220,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + pshufb xmm0,xmm5 + movdqu XMMWORD[rcx],xmm0 + ret + +global gcm_ghash_clmul + +ALIGN 32 +gcm_ghash_clmul: +$L$_ghash_clmul: + lea rax,[((-136))+rsp] +$L$SEH_begin_gcm_ghash_clmul: + +DB 0x48,0x8d,0x60,0xe0 +DB 0x0f,0x29,0x70,0xe0 +DB 0x0f,0x29,0x78,0xf0 +DB 0x44,0x0f,0x29,0x00 +DB 0x44,0x0f,0x29,0x48,0x10 +DB 0x44,0x0f,0x29,0x50,0x20 +DB 0x44,0x0f,0x29,0x58,0x30 +DB 0x44,0x0f,0x29,0x60,0x40 +DB 0x44,0x0f,0x29,0x68,0x50 +DB 0x44,0x0f,0x29,0x70,0x60 +DB 0x44,0x0f,0x29,0x78,0x70 + movdqa xmm10,XMMWORD[$L$bswap_mask] + + movdqu xmm0,XMMWORD[rcx] + movdqu xmm2,XMMWORD[rdx] + movdqu xmm7,XMMWORD[32+rdx] + pshufb xmm0,xmm10 + + sub r9,0x10 + jz NEAR $L$odd_tail + + movdqu xmm6,XMMWORD[16+rdx] +; leaq OPENSSL_ia32cap_P(%rip),%rax +; mov 4(%rax),%eax + cmp r9,0x30 + jb NEAR $L$skip4x + +; and $71303168,%eax +; cmp $4194304,%eax +; je .Lskip4x + + sub r9,0x30 + mov rax,0xA040608020C0E000 + movdqu xmm14,XMMWORD[48+rdx] + movdqu xmm15,XMMWORD[64+rdx] + + + + + movdqu xmm3,XMMWORD[48+r8] + movdqu xmm11,XMMWORD[32+r8] + pshufb xmm3,xmm10 + pshufb xmm11,xmm10 + movdqa xmm5,xmm3 + pshufd xmm4,xmm3,78 + pxor xmm4,xmm3 +DB 102,15,58,68,218,0 +DB 102,15,58,68,234,17 +DB 102,15,58,68,231,0 + + movdqa xmm13,xmm11 + pshufd xmm12,xmm11,78 + pxor xmm12,xmm11 +DB 102,68,15,58,68,222,0 +DB 102,68,15,58,68,238,17 +DB 102,68,15,58,68,231,16 + xorps xmm3,xmm11 + xorps xmm5,xmm13 + movups xmm7,XMMWORD[80+rdx] + xorps xmm4,xmm12 + + movdqu xmm11,XMMWORD[16+r8] + movdqu xmm8,XMMWORD[r8] + pshufb xmm11,xmm10 + pshufb xmm8,xmm10 + movdqa xmm13,xmm11 + pshufd xmm12,xmm11,78 + pxor xmm0,xmm8 + pxor xmm12,xmm11 +DB 102,69,15,58,68,222,0 + movdqa xmm1,xmm0 + pshufd xmm8,xmm0,78 + pxor xmm8,xmm0 +DB 102,69,15,58,68,238,17 +DB 102,68,15,58,68,231,0 + xorps xmm3,xmm11 + xorps xmm5,xmm13 + + lea r8,[64+r8] + sub r9,0x40 + jc NEAR $L$tail4x + + jmp NEAR $L$mod4_loop +ALIGN 32 +$L$mod4_loop: +DB 102,65,15,58,68,199,0 + xorps xmm4,xmm12 + movdqu xmm11,XMMWORD[48+r8] + pshufb xmm11,xmm10 +DB 102,65,15,58,68,207,17 + xorps xmm0,xmm3 + movdqu xmm3,XMMWORD[32+r8] + movdqa xmm13,xmm11 +DB 102,68,15,58,68,199,16 + pshufd xmm12,xmm11,78 + xorps xmm1,xmm5 + pxor xmm12,xmm11 + pshufb xmm3,xmm10 + movups xmm7,XMMWORD[32+rdx] + xorps xmm8,xmm4 +DB 102,68,15,58,68,218,0 + pshufd xmm4,xmm3,78 + + pxor xmm8,xmm0 + movdqa 
xmm5,xmm3 + pxor xmm8,xmm1 + pxor xmm4,xmm3 + movdqa xmm9,xmm8 +DB 102,68,15,58,68,234,17 + pslldq xmm8,8 + psrldq xmm9,8 + pxor xmm0,xmm8 + movdqa xmm8,XMMWORD[$L$7_mask] + pxor xmm1,xmm9 +DB 102,76,15,110,200 + + pand xmm8,xmm0 + pshufb xmm9,xmm8 + pxor xmm9,xmm0 +DB 102,68,15,58,68,231,0 + psllq xmm9,57 + movdqa xmm8,xmm9 + pslldq xmm9,8 +DB 102,15,58,68,222,0 + psrldq xmm8,8 + pxor xmm0,xmm9 + pxor xmm1,xmm8 + movdqu xmm8,XMMWORD[r8] + + movdqa xmm9,xmm0 + psrlq xmm0,1 +DB 102,15,58,68,238,17 + xorps xmm3,xmm11 + movdqu xmm11,XMMWORD[16+r8] + pshufb xmm11,xmm10 +DB 102,15,58,68,231,16 + xorps xmm5,xmm13 + movups xmm7,XMMWORD[80+rdx] + pshufb xmm8,xmm10 + pxor xmm1,xmm9 + pxor xmm9,xmm0 + psrlq xmm0,5 + + movdqa xmm13,xmm11 + pxor xmm4,xmm12 + pshufd xmm12,xmm11,78 + pxor xmm0,xmm9 + pxor xmm1,xmm8 + pxor xmm12,xmm11 +DB 102,69,15,58,68,222,0 + psrlq xmm0,1 + pxor xmm0,xmm1 + movdqa xmm1,xmm0 +DB 102,69,15,58,68,238,17 + xorps xmm3,xmm11 + pshufd xmm8,xmm0,78 + pxor xmm8,xmm0 + +DB 102,68,15,58,68,231,0 + xorps xmm5,xmm13 + + lea r8,[64+r8] + sub r9,0x40 + jnc NEAR $L$mod4_loop + +$L$tail4x: +DB 102,65,15,58,68,199,0 +DB 102,65,15,58,68,207,17 +DB 102,68,15,58,68,199,16 + xorps xmm4,xmm12 + xorps xmm0,xmm3 + xorps xmm1,xmm5 + pxor xmm1,xmm0 + pxor xmm8,xmm4 + + pxor xmm8,xmm1 + pxor xmm1,xmm0 + + movdqa xmm9,xmm8 + psrldq xmm8,8 + pslldq xmm9,8 + pxor xmm1,xmm8 + pxor xmm0,xmm9 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + add r9,0x40 + jz NEAR $L$done + movdqu xmm7,XMMWORD[32+rdx] + sub r9,0x10 + jz NEAR $L$odd_tail +$L$skip4x: + + + + + + movdqu xmm8,XMMWORD[r8] + movdqu xmm3,XMMWORD[16+r8] + pshufb xmm8,xmm10 + pshufb xmm3,xmm10 + pxor xmm0,xmm8 + + movdqa xmm5,xmm3 + pshufd xmm4,xmm3,78 + pxor xmm4,xmm3 +DB 102,15,58,68,218,0 +DB 102,15,58,68,234,17 +DB 102,15,58,68,231,0 + + lea r8,[32+r8] + nop + sub r9,0x20 + jbe NEAR $L$even_tail + nop + jmp NEAR $L$mod_loop + +ALIGN 32 +$L$mod_loop: + movdqa xmm1,xmm0 + movdqa xmm8,xmm4 + pshufd xmm4,xmm0,78 + pxor xmm4,xmm0 + +DB 102,15,58,68,198,0 +DB 102,15,58,68,206,17 +DB 102,15,58,68,231,16 + + pxor xmm0,xmm3 + pxor xmm1,xmm5 + movdqu xmm9,XMMWORD[r8] + pxor xmm8,xmm0 + pshufb xmm9,xmm10 + movdqu xmm3,XMMWORD[16+r8] + + pxor xmm8,xmm1 + pxor xmm1,xmm9 + pxor xmm4,xmm8 + pshufb xmm3,xmm10 + movdqa xmm8,xmm4 + psrldq xmm8,8 + pslldq xmm4,8 + pxor xmm1,xmm8 + pxor xmm0,xmm4 + + movdqa xmm5,xmm3 + + movdqa xmm9,xmm0 + movdqa xmm8,xmm0 + psllq xmm0,5 + pxor xmm8,xmm0 +DB 102,15,58,68,218,0 + psllq xmm0,1 + pxor xmm0,xmm8 + psllq xmm0,57 + movdqa xmm8,xmm0 + pslldq xmm0,8 + psrldq xmm8,8 + pxor xmm0,xmm9 + pshufd xmm4,xmm5,78 + pxor xmm1,xmm8 + pxor xmm4,xmm5 + + movdqa xmm9,xmm0 + psrlq xmm0,1 +DB 102,15,58,68,234,17 + pxor xmm1,xmm9 + pxor xmm9,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm9 + lea r8,[32+r8] + psrlq xmm0,1 +DB 102,15,58,68,231,0 + pxor xmm0,xmm1 + + sub r9,0x20 + ja NEAR $L$mod_loop + +$L$even_tail: + movdqa xmm1,xmm0 + movdqa xmm8,xmm4 + pshufd xmm4,xmm0,78 + pxor xmm4,xmm0 + +DB 102,15,58,68,198,0 +DB 102,15,58,68,206,17 +DB 102,15,58,68,231,16 + + pxor xmm0,xmm3 + pxor xmm1,xmm5 + pxor xmm8,xmm0 + pxor xmm8,xmm1 + pxor xmm4,xmm8 + movdqa xmm8,xmm4 + psrldq xmm8,8 + pslldq xmm4,8 + pxor xmm1,xmm8 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa 
xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 + test r9,r9 + jnz NEAR $L$done + +$L$odd_tail: + movdqu xmm8,XMMWORD[r8] + pshufb xmm8,xmm10 + pxor xmm0,xmm8 + movdqa xmm1,xmm0 + pshufd xmm3,xmm0,78 + pxor xmm3,xmm0 +DB 102,15,58,68,194,0 +DB 102,15,58,68,202,17 +DB 102,15,58,68,223,0 + pxor xmm3,xmm0 + pxor xmm3,xmm1 + + movdqa xmm4,xmm3 + psrldq xmm3,8 + pslldq xmm4,8 + pxor xmm1,xmm3 + pxor xmm0,xmm4 + + movdqa xmm4,xmm0 + movdqa xmm3,xmm0 + psllq xmm0,5 + pxor xmm3,xmm0 + psllq xmm0,1 + pxor xmm0,xmm3 + psllq xmm0,57 + movdqa xmm3,xmm0 + pslldq xmm0,8 + psrldq xmm3,8 + pxor xmm0,xmm4 + pxor xmm1,xmm3 + + + movdqa xmm4,xmm0 + psrlq xmm0,1 + pxor xmm1,xmm4 + pxor xmm4,xmm0 + psrlq xmm0,5 + pxor xmm0,xmm4 + psrlq xmm0,1 + pxor xmm0,xmm1 +$L$done: + pshufb xmm0,xmm10 + movdqu XMMWORD[rcx],xmm0 + movaps xmm6,XMMWORD[rsp] + movaps xmm7,XMMWORD[16+rsp] + movaps xmm8,XMMWORD[32+rsp] + movaps xmm9,XMMWORD[48+rsp] + movaps xmm10,XMMWORD[64+rsp] + movaps xmm11,XMMWORD[80+rsp] + movaps xmm12,XMMWORD[96+rsp] + movaps xmm13,XMMWORD[112+rsp] + movaps xmm14,XMMWORD[128+rsp] + movaps xmm15,XMMWORD[144+rsp] + lea rsp,[168+rsp] +$L$SEH_end_gcm_ghash_clmul: + ret + +global gcm_init_avx + +ALIGN 32 +gcm_init_avx: +$L$SEH_begin_gcm_init_avx: + +DB 0x48,0x83,0xec,0x18 +DB 0x0f,0x29,0x34,0x24 + vzeroupper + + vmovdqu xmm2,XMMWORD[rdx] + vpshufd xmm2,xmm2,78 + + + vpshufd xmm4,xmm2,255 + vpsrlq xmm3,xmm2,63 + vpsllq xmm2,xmm2,1 + vpxor xmm5,xmm5,xmm5 + vpcmpgtd xmm5,xmm5,xmm4 + vpslldq xmm3,xmm3,8 + vpor xmm2,xmm2,xmm3 + + + vpand xmm5,xmm5,XMMWORD[$L$0x1c2_polynomial] + vpxor xmm2,xmm2,xmm5 + + vpunpckhqdq xmm6,xmm2,xmm2 + vmovdqa xmm0,xmm2 + vpxor xmm6,xmm6,xmm2 + mov r10,4 + jmp NEAR $L$init_start_avx +ALIGN 32 +$L$init_loop_avx: + vpalignr xmm5,xmm4,xmm3,8 + vmovdqu XMMWORD[(-16)+rcx],xmm5 + vpunpckhqdq xmm3,xmm0,xmm0 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm1,xmm0,xmm2,0x11 + vpclmulqdq xmm0,xmm0,xmm2,0x00 + vpclmulqdq xmm3,xmm3,xmm6,0x00 + vpxor xmm4,xmm1,xmm0 + vpxor xmm3,xmm3,xmm4 + + vpslldq xmm4,xmm3,8 + vpsrldq xmm3,xmm3,8 + vpxor xmm0,xmm0,xmm4 + vpxor xmm1,xmm1,xmm3 + vpsllq xmm3,xmm0,57 + vpsllq xmm4,xmm0,62 + vpxor xmm4,xmm4,xmm3 + vpsllq xmm3,xmm0,63 + vpxor xmm4,xmm4,xmm3 + vpslldq xmm3,xmm4,8 + vpsrldq xmm4,xmm4,8 + vpxor xmm0,xmm0,xmm3 + vpxor xmm1,xmm1,xmm4 + + vpsrlq xmm4,xmm0,1 + vpxor xmm1,xmm1,xmm0 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm4,xmm4,5 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm0,xmm0,1 + vpxor xmm0,xmm0,xmm1 +$L$init_start_avx: + vmovdqa xmm5,xmm0 + vpunpckhqdq xmm3,xmm0,xmm0 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm1,xmm0,xmm2,0x11 + vpclmulqdq xmm0,xmm0,xmm2,0x00 + vpclmulqdq xmm3,xmm3,xmm6,0x00 + vpxor xmm4,xmm1,xmm0 + vpxor xmm3,xmm3,xmm4 + + vpslldq xmm4,xmm3,8 + vpsrldq xmm3,xmm3,8 + vpxor xmm0,xmm0,xmm4 + vpxor xmm1,xmm1,xmm3 + vpsllq xmm3,xmm0,57 + vpsllq xmm4,xmm0,62 + vpxor xmm4,xmm4,xmm3 + vpsllq xmm3,xmm0,63 + vpxor xmm4,xmm4,xmm3 + vpslldq xmm3,xmm4,8 + vpsrldq xmm4,xmm4,8 + vpxor xmm0,xmm0,xmm3 + vpxor xmm1,xmm1,xmm4 + + vpsrlq xmm4,xmm0,1 + vpxor xmm1,xmm1,xmm0 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm4,xmm4,5 + vpxor xmm0,xmm0,xmm4 + vpsrlq xmm0,xmm0,1 + vpxor xmm0,xmm0,xmm1 + vpshufd xmm3,xmm5,78 + vpshufd xmm4,xmm0,78 + vpxor xmm3,xmm3,xmm5 + vmovdqu XMMWORD[rcx],xmm5 + vpxor xmm4,xmm4,xmm0 + vmovdqu XMMWORD[16+rcx],xmm0 
+ lea rcx,[48+rcx] + sub r10,1 + jnz NEAR $L$init_loop_avx + + vpalignr xmm5,xmm3,xmm4,8 + vmovdqu XMMWORD[(-16)+rcx],xmm5 + + vzeroupper + movaps xmm6,XMMWORD[rsp] + lea rsp,[24+rsp] +$L$SEH_end_gcm_init_avx: + ret + +global gcm_gmult_avx + +ALIGN 32 +gcm_gmult_avx: + jmp NEAR $L$_gmult_clmul + +global gcm_ghash_avx + +ALIGN 32 +gcm_ghash_avx: + lea rax,[((-136))+rsp] +$L$SEH_begin_gcm_ghash_avx: + +DB 0x48,0x8d,0x60,0xe0 +DB 0x0f,0x29,0x70,0xe0 +DB 0x0f,0x29,0x78,0xf0 +DB 0x44,0x0f,0x29,0x00 +DB 0x44,0x0f,0x29,0x48,0x10 +DB 0x44,0x0f,0x29,0x50,0x20 +DB 0x44,0x0f,0x29,0x58,0x30 +DB 0x44,0x0f,0x29,0x60,0x40 +DB 0x44,0x0f,0x29,0x68,0x50 +DB 0x44,0x0f,0x29,0x70,0x60 +DB 0x44,0x0f,0x29,0x78,0x70 + vzeroupper + + vmovdqu xmm10,XMMWORD[rcx] + lea r10,[$L$0x1c2_polynomial] + lea rdx,[64+rdx] + vmovdqu xmm13,XMMWORD[$L$bswap_mask] + vpshufb xmm10,xmm10,xmm13 + cmp r9,0x80 + jb NEAR $L$short_avx + sub r9,0x80 + + vmovdqu xmm14,XMMWORD[112+r8] + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vpshufb xmm14,xmm14,xmm13 + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + + vpunpckhqdq xmm9,xmm14,xmm14 + vmovdqu xmm15,XMMWORD[96+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm9,xmm9,xmm14 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vmovdqu xmm14,XMMWORD[80+r8] + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vmovdqu xmm15,XMMWORD[64+r8] + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + + vpshufb xmm15,xmm15,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[48+r8] + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm1,xmm1,xmm4 + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpxor xmm2,xmm2,xmm5 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[32+r8] + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[16+r8] + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm1,xmm1,xmm4 + vpshufb xmm14,xmm14,xmm13 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpxor xmm2,xmm2,xmm5 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((176-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[r8] + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm4,xmm4,xmm1 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((160-64))+rdx] + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm9,xmm7,0x10 + + lea r8,[128+r8] + cmp r9,0x80 + jb NEAR $L$tail_avx + + vpxor xmm15,xmm15,xmm10 + sub r9,0x80 + jmp NEAR $L$oop8x_avx + +ALIGN 32 +$L$oop8x_avx: + vpunpckhqdq xmm8,xmm15,xmm15 + vmovdqu xmm14,XMMWORD[112+r8] + vpxor xmm3,xmm3,xmm0 + vpxor 
xmm8,xmm8,xmm15 + vpclmulqdq xmm10,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm11,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm12,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + vpxor xmm9,xmm9,xmm14 + + vmovdqu xmm15,XMMWORD[96+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpxor xmm10,xmm10,xmm3 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vxorps xmm11,xmm11,xmm4 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm12,xmm12,xmm5 + vxorps xmm8,xmm8,xmm15 + + vmovdqu xmm14,XMMWORD[80+r8] + vpxor xmm12,xmm12,xmm10 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpxor xmm12,xmm12,xmm11 + vpslldq xmm9,xmm12,8 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vpsrldq xmm12,xmm12,8 + vpxor xmm10,xmm10,xmm9 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpshufb xmm14,xmm14,xmm13 + vxorps xmm11,xmm11,xmm12 + vpxor xmm4,xmm4,xmm1 + vpunpckhqdq xmm9,xmm14,xmm14 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[64+r8] + vpalignr xmm12,xmm10,xmm10,8 + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm1,xmm1,xmm4 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vxorps xmm8,xmm8,xmm15 + vpxor xmm2,xmm2,xmm5 + + vmovdqu xmm14,XMMWORD[48+r8] + vpclmulqdq xmm10,xmm10,XMMWORD[r10],0x10 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[32+r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpxor xmm0,xmm0,xmm3 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm1,xmm1,xmm4 + vpclmulqdq xmm2,xmm9,xmm7,0x00 + vpxor xmm8,xmm8,xmm15 + vpxor xmm2,xmm2,xmm5 + vxorps xmm10,xmm10,xmm12 + + vmovdqu xmm14,XMMWORD[16+r8] + vpalignr xmm12,xmm10,xmm10,8 + vpclmulqdq xmm3,xmm15,xmm6,0x00 + vpshufb xmm14,xmm14,xmm13 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm4,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpclmulqdq xmm10,xmm10,XMMWORD[r10],0x10 + vxorps xmm12,xmm12,xmm11 + vpunpckhqdq xmm9,xmm14,xmm14 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm5,xmm8,xmm7,0x10 + vmovdqu xmm7,XMMWORD[((176-64))+rdx] + vpxor xmm9,xmm9,xmm14 + vpxor xmm5,xmm5,xmm2 + + vmovdqu xmm15,XMMWORD[r8] + vpclmulqdq xmm0,xmm14,xmm6,0x00 + vpshufb xmm15,xmm15,xmm13 + vpclmulqdq xmm1,xmm14,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((160-64))+rdx] + vpxor xmm15,xmm15,xmm12 + vpclmulqdq xmm2,xmm9,xmm7,0x10 + vpxor xmm15,xmm15,xmm10 + + lea r8,[128+r8] + sub r9,0x80 + jnc NEAR $L$oop8x_avx + + add r9,0x80 + jmp NEAR $L$tail_no_xor_avx + +ALIGN 32 +$L$short_avx: + vmovdqu xmm14,XMMWORD[((-16))+r9*1+r8] + lea r8,[r9*1+r8] + vmovdqu xmm6,XMMWORD[((0-64))+rdx] + vmovdqu xmm7,XMMWORD[((32-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + + vmovdqa xmm3,xmm0 + vmovdqa xmm4,xmm1 + vmovdqa xmm5,xmm2 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-32))+r8] + vpxor 
xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((16-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-48))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((48-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((80-64))+rdx] + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-64))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((64-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-80))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((96-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovdqu xmm7,XMMWORD[((128-64))+rdx] + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-96))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((112-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vpsrldq xmm7,xmm7,8 + sub r9,0x10 + jz NEAR $L$tail_avx + + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vmovdqu xmm14,XMMWORD[((-112))+r8] + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vmovdqu xmm6,XMMWORD[((144-64))+rdx] + vpshufb xmm15,xmm14,xmm13 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + vmovq xmm7,QWORD[((184-64))+rdx] + sub r9,0x10 + jmp NEAR $L$tail_avx + +ALIGN 32 +$L$tail_avx: + vpxor xmm15,xmm15,xmm10 +$L$tail_no_xor_avx: + vpunpckhqdq xmm8,xmm15,xmm15 + vpxor xmm3,xmm3,xmm0 + vpclmulqdq xmm0,xmm15,xmm6,0x00 + vpxor xmm8,xmm8,xmm15 + vpxor xmm4,xmm4,xmm1 + vpclmulqdq xmm1,xmm15,xmm6,0x11 + vpxor xmm5,xmm5,xmm2 + vpclmulqdq xmm2,xmm8,xmm7,0x00 + + vmovdqu xmm12,XMMWORD[r10] + + vpxor xmm10,xmm3,xmm0 + vpxor xmm11,xmm4,xmm1 + vpxor xmm5,xmm5,xmm2 + + vpxor xmm5,xmm5,xmm10 + vpxor xmm5,xmm5,xmm11 + vpslldq xmm9,xmm5,8 + vpsrldq xmm5,xmm5,8 + vpxor xmm10,xmm10,xmm9 + vpxor xmm11,xmm11,xmm5 + + vpclmulqdq xmm9,xmm10,xmm12,0x10 + vpalignr xmm10,xmm10,xmm10,8 + vpxor xmm10,xmm10,xmm9 + + vpclmulqdq xmm9,xmm10,xmm12,0x10 + vpalignr xmm10,xmm10,xmm10,8 + vpxor xmm10,xmm10,xmm11 + vpxor xmm10,xmm10,xmm9 + + cmp r9,0 + jne NEAR $L$short_avx + + vpshufb xmm10,xmm10,xmm13 + vmovdqu XMMWORD[rcx],xmm10 + vzeroupper + movaps xmm6,XMMWORD[rsp] + movaps xmm7,XMMWORD[16+rsp] + movaps xmm8,XMMWORD[32+rsp] + movaps xmm9,XMMWORD[48+rsp] + movaps xmm10,XMMWORD[64+rsp] + movaps xmm11,XMMWORD[80+rsp] + movaps xmm12,XMMWORD[96+rsp] + movaps xmm13,XMMWORD[112+rsp] + movaps xmm14,XMMWORD[128+rsp] + movaps xmm15,XMMWORD[144+rsp] + lea rsp,[168+rsp] +$L$SEH_end_gcm_ghash_avx: + ret + +ALIGN 64 +$L$bswap_mask: +DB 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 +$L$0x1c2_polynomial: +DB 
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2 +$L$7_mask: + DD 7,0,7,0 +$L$7_mask_poly: + DD 7,0,450,0 +ALIGN 64 + +$L$rem_4bit: + DD 0,0,0,471859200,0,943718400,0,610271232 + DD 0,1887436800,0,1822425088,0,1220542464,0,1423966208 + DD 0,3774873600,0,4246732800,0,3644850176,0,3311403008 + DD 0,2441084928,0,2376073216,0,2847932416,0,3051356160 + +$L$rem_8bit: + DW 0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E + DW 0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E + DW 0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E + DW 0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E + DW 0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E + DW 0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E + DW 0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E + DW 0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E + DW 0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE + DW 0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE + DW 0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE + DW 0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE + DW 0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E + DW 0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E + DW 0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE + DW 0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE + DW 0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E + DW 0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E + DW 0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E + DW 0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E + DW 0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E + DW 0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E + DW 0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E + DW 0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E + DW 0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE + DW 0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE + DW 0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE + DW 0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE + DW 0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E + DW 0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E + DW 0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE + DW 0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE + +DB 71,72,65,83,72,32,102,111,114,32,120,56,54,95,54,52 +DB 44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32 +DB 60,97,112,112,114,111,64,108,46,111,114,103,62,0 +ALIGN 64 +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$in_prologue + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$in_prologue + + lea rax,[((48+280))+rax] + + mov rbx,QWORD[((-8))+rax] + mov rbp,QWORD[((-16))+rax] + mov r12,QWORD[((-24))+rax] + mov r13,QWORD[((-32))+rax] + mov r14,QWORD[((-40))+rax] + mov r15,QWORD[((-48))+rax] + mov QWORD[144+r8],rbx + mov QWORD[160+r8],rbp + mov QWORD[216+r8],r12 + mov QWORD[224+r8],r13 + mov QWORD[232+r8],r14 + mov QWORD[240+r8],r15 + +$L$in_prologue: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov 
r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + ret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_gcm_gmult_4bit wrt ..imagebase + DD $L$SEH_end_gcm_gmult_4bit wrt ..imagebase + DD $L$SEH_info_gcm_gmult_4bit wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_4bit wrt ..imagebase + DD $L$SEH_end_gcm_ghash_4bit wrt ..imagebase + DD $L$SEH_info_gcm_ghash_4bit wrt ..imagebase + + DD $L$SEH_begin_gcm_init_clmul wrt ..imagebase + DD $L$SEH_end_gcm_init_clmul wrt ..imagebase + DD $L$SEH_info_gcm_init_clmul wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_end_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_info_gcm_ghash_clmul wrt ..imagebase + DD $L$SEH_begin_gcm_init_avx wrt ..imagebase + DD $L$SEH_end_gcm_init_avx wrt ..imagebase + DD $L$SEH_info_gcm_init_clmul wrt ..imagebase + + DD $L$SEH_begin_gcm_ghash_avx wrt ..imagebase + DD $L$SEH_end_gcm_ghash_avx wrt ..imagebase + DD $L$SEH_info_gcm_ghash_clmul wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_gcm_gmult_4bit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$gmult_prologue wrt ..imagebase,$L$gmult_epilogue wrt ..imagebase +$L$SEH_info_gcm_ghash_4bit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$ghash_prologue wrt ..imagebase,$L$ghash_epilogue wrt ..imagebase +$L$SEH_info_gcm_init_clmul: +DB 0x01,0x08,0x03,0x00 +DB 0x08,0x68,0x00,0x00 +DB 0x04,0x22,0x00,0x00 +$L$SEH_info_gcm_ghash_clmul: +DB 0x01,0x33,0x16,0x00 +DB 0x33,0xf8,0x09,0x00 +DB 0x2e,0xe8,0x08,0x00 +DB 0x29,0xd8,0x07,0x00 +DB 0x24,0xc8,0x06,0x00 +DB 0x1f,0xb8,0x05,0x00 +DB 0x1a,0xa8,0x04,0x00 +DB 0x15,0x98,0x03,0x00 +DB 0x10,0x88,0x02,0x00 +DB 0x0c,0x78,0x01,0x00 +DB 0x08,0x68,0x00,0x00 +DB 0x04,0x01,0x15,0x00 diff --git a/crypto/aesgcm/ghashp8-ppc.pl b/crypto/aesgcm/ghashp8-ppc.pl new file mode 100644 index 0000000..c46cdb5 --- /dev/null +++ b/crypto/aesgcm/ghashp8-ppc.pl @@ -0,0 +1,670 @@ +#! /usr/bin/env perl +# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# GHASH for PowerISA v2.07. +# +# July 2014 +# +# Accurate performance measurements are problematic, because it's +# always a virtualized setup with a possibly throttled processor. +# Relative comparison is therefore more informative. This initial +# version is ~2.1x slower than hardware-assisted AES-128-CTR, ~12x +# faster than "4-bit" integer-only compiler-generated 64-bit code. +# "Initial version" means that there is room for further improvement.
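For orientation, every gcm_init_*/gcm_gmult_*/gcm_ghash_* routine in this commit implements the same primitive: the GHASH state update Xi = (Xi xor Ii)*H in GF(2^128) with the GCM polynomial. A minimal bit-serial C sketch of that update, per the NIST SP 800-38D definition (the u128 type and the gf128_mul/ghash_ref names are local to this sketch, not part of the source):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;  /* big-endian halves of a 16-byte block */

/* Bit-serial multiply in GF(2^128), GCM bit order; R = 0xE1 || 0^120. */
static u128 gf128_mul(u128 X, u128 Y) {
  u128 Z = { 0, 0 }, V = Y;
  for (int i = 0; i < 128; i++) {
    uint64_t xbit = (i < 64) ? (X.hi >> (63 - i)) & 1 : (X.lo >> (127 - i)) & 1;
    if (xbit) { Z.hi ^= V.hi; Z.lo ^= V.lo; }  /* Z ^= V when bit i of X is set */
    uint64_t lsb = V.lo & 1;
    V.lo = (V.lo >> 1) | (V.hi << 63);         /* V >>= 1 */
    V.hi >>= 1;
    if (lsb) V.hi ^= 0xE100000000000000ULL;    /* reduce mod x^128+x^7+x^2+x+1 */
  }
  return Z;
}

/* Reference GHASH pass: Xi = (Xi ^ I[k]) * H for each 16-byte block. */
static void ghash_ref(u128 *Xi, const u128 *inp, size_t blocks, u128 H) {
  for (size_t k = 0; k < blocks; k++) {
    Xi->hi ^= inp[k].hi;
    Xi->lo ^= inp[k].lo;
    *Xi = gf128_mul(*Xi, H);
  }
}

The 4-bit table code, the PCLMULQDQ/AVX paths above, and the POWER8 and ARMv8 modules below are all vectorized optimizations of exactly this loop.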
+ +# May 2016 +# +# 2x aggregated reduction improves performance by 50% (resulting +# performance on POWER8 is 1 cycle per processed byte), and 4x +# aggregated reduction - by 170% or 2.7x (resulting in 0.55 cpb). + +$flavour=shift; +$output =shift; + +if ($flavour =~ /64/) { + $SIZE_T=8; + $LRSAVE=2*$SIZE_T; + $STU="stdu"; + $POP="ld"; + $PUSH="std"; + $UCMP="cmpld"; + $SHRI="srdi"; +} elsif ($flavour =~ /32/) { + $SIZE_T=4; + $LRSAVE=$SIZE_T; + $STU="stwu"; + $POP="lwz"; + $PUSH="stw"; + $UCMP="cmplw"; + $SHRI="srwi"; +} else { die "nonsense $flavour"; } + +$sp="r1"; +$FRAME=6*$SIZE_T+13*16; # 13*16 is for v20-v31 offload + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}ppc-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../../perlasm/ppc-xlate.pl" and -f $xlate) or +die "can't locate ppc-xlate.pl"; + +open STDOUT,"| $^X $xlate $flavour $output" || die "can't call $xlate: $!"; + +my ($Xip,$Htbl,$inp,$len)=map("r$_",(3..6)); # argument block + +my ($Xl,$Xm,$Xh,$IN)=map("v$_",(0..3)); +my ($zero,$t0,$t1,$t2,$xC2,$H,$Hh,$Hl,$lemask)=map("v$_",(4..12)); +my ($Xl1,$Xm1,$Xh1,$IN1,$H2,$H2h,$H2l)=map("v$_",(13..19)); +my $vrsave="r12"; + +$code=<<___; +.machine "any" + +.text + +.globl .gcm_init_p8 +.align 5 +.gcm_init_p8: + li r0,-4096 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $H,0,r4 # load H + + vspltisb $xC2,-16 # 0xf0 + vspltisb $t0,1 # one + vaddubm $xC2,$xC2,$xC2 # 0xe0 + vxor $zero,$zero,$zero + vor $xC2,$xC2,$t0 # 0xe1 + vsldoi $xC2,$xC2,$zero,15 # 0xe1... + vsldoi $t1,$zero,$t0,1 # ...1 + vaddubm $xC2,$xC2,$xC2 # 0xc2... + vspltisb $t2,7 + vor $xC2,$xC2,$t1 # 0xc2....01 + vspltb $t1,$H,0 # most significant byte + vsl $H,$H,$t0 # H<<=1 + vsrab $t1,$t1,$t2 # broadcast carry bit + vand $t1,$t1,$xC2 + vxor $IN,$H,$t1 # twisted H + + vsldoi $H,$IN,$IN,8 # twist even more ... + vsldoi $xC2,$zero,$xC2,8 # 0xc2.0 + vsldoi $Hl,$zero,$H,8 # ... 
and split + vsldoi $Hh,$H,$zero,8 + + stvx_u $xC2,0,r3 # save pre-computed table + stvx_u $Hl,r8,r3 + li r8,0x40 + stvx_u $H, r9,r3 + li r9,0x50 + stvx_u $Hh,r10,r3 + li r10,0x60 + + vpmsumd $Xl,$IN,$Hl # H.lo·H.lo + vpmsumd $Xm,$IN,$H # H.hi·H.lo+H.lo·H.hi + vpmsumd $Xh,$IN,$Hh # H.hi·H.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $IN1,$Xl,$t1 + + vsldoi $H2,$IN1,$IN1,8 + vsldoi $H2l,$zero,$H2,8 + vsldoi $H2h,$H2,$zero,8 + + stvx_u $H2l,r8,r3 # save H^2 + li r8,0x70 + stvx_u $H2,r9,r3 + li r9,0x80 + stvx_u $H2h,r10,r3 + li r10,0x90 +___ +{ +my ($t4,$t5,$t6) = ($Hl,$H,$Hh); +$code.=<<___; + vpmsumd $Xl,$IN,$H2l # H.lo·H^2.lo + vpmsumd $Xl1,$IN1,$H2l # H^2.lo·H^2.lo + vpmsumd $Xm,$IN,$H2 # H.hi·H^2.lo+H.lo·H^2.hi + vpmsumd $Xm1,$IN1,$H2 # H^2.hi·H^2.lo+H^2.lo·H^2.hi + vpmsumd $Xh,$IN,$H2h # H.hi·H^2.hi + vpmsumd $Xh1,$IN1,$H2h # H^2.hi·H^2.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + vpmsumd $t6,$Xl1,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vsldoi $t4,$Xm1,$zero,8 + vsldoi $t5,$zero,$Xm1,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + vxor $Xl1,$Xl1,$t4 + vxor $Xh1,$Xh1,$t5 + + vsldoi $Xl,$Xl,$Xl,8 + vsldoi $Xl1,$Xl1,$Xl1,8 + vxor $Xl,$Xl,$t2 + vxor $Xl1,$Xl1,$t6 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vsldoi $t5,$Xl1,$Xl1,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vpmsumd $Xl1,$Xl1,$xC2 + vxor $t1,$t1,$Xh + vxor $t5,$t5,$Xh1 + vxor $Xl,$Xl,$t1 + vxor $Xl1,$Xl1,$t5 + + vsldoi $H,$Xl,$Xl,8 + vsldoi $H2,$Xl1,$Xl1,8 + vsldoi $Hl,$zero,$H,8 + vsldoi $Hh,$H,$zero,8 + vsldoi $H2l,$zero,$H2,8 + vsldoi $H2h,$H2,$zero,8 + + stvx_u $Hl,r8,r3 # save H^3 + li r8,0xa0 + stvx_u $H,r9,r3 + li r9,0xb0 + stvx_u $Hh,r10,r3 + li r10,0xc0 + stvx_u $H2l,r8,r3 # save H^4 + stvx_u $H2,r9,r3 + stvx_u $H2h,r10,r3 + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,2,0 + .long 0 +.size .gcm_init_p8,.-.gcm_init_p8 +___ +} +$code.=<<___; +.globl .gcm_gmult_p8 +.align 5 +.gcm_gmult_p8: + lis r0,0xfff8 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $IN,0,$Xip # load Xi + + lvx_u $Hl,r8,$Htbl # load pre-computed table + le?lvsl $lemask,r0,r0 + lvx_u $H, r9,$Htbl + le?vspltisb $t0,0x07 + lvx_u $Hh,r10,$Htbl + le?vxor $lemask,$lemask,$t0 + lvx_u $xC2,0,$Htbl + le?vperm $IN,$IN,$IN,$lemask + vxor $zero,$zero,$zero + + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo + vpmsumd $Xm,$IN,$H # H.hi·Xi.lo+H.lo·Xi.hi + vpmsumd $Xh,$IN,$Hh # H.hi·Xi.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $Xl,$Xl,$t1 + + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,2,0 + .long 0 +.size .gcm_gmult_p8,.-.gcm_gmult_p8 + +.globl .gcm_ghash_p8 +.align 5 +.gcm_ghash_p8: + li r0,-4096 + li r8,0x10 + mfspr $vrsave,256 + li r9,0x20 + mtspr 256,r0 + li r10,0x30 + lvx_u $Xl,0,$Xip # load Xi + + lvx_u $Hl,r8,$Htbl # load pre-computed table + li r8,0x40 + le?lvsl $lemask,r0,r0 + lvx_u $H, r9,$Htbl + li r9,0x50 + le?vspltisb $t0,0x07 + lvx_u $Hh,r10,$Htbl + li r10,0x60 + le?vxor $lemask,$lemask,$t0 + lvx_u $xC2,0,$Htbl + le?vperm 
$Xl,$Xl,$Xl,$lemask + vxor $zero,$zero,$zero + + ${UCMP}i $len,64 + bge Lgcm_ghash_p8_4x + + lvx_u $IN,0,$inp + addi $inp,$inp,16 + subic. $len,$len,16 + le?vperm $IN,$IN,$IN,$lemask + vxor $IN,$IN,$Xl + beq Lshort + + lvx_u $H2l,r8,$Htbl # load H^2 + li r8,16 + lvx_u $H2, r9,$Htbl + add r9,$inp,$len # end of input + lvx_u $H2h,r10,$Htbl + be?b Loop_2x + +.align 5 +Loop_2x: + lvx_u $IN1,0,$inp + le?vperm $IN1,$IN1,$IN1,$lemask + + subic $len,$len,32 + vpmsumd $Xl,$IN,$H2l # H^2.lo·Xi.lo + vpmsumd $Xl1,$IN1,$Hl # H.lo·Xi+1.lo + subfe r0,r0,r0 # borrow?-1:0 + vpmsumd $Xm,$IN,$H2 # H^2.hi·Xi.lo+H^2.lo·Xi.hi + vpmsumd $Xm1,$IN1,$H # H.hi·Xi+1.lo+H.lo·Xi+1.hi + and r0,r0,$len + vpmsumd $Xh,$IN,$H2h # H^2.hi·Xi.hi + vpmsumd $Xh1,$IN1,$Hh # H.hi·Xi+1.hi + add $inp,$inp,r0 + + vxor $Xl,$Xl,$Xl1 + vxor $Xm,$Xm,$Xm1 + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xh,$Xh,$Xh1 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + lvx_u $IN,r8,$inp + addi $inp,$inp,32 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + le?vperm $IN,$IN,$IN,$lemask + vxor $t1,$t1,$Xh + vxor $IN,$IN,$t1 + vxor $IN,$IN,$Xl + $UCMP r9,$inp + bgt Loop_2x # done yet? + + cmplwi $len,0 + bne Leven + +Lshort: + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo + vpmsumd $Xm,$IN,$H # H.hi·Xi.lo+H.lo·Xi.hi + vpmsumd $Xh,$IN,$Hh # H.hi·Xi.hi + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + +Leven: + vxor $Xl,$Xl,$t1 + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + mtspr 256,$vrsave + blr + .long 0 + .byte 0,12,0x14,0,0,0,4,0 + .long 0 +___ +{ +my ($Xl3,$Xm2,$IN2,$H3l,$H3,$H3h, + $Xh3,$Xm3,$IN3,$H4l,$H4,$H4h) = map("v$_",(20..31)); +my $IN0=$IN; +my ($H21l,$H21h,$loperm,$hiperm) = ($Hl,$Hh,$H2l,$H2h); + +$code.=<<___; +.align 5 +.gcm_ghash_p8_4x: +Lgcm_ghash_p8_4x: + $STU $sp,-$FRAME($sp) + li r10,`15+6*$SIZE_T` + li r11,`31+6*$SIZE_T` + stvx v20,r10,$sp + addi r10,r10,32 + stvx v21,r11,$sp + addi r11,r11,32 + stvx v22,r10,$sp + addi r10,r10,32 + stvx v23,r11,$sp + addi r11,r11,32 + stvx v24,r10,$sp + addi r10,r10,32 + stvx v25,r11,$sp + addi r11,r11,32 + stvx v26,r10,$sp + addi r10,r10,32 + stvx v27,r11,$sp + addi r11,r11,32 + stvx v28,r10,$sp + addi r10,r10,32 + stvx v29,r11,$sp + addi r11,r11,32 + stvx v30,r10,$sp + li r10,0x60 + stvx v31,r11,$sp + li r0,-1 + stw $vrsave,`$FRAME-4`($sp) # save vrsave + mtspr 256,r0 # preserve all AltiVec registers + + lvsl $t0,0,r8 # 0x0001..0e0f + #lvx_u $H2l,r8,$Htbl # load H^2 + li r8,0x70 + lvx_u $H2, r9,$Htbl + li r9,0x80 + vspltisb $t1,8 # 0x0808..0808 + #lvx_u $H2h,r10,$Htbl + li r10,0x90 + lvx_u $H3l,r8,$Htbl # load H^3 + li r8,0xa0 + lvx_u $H3, r9,$Htbl + li r9,0xb0 + lvx_u $H3h,r10,$Htbl + li r10,0xc0 + lvx_u $H4l,r8,$Htbl # load H^4 + li r8,0x10 + lvx_u $H4, r9,$Htbl + li r9,0x20 + lvx_u $H4h,r10,$Htbl + li r10,0x30 + + vsldoi $t2,$zero,$t1,8 # 0x0000..0808 + vaddubm $hiperm,$t0,$t2 # 0x0001..1617 + vaddubm $loperm,$t1,$hiperm # 0x0809..1e1f + + $SHRI $len,$len,4 # this allows to use sign bit + # as carry + lvx_u $IN0,0,$inp # load input + lvx_u $IN1,r8,$inp + subic. 
$len,$len,8 + lvx_u $IN2,r9,$inp + lvx_u $IN3,r10,$inp + addi $inp,$inp,0x40 + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + le?vperm $IN3,$IN3,$IN3,$lemask + + vxor $Xh,$IN0,$Xl + + vpmsumd $Xl1,$IN1,$H3l + vpmsumd $Xm1,$IN1,$H3 + vpmsumd $Xh1,$IN1,$H3h + + vperm $H21l,$H2,$H,$hiperm + vperm $t0,$IN2,$IN3,$loperm + vperm $H21h,$H2,$H,$loperm + vperm $t1,$IN2,$IN3,$hiperm + vpmsumd $Xm2,$IN2,$H2 # H^2.lo·Xi+2.hi+H^2.hi·Xi+2.lo + vpmsumd $Xl3,$t0,$H21l # H^2.lo·Xi+2.lo+H.lo·Xi+3.lo + vpmsumd $Xm3,$IN3,$H # H.hi·Xi+3.lo +H.lo·Xi+3.hi + vpmsumd $Xh3,$t1,$H21h # H^2.hi·Xi+2.hi+H.hi·Xi+3.hi + + vxor $Xm2,$Xm2,$Xm1 + vxor $Xl3,$Xl3,$Xl1 + vxor $Xm3,$Xm3,$Xm2 + vxor $Xh3,$Xh3,$Xh1 + + blt Ltail_4x + +Loop_4x: + lvx_u $IN0,0,$inp + lvx_u $IN1,r8,$inp + subic. $len,$len,4 + lvx_u $IN2,r9,$inp + lvx_u $IN3,r10,$inp + addi $inp,$inp,0x40 + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + le?vperm $IN3,$IN3,$IN3,$lemask + le?vperm $IN0,$IN0,$IN0,$lemask + + vpmsumd $Xl,$Xh,$H4l # H^4.lo·Xi.lo + vpmsumd $Xm,$Xh,$H4 # H^4.hi·Xi.lo+H^4.lo·Xi.hi + vpmsumd $Xh,$Xh,$H4h # H^4.hi·Xi.hi + vpmsumd $Xl1,$IN1,$H3l + vpmsumd $Xm1,$IN1,$H3 + vpmsumd $Xh1,$IN1,$H3h + + vxor $Xl,$Xl,$Xl3 + vxor $Xm,$Xm,$Xm3 + vxor $Xh,$Xh,$Xh3 + vperm $t0,$IN2,$IN3,$loperm + vperm $t1,$IN2,$IN3,$hiperm + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + vpmsumd $Xl3,$t0,$H21l # H.lo·Xi+3.lo +H^2.lo·Xi+2.lo + vpmsumd $Xh3,$t1,$H21h # H.hi·Xi+3.hi +H^2.hi·Xi+2.hi + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xm2,$IN2,$H2 # H^2.hi·Xi+2.lo+H^2.lo·Xi+2.hi + vpmsumd $Xm3,$IN3,$H # H.hi·Xi+3.lo +H.lo·Xi+3.hi + vpmsumd $Xl,$Xl,$xC2 + + vxor $Xl3,$Xl3,$Xl1 + vxor $Xh3,$Xh3,$Xh1 + vxor $Xh,$Xh,$IN0 + vxor $Xm2,$Xm2,$Xm1 + vxor $Xh,$Xh,$t1 + vxor $Xm3,$Xm3,$Xm2 + vxor $Xh,$Xh,$Xl + bge Loop_4x + +Ltail_4x: + vpmsumd $Xl,$Xh,$H4l # H^4.lo·Xi.lo + vpmsumd $Xm,$Xh,$H4 # H^4.hi·Xi.lo+H^4.lo·Xi.hi + vpmsumd $Xh,$Xh,$H4h # H^4.hi·Xi.hi + + vxor $Xl,$Xl,$Xl3 + vxor $Xm,$Xm,$Xm3 + + vpmsumd $t2,$Xl,$xC2 # 1st reduction phase + + vsldoi $t0,$Xm,$zero,8 + vsldoi $t1,$zero,$Xm,8 + vxor $Xh,$Xh,$Xh3 + vxor $Xl,$Xl,$t0 + vxor $Xh,$Xh,$t1 + + vsldoi $Xl,$Xl,$Xl,8 + vxor $Xl,$Xl,$t2 + + vsldoi $t1,$Xl,$Xl,8 # 2nd reduction phase + vpmsumd $Xl,$Xl,$xC2 + vxor $t1,$t1,$Xh + vxor $Xl,$Xl,$t1 + + addic. 
$len,$len,4 + beq Ldone_4x + + lvx_u $IN0,0,$inp + ${UCMP}i $len,2 + li $len,-4 + blt Lone + lvx_u $IN1,r8,$inp + beq Ltwo + +Lthree: + lvx_u $IN2,r9,$inp + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + le?vperm $IN2,$IN2,$IN2,$lemask + + vxor $Xh,$IN0,$Xl + vmr $H4l,$H3l + vmr $H4, $H3 + vmr $H4h,$H3h + + vperm $t0,$IN1,$IN2,$loperm + vperm $t1,$IN1,$IN2,$hiperm + vpmsumd $Xm2,$IN1,$H2 # H^2.lo·Xi+1.hi+H^2.hi·Xi+1.lo + vpmsumd $Xm3,$IN2,$H # H.hi·Xi+2.lo +H.lo·Xi+2.hi + vpmsumd $Xl3,$t0,$H21l # H^2.lo·Xi+1.lo+H.lo·Xi+2.lo + vpmsumd $Xh3,$t1,$H21h # H^2.hi·Xi+1.hi+H.hi·Xi+2.hi + + vxor $Xm3,$Xm3,$Xm2 + b Ltail_4x + +.align 4 +Ltwo: + le?vperm $IN0,$IN0,$IN0,$lemask + le?vperm $IN1,$IN1,$IN1,$lemask + + vxor $Xh,$IN0,$Xl + vperm $t0,$zero,$IN1,$loperm + vperm $t1,$zero,$IN1,$hiperm + + vsldoi $H4l,$zero,$H2,8 + vmr $H4, $H2 + vsldoi $H4h,$H2,$zero,8 + + vpmsumd $Xl3,$t0, $H21l # H.lo·Xi+1.lo + vpmsumd $Xm3,$IN1,$H # H.hi·Xi+1.lo+H.lo·Xi+2.hi + vpmsumd $Xh3,$t1, $H21h # H.hi·Xi+1.hi + + b Ltail_4x + +.align 4 +Lone: + le?vperm $IN0,$IN0,$IN0,$lemask + + vsldoi $H4l,$zero,$H,8 + vmr $H4, $H + vsldoi $H4h,$H,$zero,8 + + vxor $Xh,$IN0,$Xl + vxor $Xl3,$Xl3,$Xl3 + vxor $Xm3,$Xm3,$Xm3 + vxor $Xh3,$Xh3,$Xh3 + + b Ltail_4x + +Ldone_4x: + le?vperm $Xl,$Xl,$Xl,$lemask + stvx_u $Xl,0,$Xip # write out Xi + + li r10,`15+6*$SIZE_T` + li r11,`31+6*$SIZE_T` + mtspr 256,$vrsave + lvx v20,r10,$sp + addi r10,r10,32 + lvx v21,r11,$sp + addi r11,r11,32 + lvx v22,r10,$sp + addi r10,r10,32 + lvx v23,r11,$sp + addi r11,r11,32 + lvx v24,r10,$sp + addi r10,r10,32 + lvx v25,r11,$sp + addi r11,r11,32 + lvx v26,r10,$sp + addi r10,r10,32 + lvx v27,r11,$sp + addi r11,r11,32 + lvx v28,r10,$sp + addi r10,r10,32 + lvx v29,r11,$sp + addi r11,r11,32 + lvx v30,r10,$sp + lvx v31,r11,$sp + addi $sp,$sp,$FRAME + blr + .long 0 + .byte 0,12,0x04,0,0x80,0,4,0 + .long 0 +___ +} +$code.=<<___; +.size .gcm_ghash_p8,.-.gcm_ghash_p8 + +.asciz "GHASH for PowerISA 2.07, CRYPTOGAMS by " +.align 2 +___ + +foreach (split("\n",$code)) { + s/\`([^\`]*)\`/eval $1/geo; + + if ($flavour =~ /le$/o) { # little-endian + s/le\?//o or + s/be\?/#be#/o; + } else { + s/le\?/#le#/o or + s/be\?//o; + } + print $_,"\n"; +} + +close STDOUT; # enforce flush diff --git a/crypto/aesgcm/ghashv8-armx.pl b/crypto/aesgcm/ghashv8-armx.pl new file mode 100644 index 0000000..9bbca10 --- /dev/null +++ b/crypto/aesgcm/ghashv8-armx.pl @@ -0,0 +1,430 @@ +#! /usr/bin/env perl +# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# GHASH for ARMv8 Crypto Extension, 64-bit polynomial multiplication. +# +# June 2014 +# +# Initial version was developed in tight cooperation with Ard +# Biesheuvel from bits-n-pieces from +# other assembly modules. Just like aesv8-armx.pl this module +# supports both AArch32 and AArch64 execution modes. 
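The "2x aggregated reduction" referred to below rests on the identity Xi+2 = H^2*(Ii xor Xi) xor H*Ii+1, so two carry-less products can be computed independently and summed before a single modular reduction. A rough C sketch of the identity, reusing the hypothetical u128/gf128_mul helpers from the sketch above (the real assembly defers the reduction until after the sum; here it happens inside gf128_mul for clarity):

static u128 xor128(u128 a, u128 b) { a.hi ^= b.hi; a.lo ^= b.lo; return a; }

/* Two blocks per iteration: Xi+2 = H^2*(I0 ^ Xi) ^ H*I1. */
static void ghash_2x_ref(u128 *Xi, const u128 I[2], u128 H, u128 H2) {
  u128 acc = xor128(*Xi, I[0]);        /* I0 ^ Xi            */
  *Xi = xor128(gf128_mul(acc, H2),     /* H^2 * (I0 ^ Xi)    */
               gf128_mul(I[1], H));    /* ^  H * I1          */
}

This is why gcm_init_v8 precomputes H^2 alongside the twisted H, and why the 4x PowerISA path above additionally stores H^3 and H^4.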
+# +# July 2014 +# +# Implement 2x aggregated reduction [see ghash-x86.pl for background +# information]. +# +# Current performance in cycles per processed byte: +# +# PMULL[2] 32-bit NEON(*) +# Apple A7 0.92 5.62 +# Cortex-A53 1.01 8.39 +# Cortex-A57 1.17 7.61 +# Denver 0.71 6.02 +# Mongoose 1.10 8.06 +# +# (*) presented for reference/comparison purposes; + +$flavour = shift; +$output = shift; + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../../perlasm/arm-xlate.pl" and -f $xlate) or +die "can't locate arm-xlate.pl"; + +open OUT,"| \"$^X\" $xlate $flavour $output"; +*STDOUT=*OUT; + +$Xi="x0"; # argument block +$Htbl="x1"; +$inp="x2"; +$len="x3"; + +$inc="x12"; + +{ +my ($Xl,$Xm,$Xh,$IN)=map("q$_",(0..3)); +my ($t0,$t1,$t2,$xC2,$H,$Hhl,$H2)=map("q$_",(8..14)); + +$code=<<___; +#include + +.text +___ +$code.=".arch armv8-a+crypto\n" if ($flavour =~ /64/); +$code.=<<___ if ($flavour !~ /64/); +.fpu neon +.code 32 +#undef __thumb2__ +___ + +################################################################################ +# void gcm_init_v8(u128 Htable[16],const u64 H[2]); +# +# input: 128-bit H - secret parameter E(K,0^128) +# output: precomputed table filled with degrees of twisted H; +# H is twisted to handle reverse bitness of GHASH; +# only few of 16 slots of Htable[16] are used; +# data is opaque to outside world (which allows to +# optimize the code independently); +# +$code.=<<___; +.global gcm_init_v8 +.type gcm_init_v8,%function +.align 4 +gcm_init_v8: + vld1.64 {$t1},[x1] @ load input H + vmov.i8 $xC2,#0xe1 + vshl.i64 $xC2,$xC2,#57 @ 0xc2.0 + vext.8 $IN,$t1,$t1,#8 + vshr.u64 $t2,$xC2,#63 + vdup.32 $t1,${t1}[1] + vext.8 $t0,$t2,$xC2,#8 @ t0=0xc2....01 + vshr.u64 $t2,$IN,#63 + vshr.s32 $t1,$t1,#31 @ broadcast carry bit + vand $t2,$t2,$t0 + vshl.i64 $IN,$IN,#1 + vext.8 $t2,$t2,$t2,#8 + vand $t0,$t0,$t1 + vorr $IN,$IN,$t2 @ H<<<=1 + veor $H,$IN,$t0 @ twisted H + vst1.64 {$H},[x0],#16 @ store Htable[0] + + @ calculate H^2 + vext.8 $t0,$H,$H,#8 @ Karatsuba pre-processing + vpmull.p64 $Xl,$H,$H + veor $t0,$t0,$H + vpmull2.p64 $Xh,$H,$H + vpmull.p64 $Xm,$t0,$t0 + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $H2,$Xl,$t2 + + vext.8 $t1,$H2,$H2,#8 @ Karatsuba pre-processing + veor $t1,$t1,$H2 + vext.8 $Hhl,$t0,$t1,#8 @ pack Karatsuba pre-processed + vst1.64 {$Hhl-$H2},[x0] @ store Htable[1..2] + + ret +.size gcm_init_v8,.-gcm_init_v8 +___ +################################################################################ +# void gcm_gmult_v8(u64 Xi[2],const u128 Htable[16]); +# +# input: Xi - current hash value; +# Htable - table precomputed in gcm_init_v8; +# output: Xi - next hash value Xi; +# +$code.=<<___; +.global gcm_gmult_v8 +.type gcm_gmult_v8,%function +.align 4 +gcm_gmult_v8: + vld1.64 {$t1},[$Xi] @ load Xi + vmov.i8 $xC2,#0xe1 + vld1.64 {$H-$Hhl},[$Htbl] @ load twisted H, ... 
+ vshl.u64 $xC2,$xC2,#57 +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vext.8 $IN,$t1,$t1,#8 + + vpmull.p64 $Xl,$H,$IN @ H.lo·Xi.lo + veor $t1,$t1,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H,$IN @ H.hi·Xi.hi + vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi) + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $Xl,$Xl,$t2 + +#ifndef __ARMEB__ + vrev64.8 $Xl,$Xl +#endif + vext.8 $Xl,$Xl,$Xl,#8 + vst1.64 {$Xl},[$Xi] @ write out Xi + + ret +.size gcm_gmult_v8,.-gcm_gmult_v8 +___ +################################################################################ +# void gcm_ghash_v8(u64 Xi[2],const u128 Htable[16],const u8 *inp,size_t len); +# +# input: table precomputed in gcm_init_v8; +# current hash value Xi; +# pointer to input data; +# length of input data in bytes, but divisible by block size; +# output: next hash value Xi; +# +$code.=<<___; +.global gcm_ghash_v8 +.type gcm_ghash_v8,%function +.align 4 +gcm_ghash_v8: +___ +$code.=<<___ if ($flavour !~ /64/); + vstmdb sp!,{d8-d15} @ 32-bit ABI says so +___ +$code.=<<___; + vld1.64 {$Xl},[$Xi] @ load [rotated] Xi + @ "[rotated]" means that + @ loaded value would have + @ to be rotated in order to + @ make it appear as in + @ alorithm specification + subs $len,$len,#32 @ see if $len is 32 or larger + mov $inc,#16 @ $inc is used as post- + @ increment for input pointer; + @ as loop is modulo-scheduled + @ $inc is zeroed just in time + @ to preclude oversteping + @ inp[len], which means that + @ last block[s] are actually + @ loaded twice, but last + @ copy is not processed + vld1.64 {$H-$Hhl},[$Htbl],#32 @ load twisted H, ..., H^2 + vmov.i8 $xC2,#0xe1 + vld1.64 {$H2},[$Htbl] + cclr $inc,eq @ is it time to zero $inc? + vext.8 $Xl,$Xl,$Xl,#8 @ rotate Xi + vld1.64 {$t0},[$inp],#16 @ load [rotated] I[0] + vshl.u64 $xC2,$xC2,#57 @ compose 0xc2.0 constant +#ifndef __ARMEB__ + vrev64.8 $t0,$t0 + vrev64.8 $Xl,$Xl +#endif + vext.8 $IN,$t0,$t0,#8 @ rotate I[0] + b.lo .Lodd_tail_v8 @ $len was less than 32 +___ +{ my ($Xln,$Xmn,$Xhn,$In) = map("q$_",(4..7)); + ####### + # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = + # [(H*Ii+1) + (H*Xi+1)] mod P = + # [(H*Ii+1) + H^2*(Ii+Xi)] mod P + # +$code.=<<___; + vld1.64 {$t1},[$inp],$inc @ load [rotated] I[1] +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vext.8 $In,$t1,$t1,#8 + veor $IN,$IN,$Xl @ I[i]^=Xi + vpmull.p64 $Xln,$H,$In @ H·Ii+1 + veor $t1,$t1,$In @ Karatsuba pre-processing + vpmull2.p64 $Xhn,$H,$In + b .Loop_mod2x_v8 + +.align 4 +.Loop_mod2x_v8: + vext.8 $t2,$IN,$IN,#8 + subs $len,$len,#32 @ is there more data? + vpmull.p64 $Xl,$H2,$IN @ H^2.lo·Xi.lo + cclr $inc,lo @ is it time to zero $inc? + + vpmull.p64 $Xmn,$Hhl,$t1 + veor $t2,$t2,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H2,$IN @ H^2.hi·Xi.hi + veor $Xl,$Xl,$Xln @ accumulate + vpmull2.p64 $Xm,$Hhl,$t2 @ (H^2.lo+H^2.hi)·(Xi.lo+Xi.hi) + vld1.64 {$t0},[$inp],$inc @ load [rotated] I[i+2] + + veor $Xh,$Xh,$Xhn + cclr $inc,eq @ is it time to zero $inc? 
+ veor $Xm,$Xm,$Xmn + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + vld1.64 {$t1},[$inp],$inc @ load [rotated] I[i+3] +#ifndef __ARMEB__ + vrev64.8 $t0,$t0 +#endif + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + +#ifndef __ARMEB__ + vrev64.8 $t1,$t1 +#endif + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + vext.8 $In,$t1,$t1,#8 + vext.8 $IN,$t0,$t0,#8 + veor $Xl,$Xm,$t2 + vpmull.p64 $Xln,$H,$In @ H·Ii+1 + veor $IN,$IN,$Xh @ accumulate $IN early + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $IN,$IN,$t2 + veor $t1,$t1,$In @ Karatsuba pre-processing + veor $IN,$IN,$Xl + vpmull2.p64 $Xhn,$H,$In + b.hs .Loop_mod2x_v8 @ there was at least 32 more bytes + + veor $Xh,$Xh,$t2 + vext.8 $IN,$t0,$t0,#8 @ re-construct $IN + adds $len,$len,#32 @ re-construct $len + veor $Xl,$Xl,$Xh @ re-construct $Xl + b.eq .Ldone_v8 @ is $len zero? +___ +} +$code.=<<___; +.Lodd_tail_v8: + vext.8 $t2,$Xl,$Xl,#8 + veor $IN,$IN,$Xl @ inp^=Xi + veor $t1,$t0,$t2 @ $t1 is rotated inp^Xi + + vpmull.p64 $Xl,$H,$IN @ H.lo·Xi.lo + veor $t1,$t1,$IN @ Karatsuba pre-processing + vpmull2.p64 $Xh,$H,$IN @ H.hi·Xi.hi + vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi) + + vext.8 $t1,$Xl,$Xh,#8 @ Karatsuba post-processing + veor $t2,$Xl,$Xh + veor $Xm,$Xm,$t1 + veor $Xm,$Xm,$t2 + vpmull.p64 $t2,$Xl,$xC2 @ 1st phase of reduction + + vmov $Xh#lo,$Xm#hi @ Xh|Xm - 256-bit result + vmov $Xm#hi,$Xl#lo @ Xm is rotated Xl + veor $Xl,$Xm,$t2 + + vext.8 $t2,$Xl,$Xl,#8 @ 2nd phase of reduction + vpmull.p64 $Xl,$Xl,$xC2 + veor $t2,$t2,$Xh + veor $Xl,$Xl,$t2 + +.Ldone_v8: +#ifndef __ARMEB__ + vrev64.8 $Xl,$Xl +#endif + vext.8 $Xl,$Xl,$Xl,#8 + vst1.64 {$Xl},[$Xi] @ write out Xi + +___ +$code.=<<___ if ($flavour !~ /64/); + vldmia sp!,{d8-d15} @ 32-bit ABI says so +___ +$code.=<<___; + ret +.size gcm_ghash_v8,.-gcm_ghash_v8 +___ +} +$code.=<<___; +.asciz "GHASH for ARMv8, CRYPTOGAMS by " +.align 2 +___ + +if ($flavour =~ /64/) { ######## 64-bit code + sub unvmov { + my $arg=shift; + + $arg =~ m/q([0-9]+)#(lo|hi),\s*q([0-9]+)#(lo|hi)/o && + sprintf "ins v%d.d[%d],v%d.d[%d]",$1,($2 eq "lo")?0:1,$3,($4 eq "lo")?0:1; + } + foreach(split("\n",$code)) { + s/cclr\s+([wx])([^,]+),\s*([a-z]+)/csel $1$2,$1zr,$1$2,$3/o or + s/vmov\.i8/movi/o or # fix up legacy mnemonics + s/vmov\s+(.*)/unvmov($1)/geo or + s/vext\.8/ext/o or + s/vshr\.s/sshr\.s/o or + s/vshr/ushr/o or + s/^(\s+)v/$1/o or # strip off v prefix + s/\bbx\s+lr\b/ret/o; + + s/\bq([0-9]+)\b/"v".($1<8?$1:$1+8).".16b"/geo; # old->new registers + s/@\s/\/\//o; # old->new style commentary + + # fix up remainig legacy suffixes + s/\.[ui]?8(\s)/$1/o; + s/\.[uis]?32//o and s/\.16b/\.4s/go; + m/\.p64/o and s/\.16b/\.1q/o; # 1st pmull argument + m/l\.p64/o and s/\.16b/\.1d/go; # 2nd and 3rd pmull arguments + s/\.[uisp]?64//o and s/\.16b/\.2d/go; + s/\.[42]([sd])\[([0-3])\]/\.$1\[$2\]/o; + + print $_,"\n"; + } +} else { ######## 32-bit code + sub unvdup32 { + my $arg=shift; + + $arg =~ m/q([0-9]+),\s*q([0-9]+)\[([0-3])\]/o && + sprintf "vdup.32 q%d,d%d[%d]",$1,2*$2+($3>>1),$3&1; + } + sub unvpmullp64 { + my ($mnemonic,$arg)=@_; + + if ($arg =~ m/q([0-9]+),\s*q([0-9]+),\s*q([0-9]+)/o) { + my $word = 0xf2a00e00|(($1&7)<<13)|(($1&8)<<19) + |(($2&7)<<17)|(($2&8)<<4) + |(($3&7)<<1) |(($3&8)<<2); + $word |= 0x00010001 if ($mnemonic =~ "2"); + # since ARMv7 instructions are always encoded little-endian. 
+ # correct solution is to use .inst directive, but older + # assemblers don't implement it:-( + sprintf ".byte\t0x%02x,0x%02x,0x%02x,0x%02x\t@ %s %s", + $word&0xff,($word>>8)&0xff, + ($word>>16)&0xff,($word>>24)&0xff, + $mnemonic,$arg; + } + } + + foreach(split("\n",$code)) { + s/\b[wx]([0-9]+)\b/r$1/go; # new->old registers + s/\bv([0-9])\.[12468]+[bsd]\b/q$1/go; # new->old registers + s/\/\/\s?/@ /o; # new->old style commentary + + # fix up remainig new-style suffixes + s/\],#[0-9]+/]!/o; + + s/cclr\s+([^,]+),\s*([a-z]+)/mov$2 $1,#0/o or + s/vdup\.32\s+(.*)/unvdup32($1)/geo or + s/v?(pmull2?)\.p64\s+(.*)/unvpmullp64($1,$2)/geo or + s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo or + s/^(\s+)b\./$1b/o or + s/^(\s+)ret/$1bx\tlr/o; + + print $_,"\n"; + } +} + +close STDOUT; # enforce flush diff --git a/crypto/blake2s-load-sse2.h b/crypto/blake2s-load-sse2.h new file mode 100644 index 0000000..d2e9a09 --- /dev/null +++ b/crypto/blake2s-load-sse2.h @@ -0,0 +1,60 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_SSE2_H +#define BLAKE2S_LOAD_SSE2_H + +#define LOAD_MSG_0_1(buf) buf = _mm_set_epi32(m6,m4,m2,m0) +#define LOAD_MSG_0_2(buf) buf = _mm_set_epi32(m7,m5,m3,m1) +#define LOAD_MSG_0_3(buf) buf = _mm_set_epi32(m14,m12,m10,m8) +#define LOAD_MSG_0_4(buf) buf = _mm_set_epi32(m15,m13,m11,m9) +#define LOAD_MSG_1_1(buf) buf = _mm_set_epi32(m13,m9,m4,m14) +#define LOAD_MSG_1_2(buf) buf = _mm_set_epi32(m6,m15,m8,m10) +#define LOAD_MSG_1_3(buf) buf = _mm_set_epi32(m5,m11,m0,m1) +#define LOAD_MSG_1_4(buf) buf = _mm_set_epi32(m3,m7,m2,m12) +#define LOAD_MSG_2_1(buf) buf = _mm_set_epi32(m15,m5,m12,m11) +#define LOAD_MSG_2_2(buf) buf = _mm_set_epi32(m13,m2,m0,m8) +#define LOAD_MSG_2_3(buf) buf = _mm_set_epi32(m9,m7,m3,m10) +#define LOAD_MSG_2_4(buf) buf = _mm_set_epi32(m4,m1,m6,m14) +#define LOAD_MSG_3_1(buf) buf = _mm_set_epi32(m11,m13,m3,m7) +#define LOAD_MSG_3_2(buf) buf = _mm_set_epi32(m14,m12,m1,m9) +#define LOAD_MSG_3_3(buf) buf = _mm_set_epi32(m15,m4,m5,m2) +#define LOAD_MSG_3_4(buf) buf = _mm_set_epi32(m8,m0,m10,m6) +#define LOAD_MSG_4_1(buf) buf = _mm_set_epi32(m10,m2,m5,m9) +#define LOAD_MSG_4_2(buf) buf = _mm_set_epi32(m15,m4,m7,m0) +#define LOAD_MSG_4_3(buf) buf = _mm_set_epi32(m3,m6,m11,m14) +#define LOAD_MSG_4_4(buf) buf = _mm_set_epi32(m13,m8,m12,m1) +#define LOAD_MSG_5_1(buf) buf = _mm_set_epi32(m8,m0,m6,m2) +#define LOAD_MSG_5_2(buf) buf = _mm_set_epi32(m3,m11,m10,m12) +#define LOAD_MSG_5_3(buf) buf = _mm_set_epi32(m1,m15,m7,m4) +#define LOAD_MSG_5_4(buf) buf = _mm_set_epi32(m9,m14,m5,m13) +#define LOAD_MSG_6_1(buf) buf = _mm_set_epi32(m4,m14,m1,m12) +#define LOAD_MSG_6_2(buf) buf = _mm_set_epi32(m10,m13,m15,m5) +#define LOAD_MSG_6_3(buf) buf = _mm_set_epi32(m8,m9,m6,m0) +#define LOAD_MSG_6_4(buf) buf = _mm_set_epi32(m11,m2,m3,m7) +#define LOAD_MSG_7_1(buf) buf = _mm_set_epi32(m3,m12,m7,m13) +#define LOAD_MSG_7_2(buf) buf = _mm_set_epi32(m9,m1,m14,m11) +#define LOAD_MSG_7_3(buf) buf = _mm_set_epi32(m2,m8,m15,m5) +#define LOAD_MSG_7_4(buf) buf 
= _mm_set_epi32(m10,m6,m4,m0) +#define LOAD_MSG_8_1(buf) buf = _mm_set_epi32(m0,m11,m14,m6) +#define LOAD_MSG_8_2(buf) buf = _mm_set_epi32(m8,m3,m9,m15) +#define LOAD_MSG_8_3(buf) buf = _mm_set_epi32(m10,m1,m13,m12) +#define LOAD_MSG_8_4(buf) buf = _mm_set_epi32(m5,m4,m7,m2) +#define LOAD_MSG_9_1(buf) buf = _mm_set_epi32(m1,m7,m8,m10) +#define LOAD_MSG_9_2(buf) buf = _mm_set_epi32(m5,m6,m4,m2) +#define LOAD_MSG_9_3(buf) buf = _mm_set_epi32(m13,m3,m9,m15) +#define LOAD_MSG_9_4(buf) buf = _mm_set_epi32(m0,m12,m14,m11) + + +#endif diff --git a/crypto/blake2s-load-sse41.h b/crypto/blake2s-load-sse41.h new file mode 100644 index 0000000..c316fb5 --- /dev/null +++ b/crypto/blake2s-load-sse41.h @@ -0,0 +1,229 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_SSE41_H +#define BLAKE2S_LOAD_SSE41_H + +#define LOAD_MSG_0_1(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(2,0,2,0))); + +#define LOAD_MSG_0_2(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(3,1,3,1))); + +#define LOAD_MSG_0_3(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(2,0,2,0))); + +#define LOAD_MSG_0_4(buf) \ +buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(3,1,3,1))); + +#define LOAD_MSG_1_1(buf) \ +t0 = _mm_blend_epi16(m1, m2, 0x0C); \ +t1 = _mm_slli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,0,3)); + +#define LOAD_MSG_1_2(buf) \ +t0 = _mm_shuffle_epi32(m2,_MM_SHUFFLE(0,0,2,0)); \ +t1 = _mm_blend_epi16(m1,m3,0xC0); \ +t2 = _mm_blend_epi16(t0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_1_3(buf) \ +t0 = _mm_slli_si128(m1, 4); \ +t1 = _mm_blend_epi16(m2, t0, 0x30); \ +t2 = _mm_blend_epi16(m0, t1, 0xF0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_1_4(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_slli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0, t1, 0x0C); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1)); + +#define LOAD_MSG_2_1(buf) \ +t0 = _mm_unpackhi_epi32(m2,m3); \ +t1 = _mm_blend_epi16(m3,m1,0x0C); \ +t2 = _mm_blend_epi16(t0, t1, 0x0F); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2)); + +#define LOAD_MSG_2_2(buf) \ +t0 = _mm_unpacklo_epi32(m2,m0); \ +t1 = _mm_blend_epi16(t0, m0, 0xF0); \ +t2 = _mm_slli_si128(m3, 8); \ +buf = _mm_blend_epi16(t1, t2, 0xC0); + +#define LOAD_MSG_2_3(buf) \ +t0 = _mm_blend_epi16(m0, m2, 0x3C); \ +t1 = _mm_srli_si128(m1, 12); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_2_4(buf) \ +t0 = _mm_slli_si128(m3, 4); \ +t1 = _mm_blend_epi16(m0, m1, 0x33); \ +t2 = _mm_blend_epi16(t1, t0, 0xC0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(0,1,2,3)); + +#define LOAD_MSG_3_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_unpackhi_epi32(t0, m2); \ +t2 = _mm_blend_epi16(t1, m3, 0x0C); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2)); + +#define LOAD_MSG_3_2(buf) \ +t0 = _mm_slli_si128(m2, 8); \ +t1 = 
_mm_blend_epi16(m3,m0,0x0C); \ +t2 = _mm_blend_epi16(t1, t0, 0xC0); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3)); + +#define LOAD_MSG_3_3(buf) \ +t0 = _mm_blend_epi16(m0,m1,0x0F); \ +t1 = _mm_blend_epi16(t0, m3, 0xC0); \ +buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2)); + +#define LOAD_MSG_3_4(buf) \ +t0 = _mm_unpacklo_epi32(m0,m2); \ +t1 = _mm_unpackhi_epi32(m1,m2); \ +buf = _mm_unpacklo_epi64(t1,t0); + +#define LOAD_MSG_4_1(buf) \ +t0 = _mm_unpacklo_epi64(m1,m2); \ +t1 = _mm_unpackhi_epi64(m0,m2); \ +t2 = _mm_blend_epi16(t0,t1,0x33); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3)); + +#define LOAD_MSG_4_2(buf) \ +t0 = _mm_unpackhi_epi64(m1,m3); \ +t1 = _mm_unpacklo_epi64(m0,m1); \ +buf = _mm_blend_epi16(t0,t1,0x33); + +#define LOAD_MSG_4_3(buf) \ +t0 = _mm_unpackhi_epi64(m3,m1); \ +t1 = _mm_unpackhi_epi64(m2,m0); \ +buf = _mm_blend_epi16(t1,t0,0x33); + +#define LOAD_MSG_4_4(buf) \ +t0 = _mm_blend_epi16(m0,m2,0x03); \ +t1 = _mm_slli_si128(t0, 8); \ +t2 = _mm_blend_epi16(t1,m3,0x0F); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,0,3)); + +#define LOAD_MSG_5_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_unpacklo_epi32(m0,m2); \ +buf = _mm_unpacklo_epi64(t0,t1); + +#define LOAD_MSG_5_2(buf) \ +t0 = _mm_srli_si128(m2, 4); \ +t1 = _mm_blend_epi16(m0,m3,0x03); \ +buf = _mm_blend_epi16(t1,t0,0x3C); + +#define LOAD_MSG_5_3(buf) \ +t0 = _mm_blend_epi16(m1,m0,0x0C); \ +t1 = _mm_srli_si128(m3, 4); \ +t2 = _mm_blend_epi16(t0,t1,0x30); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,3,0)); + +#define LOAD_MSG_5_4(buf) \ +t0 = _mm_unpacklo_epi64(m1,m2); \ +t1= _mm_shuffle_epi32(m3, _MM_SHUFFLE(0,2,0,1)); \ +buf = _mm_blend_epi16(t0,t1,0x33); + +#define LOAD_MSG_6_1(buf) \ +t0 = _mm_slli_si128(m1, 12); \ +t1 = _mm_blend_epi16(m0,m3,0x33); \ +buf = _mm_blend_epi16(t1,t0,0xC0); + +#define LOAD_MSG_6_2(buf) \ +t0 = _mm_blend_epi16(m3,m2,0x30); \ +t1 = _mm_srli_si128(m1, 4); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,3,0)); + +#define LOAD_MSG_6_3(buf) \ +t0 = _mm_unpacklo_epi64(m0,m2); \ +t1 = _mm_srli_si128(m1, 4); \ +buf = _mm_shuffle_epi32(_mm_blend_epi16(t0,t1,0x0C), _MM_SHUFFLE(2,3,1,0)); + +#define LOAD_MSG_6_4(buf) \ +t0 = _mm_unpackhi_epi32(m1,m2); \ +t1 = _mm_unpackhi_epi64(m0,t0); \ +buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2)); + +#define LOAD_MSG_7_1(buf) \ +t0 = _mm_unpackhi_epi32(m0,m1); \ +t1 = _mm_blend_epi16(t0,m3,0x0F); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(2,0,3,1)); + +#define LOAD_MSG_7_2(buf) \ +t0 = _mm_blend_epi16(m2,m3,0x30); \ +t1 = _mm_srli_si128(m0,4); \ +t2 = _mm_blend_epi16(t0,t1,0x03); \ +buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,2,3)); + +#define LOAD_MSG_7_3(buf) \ +t0 = _mm_unpackhi_epi64(m0,m3); \ +t1 = _mm_unpacklo_epi64(m1,m2); \ +t2 = _mm_blend_epi16(t0,t1,0x3C); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,2,3,1)); + +#define LOAD_MSG_7_4(buf) \ +t0 = _mm_unpacklo_epi32(m0,m1); \ +t1 = _mm_unpackhi_epi32(m1,m2); \ +buf = _mm_unpacklo_epi64(t0,t1); + +#define LOAD_MSG_8_1(buf) \ +t0 = _mm_unpackhi_epi32(m1,m3); \ +t1 = _mm_unpacklo_epi64(t0,m0); \ +t2 = _mm_blend_epi16(t1,m2,0xC0); \ +buf = _mm_shufflehi_epi16(t2,_MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_8_2(buf) \ +t0 = _mm_unpackhi_epi32(m0,m3); \ +t1 = _mm_blend_epi16(m2,t0,0xF0); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(0,2,1,3)); + +#define LOAD_MSG_8_3(buf) \ +t0 = _mm_blend_epi16(m2,m0,0x0C); \ +t1 = _mm_slli_si128(t0,4); \ +buf = _mm_blend_epi16(t1,m3,0x0F); + +#define LOAD_MSG_8_4(buf) \ +t0 = _mm_blend_epi16(m1,m0,0x30); \ +buf = 
_mm_shuffle_epi32(t0,_MM_SHUFFLE(1,0,3,2)); + +#define LOAD_MSG_9_1(buf) \ +t0 = _mm_blend_epi16(m0,m2,0x03); \ +t1 = _mm_blend_epi16(m1,m2,0x30); \ +t2 = _mm_blend_epi16(t1,t0,0x0F); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(1,3,0,2)); + +#define LOAD_MSG_9_2(buf) \ +t0 = _mm_slli_si128(m0,4); \ +t1 = _mm_blend_epi16(m1,t0,0xC0); \ +buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(1,2,0,3)); + +#define LOAD_MSG_9_3(buf) \ +t0 = _mm_unpackhi_epi32(m0,m3); \ +t1 = _mm_unpacklo_epi32(m2,m3); \ +t2 = _mm_unpackhi_epi64(t0,t1); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(3,0,2,1)); + +#define LOAD_MSG_9_4(buf) \ +t0 = _mm_blend_epi16(m3,m2,0xC0); \ +t1 = _mm_unpacklo_epi32(m0,m3); \ +t2 = _mm_blend_epi16(t0,t1,0x0F); \ +buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,1,2,3)); + +#endif diff --git a/crypto/blake2s-load-xop.h b/crypto/blake2s-load-xop.h new file mode 100644 index 0000000..a97ddcc --- /dev/null +++ b/crypto/blake2s-load-xop.h @@ -0,0 +1,191 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_LOAD_XOP_H +#define BLAKE2S_LOAD_XOP_H + +#define TOB(x) ((x)*4*0x01010101 + 0x03020100) /* ..or not TOB */ + +#if 0 +/* Basic VPPERM emulation, for testing purposes */ +static __m128i _mm_perm_epi8(const __m128i src1, const __m128i src2, const __m128i sel) +{ + const __m128i sixteen = _mm_set1_epi8(16); + const __m128i t0 = _mm_shuffle_epi8(src1, sel); + const __m128i s1 = _mm_shuffle_epi8(src2, _mm_sub_epi8(sel, sixteen)); + const __m128i mask = _mm_or_si128(_mm_cmpeq_epi8(sel, sixteen), + _mm_cmpgt_epi8(sel, sixteen)); /* (>=16) = 0xff : 00 */ + return _mm_blendv_epi8(t0, s1, mask); +} +#endif + +#define LOAD_MSG_0_1(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) ); + +#define LOAD_MSG_0_2(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) ); + +#define LOAD_MSG_0_3(buf) \ +buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) ); + +#define LOAD_MSG_0_4(buf) \ +buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) ); + +#define LOAD_MSG_1_1(buf) \ +t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(5),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_1_2(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(2),TOB(0),TOB(4),TOB(6)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_1_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(0),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_1_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(7),TOB(2),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) ); + +#define LOAD_MSG_2_1(buf) \ +t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(1),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(4),TOB(0)) ); + +#define LOAD_MSG_2_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, 
_mm_set_epi32(TOB(0),TOB(2),TOB(0),TOB(4)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_2_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(7),TOB(3),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_2_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(1),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_3_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(3),TOB(7)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(1),TOB(5)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(5),TOB(2)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_3_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_4_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(5)) ); + +#define LOAD_MSG_4_2(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_4_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(6),TOB(0),TOB(0)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) ); + +#define LOAD_MSG_4_4(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(4),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(4),TOB(0)) ); + +#define LOAD_MSG_5_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(2)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_5_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) ); + +#define LOAD_MSG_5_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(0),TOB(7),TOB(4)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); + +#define LOAD_MSG_5_4(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(5),TOB(0),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(5)) ); + +#define LOAD_MSG_6_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(0),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(4)) ); + +#define LOAD_MSG_6_2(buf) \ +t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(6),TOB(0),TOB(0),TOB(1)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(7),TOB(0)) ); + +#define LOAD_MSG_6_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(5),TOB(1),TOB(0)) ); + +#define LOAD_MSG_6_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(3),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); + 
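An aside on the TOB selector used throughout this header: TOB(x) = x*4*0x01010101 + 0x03020100 packs four consecutive byte indices that select 32-bit word x of the concatenated src1:src2 input of VPPERM (indices 16 and up reach src2). A tiny standalone check, not part of the header:

#include <stdio.h>
#define TOB(x) ((x)*4*0x01010101 + 0x03020100)

int main(void) {
  printf("TOB(2) = 0x%08X\n", TOB(2)); /* 0x0B0A0908: bytes 8..11, word 2 of src1 */
  printf("TOB(5) = 0x%08X\n", TOB(5)); /* 0x17161514: bytes 20..23, word 1 of src2 */
  return 0;
}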
+#define LOAD_MSG_7_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(0),TOB(7),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(5)) ); + +#define LOAD_MSG_7_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(5),TOB(1),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_7_3(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(2),TOB(0),TOB(0),TOB(5)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) ); + +#define LOAD_MSG_7_4(buf) \ +t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(6),TOB(4),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(0)) ); + +#define LOAD_MSG_8_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \ +t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) ); + +#define LOAD_MSG_8_2(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(4),TOB(3),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(7)) ); + +#define LOAD_MSG_8_3(buf) \ +t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(6),TOB(1),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(5),TOB(4)) ); \ + +#define LOAD_MSG_8_4(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(4),TOB(7),TOB(2)) ); + +#define LOAD_MSG_9_1(buf) \ +t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(7),TOB(0),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(4),TOB(6)) ); + +#define LOAD_MSG_9_2(buf) \ +buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(6),TOB(4),TOB(2)) ); + +#define LOAD_MSG_9_3(buf) \ +t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(3),TOB(5),TOB(0)) ); \ +buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(7)) ); + +#define LOAD_MSG_9_4(buf) \ +t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(7)) ); \ +buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(6),TOB(0)) ); + +#endif diff --git a/crypto/blake2s-round.h b/crypto/blake2s-round.h new file mode 100644 index 0000000..44a5574 --- /dev/null +++ b/crypto/blake2s-round.h @@ -0,0 +1,88 @@ +/* + BLAKE2 reference source code package - optimized C implementations + + Copyright 2012, Samuel Neves . You may use this under the + terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at + your option. The terms of these licenses can be found at: + + - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 + - OpenSSL license : https://www.openssl.org/source/license.html + - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + + More information about the BLAKE2 hash function can be found at + https://blake2.net. +*/ +#ifndef BLAKE2S_ROUND_H +#define BLAKE2S_ROUND_H + +#define LOADU(p) _mm_loadu_si128( (const __m128i *)(p) ) +#define STOREU(p,r) _mm_storeu_si128((__m128i *)(p), r) + +#define TOF(reg) _mm_castsi128_ps((reg)) +#define TOI(reg) _mm_castps_si128((reg)) + +#define LIKELY(x) __builtin_expect((x),1) + + +/* Microarchitecture-specific macros */ +#ifndef HAVE_XOP +#ifdef HAVE_SSSE3 +#define _mm_roti_epi32(r, c) ( \ + (8==-(c)) ? _mm_shuffle_epi8(r,r8) \ + : (16==-(c)) ? 
_mm_shuffle_epi8(r,r16) \ + : _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) )) ) +#else +#define _mm_roti_epi32(r, c) _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) )) +#endif +#else +/* ... */ +#endif + + +#define G1(row1,row2,row3,row4,buf) \ + row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \ + row4 = _mm_xor_si128( row4, row1 ); \ + row4 = _mm_roti_epi32(row4, -16); \ + row3 = _mm_add_epi32( row3, row4 ); \ + row2 = _mm_xor_si128( row2, row3 ); \ + row2 = _mm_roti_epi32(row2, -12); + +#define G2(row1,row2,row3,row4,buf) \ + row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \ + row4 = _mm_xor_si128( row4, row1 ); \ + row4 = _mm_roti_epi32(row4, -8); \ + row3 = _mm_add_epi32( row3, row4 ); \ + row2 = _mm_xor_si128( row2, row3 ); \ + row2 = _mm_roti_epi32(row2, -7); + +#define DIAGONALIZE(row1,row2,row3,row4) \ + row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(2,1,0,3) ); \ + row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \ + row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(0,3,2,1) ); + +#define UNDIAGONALIZE(row1,row2,row3,row4) \ + row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(0,3,2,1) ); \ + row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \ + row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(2,1,0,3) ); + +#if defined(HAVE_XOP) +#include "blake2s-load-xop.h" +#elif defined(HAVE_SSE41) +#include "blake2s-load-sse41.h" +#else +#include "blake2s-load-sse2.h" +#endif + +#define ROUND(r) \ + LOAD_MSG_ ##r ##_1(buf1); \ + G1(row1,row2,row3,row4,buf1); \ + LOAD_MSG_ ##r ##_2(buf2); \ + G2(row1,row2,row3,row4,buf2); \ + DIAGONALIZE(row1,row2,row3,row4); \ + LOAD_MSG_ ##r ##_3(buf3); \ + G1(row1,row2,row3,row4,buf3); \ + LOAD_MSG_ ##r ##_4(buf4); \ + G2(row1,row2,row3,row4,buf4); \ + UNDIAGONALIZE(row1,row2,row3,row4); \ + +#endif diff --git a/crypto/blake2s.cpp b/crypto/blake2s.cpp new file mode 100644 index 0000000..8c2397c --- /dev/null +++ b/crypto/blake2s.cpp @@ -0,0 +1,446 @@ +/* +BLAKE2 reference source code package - reference C implementations + +Copyright 2012, Samuel Neves . You may use this under the +terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at +your option. The terms of these licenses can be found at: + +- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 +- OpenSSL license : https://www.openssl.org/source/license.html +- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + +More information about the BLAKE2 hash function can be found at +https://blake2.net. 
+*/ + +#include "stdafx.h" +#include <assert.h> +#include <stdint.h> +#include <string.h> +#include <stdio.h> +#include "tunsafe_types.h" +#include "blake2s.h" +#include "crypto_ops.h" + +void blake2s_compress_sse(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]); + +#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L) +#if defined(_MSC_VER) +#define BLAKE2_INLINE __inline +#elif defined(__GNUC__) +#define BLAKE2_INLINE __inline__ +#else +#define BLAKE2_INLINE +#endif +#else +#define BLAKE2_INLINE inline +#endif + +static BLAKE2_INLINE uint32_t load32(const void *src) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + uint32_t w; + memcpy(&w, src, sizeof w); + return w; +#else + const uint8_t *p = (const uint8_t *)src; + return ((uint32_t)(p[0]) << 0) | + ((uint32_t)(p[1]) << 8) | + ((uint32_t)(p[2]) << 16) | + ((uint32_t)(p[3]) << 24); +#endif +} + +static BLAKE2_INLINE uint16_t load16(const void *src) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + uint16_t w; + memcpy(&w, src, sizeof w); + return w; +#else + const uint8_t *p = (const uint8_t *)src; + return ((uint16_t)(p[0]) << 0) | + ((uint16_t)(p[1]) << 8); +#endif +} + +static BLAKE2_INLINE void store16(void *dst, uint16_t w) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + memcpy(dst, &w, sizeof w); +#else + uint8_t *p = (uint8_t *)dst; + *p++ = (uint8_t)w; w >>= 8; + *p++ = (uint8_t)w; +#endif +} + +static BLAKE2_INLINE void store32(void *dst, uint32_t w) { +#if defined(ARCH_CPU_LITTLE_ENDIAN) + memcpy(dst, &w, sizeof w); +#else + uint8_t *p = (uint8_t *)dst; + p[0] = (uint8_t)(w >> 0); + p[1] = (uint8_t)(w >> 8); + p[2] = (uint8_t)(w >> 16); + p[3] = (uint8_t)(w >> 24); +#endif +} + +static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) { + return (w >> c) | (w << (32 - c)); +} + +static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) { + return (w >> c) | (w << (64 - c)); +} + +static const uint32_t blake2s_IV[8] = { + 0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL, + 0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL +}; + +static const uint8_t blake2s_sigma[10][16] = +{ + {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15} , + {14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3} , + {11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4} , + {7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8} , + {9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13} , + {2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9} , + {12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11} , + {13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10} , + {6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5} , + {10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0} , +}; + +static void blake2s_set_lastnode(blake2s_state *S) { + S->f[1] = (uint32_t)-1; +} + +/* Some helper functions, not necessarily useful */ +static int blake2s_is_lastblock(const blake2s_state *S) { + return S->f[0] != 0; +} + +static void blake2s_set_lastblock(blake2s_state *S) { + if (S->last_node) blake2s_set_lastnode(S); + + S->f[0] = (uint32_t)-1; +} + +static void blake2s_increment_counter(blake2s_state *S, const uint32_t inc) { + S->t[0] += inc; + S->t[1] += (S->t[0] < inc); +} + +void blake2s_init_with_len(blake2s_state *S, size_t outlen, size_t keylen) { + memset(S, 0, sizeof(blake2s_state)); + + blake2s_param *P = &S->param; + size_t i; + + /* Move interval verification here? 
+
+void blake2s_init_with_len(blake2s_state *S, size_t outlen, size_t keylen) {
+  memset(S, 0, sizeof(blake2s_state));
+
+  blake2s_param *P = &S->param;
+  size_t i;
+
+  /* Move interval verification here? */
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  S->outlen = (uint8_t)outlen;
+  P->key_length = (uint8_t)keylen;
+  P->fanout = 1;
+  P->depth = 1;
+  // store32(&P->leaf_length, 0);
+  // store32(&P->node_offset, 0);
+  // store16(&P->xof_length, 0);
+  // P->node_depth = 0;
+  // P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  // memset(P->salt, 0, sizeof(P->salt));
+  // memset(P->personal, 0, sizeof(P->personal));
+  for (i = 0; i < 8; ++i)
+    S->h[i] = load32(&S->h[i]) ^ blake2s_IV[i];
+}
+
+/* Sequential blake2s initialization */
+void blake2s_init(blake2s_state *S, size_t outlen) {
+  blake2s_init_with_len(S, outlen, 0);
+}
+
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen) {
+  uint8_t block[BLAKE2S_BLOCKBYTES];
+
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+  assert(key && keylen && keylen <= BLAKE2S_KEYBYTES);
+
+  blake2s_init_with_len(S, outlen, keylen);
+
+  memset(block, 0, BLAKE2S_BLOCKBYTES);
+  memcpy(block, key, keylen);
+  blake2s_update(S, block, BLAKE2S_BLOCKBYTES);
+  memzero_crypto(block, BLAKE2S_BLOCKBYTES); /* Burn the key from stack */
+}
+
+#define G(r,i,a,b,c,d) \
+  do { \
+    a = a + b + m[blake2s_sigma[r][2*i+0]]; \
+    d = rotr32(d ^ a, 16); \
+    c = c + d; \
+    b = rotr32(b ^ c, 12); \
+    a = a + b + m[blake2s_sigma[r][2*i+1]]; \
+    d = rotr32(d ^ a, 8); \
+    c = c + d; \
+    b = rotr32(b ^ c, 7); \
+  } while(0)
+
+#define ROUND(r) \
+  do { \
+    G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
+    G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
+    G(r,2,v[ 2],v[ 6],v[10],v[14]); \
+    G(r,3,v[ 3],v[ 7],v[11],v[15]); \
+    G(r,4,v[ 0],v[ 5],v[10],v[15]); \
+    G(r,5,v[ 1],v[ 6],v[11],v[12]); \
+    G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
+    G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
+  } while(0)
+
+static void blake2s_compress(blake2s_state *S, const uint8_t in[BLAKE2S_BLOCKBYTES]) {
+  uint32_t m[16];
+  uint32_t v[16];
+  size_t i;
+
+  for (i = 0; i < 16; ++i) {
+    m[i] = load32(in + i * sizeof(m[i]));
+  }
+
+  for (i = 0; i < 8; ++i) {
+    v[i] = S->h[i];
+  }
+
+  v[8] = blake2s_IV[0];
+  v[9] = blake2s_IV[1];
+  v[10] = blake2s_IV[2];
+  v[11] = blake2s_IV[3];
+  v[12] = S->t[0] ^ blake2s_IV[4];
+  v[13] = S->t[1] ^ blake2s_IV[5];
+  v[14] = S->f[0] ^ blake2s_IV[6];
+  v[15] = S->f[1] ^ blake2s_IV[7];
+
+  ROUND(0);
+  ROUND(1);
+  ROUND(2);
+  ROUND(3);
+  ROUND(4);
+  ROUND(5);
+  ROUND(6);
+  ROUND(7);
+  ROUND(8);
+  ROUND(9);
+
+  for (i = 0; i < 8; ++i) {
+    S->h[i] = S->h[i] ^ v[i] ^ v[i + 8];
+  }
+}
+
+#undef G
+#undef ROUND
+
+static inline void blake2s_compress_impl(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]) {
+#if defined(ARCH_CPU_X86_64)
+  blake2s_compress_sse(S, block);
+#else
+  blake2s_compress(S, block);
+#endif
+}
+
+void blake2s_update(blake2s_state *S, const void *pin, size_t inlen) {
+  const unsigned char * in = (const unsigned char *)pin;
+  if (inlen > 0) {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
+    if (inlen > fill) {
+      S->buflen = 0;
+      memcpy(S->buf + left, in, fill); /* Fill buffer */
+      blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+      blake2s_compress_impl(S, S->buf); /* Compress */
+      in += fill; inlen -= fill;
+      while (inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress_impl(S, in);
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy(S->buf + S->buflen, in, inlen);
+    S->buflen += inlen;
+  }
+}
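With init/update/final in place, the streaming API composes in the usual way: feeding the message in arbitrary chunks must give the same digest as hashing it in one shot. A minimal sketch of that property (hypothetical test harness, not part of the commit):

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>
  #include "blake2s.h"

  static void stream_equals_oneshot(const uint8_t *msg, size_t len) {
    uint8_t h1[BLAKE2S_OUTBYTES], h2[BLAKE2S_OUTBYTES];
    blake2s_state S;

    blake2s(h1, BLAKE2S_OUTBYTES, msg, len, NULL, 0);  /* one shot, unkeyed */

    blake2s_init(&S, BLAKE2S_OUTBYTES);                /* same message, 13-byte chunks */
    for (size_t off = 0; off < len; off += 13) {
      size_t n = (len - off < 13) ? (len - off) : 13;
      blake2s_update(&S, msg + off, n);
    }
    blake2s_final(&S, h2, BLAKE2S_OUTBYTES);

    assert(memcmp(h1, h2, BLAKE2S_OUTBYTES) == 0);
  }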
+
+void blake2s_final(blake2s_state *S, void *out, size_t outlen) {
+  size_t i;
+
+  assert(out != NULL && outlen >= S->outlen);
+  assert(!blake2s_is_lastblock(S));
+
+  blake2s_increment_counter(S, (uint32_t)S->buflen);
+  blake2s_set_lastblock(S);
+  memset(S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen); /* Padding */
+  blake2s_compress_impl(S, S->buf);
+
+  for (i = 0; i < 8; ++i) /* Convert the hash words to little-endian, in place */
+    store32(&S->h[i], S->h[i]);
+
+  memcpy(out, S->h, outlen);
+}
+
+SAFEBUFFERS void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen) {
+  blake2s_state S;
+
+  /* Verify parameters */
+  assert(!(NULL == in && inlen > 0));
+  assert(out);
+  assert(!(NULL == key && keylen > 0));
+  assert(!(!outlen || outlen > BLAKE2S_OUTBYTES));
+  assert(!(keylen > BLAKE2S_KEYBYTES));
+
+  if (keylen > 0) {
+    blake2s_init_key(&S, outlen, key, keylen);
+  } else {
+    blake2s_init(&S, outlen);
+  }
+  blake2s_update(&S, (const uint8_t *)in, inlen);
+  blake2s_final(&S, out, outlen);
+}
+
+SAFEBUFFERS void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen) {
+  blake2s_state b2s;
+  uint64_t temp[BLAKE2S_OUTBYTES / 8];
+  uint64_t key_temp[BLAKE2S_BLOCKBYTES / 8] = { 0 };
+
+  if (keylen > BLAKE2S_BLOCKBYTES) {
+    blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+    blake2s_update(&b2s, key, keylen);
+    blake2s_final(&b2s, key_temp, BLAKE2S_OUTBYTES);
+  } else {
+    memcpy(key_temp, key, keylen);
+  }
+
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, in, inlen);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x5c5c5c5c5c5c5c5cull ^ 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, temp, BLAKE2S_OUTBYTES);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  memcpy(out, temp, outlen);
+  memzero_crypto(key_temp, sizeof(key_temp));
+  memzero_crypto(temp, sizeof(temp));
+}
+
+SAFEBUFFERS
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size) {
+  struct {
+    uint8 prk[BLAKE2S_OUTBYTES];
+    uint8 temp[BLAKE2S_OUTBYTES + 1];
+  } t;
+  blake2s_hmac(t.prk, BLAKE2S_OUTBYTES, data, data_size, key, key_size);
+  // first-key = HMAC(secret, 0x1)
+  t.temp[0] = 0x1;
+  blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, 1, t.prk, BLAKE2S_OUTBYTES);
+  memcpy(dst1, t.temp, dst1_size);
+  if (dst2 != NULL) {
+    // second-key = HMAC(secret, first-key || 0x2)
+    t.temp[BLAKE2S_OUTBYTES] = 0x2;
+    blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk, BLAKE2S_OUTBYTES);
+    memcpy(dst2, t.temp, dst2_size);
+    if (dst3 != NULL) {
+      // third-key = HMAC(secret, second-key || 0x3)
+      t.temp[BLAKE2S_OUTBYTES] = 0x3;
+      blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk, BLAKE2S_OUTBYTES);
+      memcpy(dst3, t.temp, dst3_size);
+    }
+  }
+  memzero_crypto(&t, sizeof(t));
+}
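blake2s_hkdf is HKDF built on the BLAKE2s-HMAC above, specialized to the at-most-three output blocks the key schedule needs: prk = HMAC(key, data), then T1 = HMAC(prk, 0x1), T2 = HMAC(prk, T1 || 0x2), T3 = HMAC(prk, T2 || 0x3), per the comments in the body. A representative call deriving two 32-byte keys (variable names here are hypothetical, for illustration only):

  uint8 new_chaining_key[BLAKE2S_OUTBYTES];
  uint8 session_key[BLAKE2S_OUTBYTES];

  /* Passing dst3 == NULL stops the expansion after the second block. */
  blake2s_hkdf(new_chaining_key, BLAKE2S_OUTBYTES,
               session_key, BLAKE2S_OUTBYTES,
               NULL, 0,
               dh_result, 32,                    /* input keying material (hypothetical) */
               chaining_key, BLAKE2S_OUTBYTES);  /* HMAC key (hypothetical) */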
+
+#if defined(SUPERCOP)
+int crypto_hash(unsigned char *out, unsigned char *in, unsigned long long inlen) {
+  blake2s(out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0);
+  return 0;
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main(void) {
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for (i = 0; i < BLAKE2S_KEYBYTES; ++i)
+    key[i] = (uint8_t)i;
+
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i)
+    buf[i] = (uint8_t)i;
+
+  /* Test simple API */
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s(hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES);
+
+    if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API. init/update/final return void in this port,
+     so the reference code's error checks are dropped. */
+  for (step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+
+      blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES);
+
+      while (mlen >= step) {
+        blake2s_update(&S, p, step);
+        mlen -= step;
+        p += step;
+      }
+      blake2s_update(&S, p, mlen);
+      blake2s_final(&S, hash, BLAKE2S_OUTBYTES);
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts("ok");
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
\ No newline at end of file
diff --git a/crypto/blake2s.h b/crypto/blake2s.h
new file mode 100644
index 0000000..aa53209
--- /dev/null
+++ b/crypto/blake2s.h
@@ -0,0 +1,100 @@
+/*
+BLAKE2 reference source code package - reference C implementations
+
+Copyright 2012, Samuel Neves <sneves@dei.uc.pt>. You may use this under the
+terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+your option. The terms of these licenses can be found at:
+
+- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+- OpenSSL license : https://www.openssl.org/source/license.html
+- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
+
+More information about the BLAKE2 hash function can be found at
+https://blake2.net.
+*/
+#ifndef BLAKE2_H
+#define BLAKE2_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include "tunsafe_types.h"
+#if defined(_MSC_VER)
+#define BLAKE2_PACKED(x) __pragma(pack(push, 1)) x __pragma(pack(pop))
+#else
+#define BLAKE2_PACKED(x) x __attribute__((packed))
+#endif
+
+#if defined(__cplusplus)
+//extern "C" {
+#endif
+
+enum blake2s_constant {
+  BLAKE2S_BLOCKBYTES = 64,
+  BLAKE2S_OUTBYTES = 32,
+  BLAKE2S_KEYBYTES = 32,
+  BLAKE2S_SALTBYTES = 8,
+  BLAKE2S_PERSONALBYTES = 8
+};
+
+BLAKE2_PACKED(struct blake2s_param__ {
+  uint8_t digest_length;  /* 1 */
+  uint8_t key_length;     /* 2 */
+  uint8_t fanout;         /* 3 */
+  uint8_t depth;          /* 4 */
+  uint32_t leaf_length;   /* 8 */
+  uint32_t node_offset;   /* 12 */
+  uint16_t xof_length;    /* 14 */
+  uint8_t node_depth;     /* 15 */
+  uint8_t inner_length;   /* 16 */
+  /* uint8_t reserved[0]; */
+  uint32_t salt[BLAKE2S_SALTBYTES / 4];         /* 24 */
+  uint32_t personal[BLAKE2S_PERSONALBYTES / 4]; /* 32 */
+});
+
+typedef struct blake2s_param__ blake2s_param;
+
+/* Padded structs result in a compile-time error */
+enum {
+  BLAKE2_DUMMY_1 = 1 / (sizeof(blake2s_param) == BLAKE2S_OUTBYTES ? 1 : 0),
+};
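The BLAKE2_DUMMY_1 trick turns a padding mistake into a compile-time division by zero: the packed parameter block must be exactly 32 bytes, because blake2s_state overlays it on the h[8] chaining value through a union. Under C++11 the same invariant could be stated directly (equivalent sketch, not part of the commit):

  static_assert(sizeof(blake2s_param) == BLAKE2S_OUTBYTES,
                "blake2s_param must be exactly 32 bytes or the h[]/param union breaks");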
+
+typedef struct blake2s_state__ {
+  union {
+    uint32_t h[8];
+    blake2s_param param;
+  };
+  uint32_t t[2];
+  uint32_t f[2];
+  uint8_t buf[BLAKE2S_BLOCKBYTES];
+  size_t buflen;
+  size_t outlen;
+  uint8_t last_node;
+} blake2s_state;
+
+/* Streaming API */
+void blake2s_init(blake2s_state *S, size_t outlen);
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen);
+void blake2s_init_param(blake2s_state *S, const blake2s_param *P);
+void blake2s_update(blake2s_state *S, const void *in, size_t inlen);
+void blake2s_final(blake2s_state *S, void *out, size_t outlen);
+
+/* Simple API */
+void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen);
+
+void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen);
+
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size);
+
+#if defined(__cplusplus)
+//}
+#endif
+
+#endif
\ No newline at end of file
diff --git a/crypto/blake2s_sse.cpp b/crypto/blake2s_sse.cpp
new file mode 100644
index 0000000..2527f24
--- /dev/null
+++ b/crypto/blake2s_sse.cpp
@@ -0,0 +1,399 @@
+/*
+  BLAKE2 reference source code package - optimized C implementations
+
+  Copyright 2012, Samuel Neves <sneves@dei.uc.pt>. You may use this under the
+  terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+  your option. The terms of these licenses can be found at:
+
+  - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+  - OpenSSL license : https://www.openssl.org/source/license.html
+  - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
+
+  More information about the BLAKE2 hash function can be found at
+  https://blake2.net.
+*/
+#include "stdafx.h"
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "blake2s.h"
+#include "crypto_ops.h"
+
+#include <emmintrin.h>
+#if defined(HAVE_SSSE3)
+#include <tmmintrin.h>
+#endif
+#if defined(HAVE_SSE41)
+#include <smmintrin.h>
+#endif
+#if defined(HAVE_AVX)
+#include <immintrin.h>
+#endif
+#if defined(HAVE_XOP)
+#include <x86intrin.h>
+#endif
+
+#include "blake2s-round.h"
+
+#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
+#if defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+#elif defined(__GNUC__)
+#define BLAKE2_INLINE __inline__
+#else
+#define BLAKE2_INLINE
+#endif
+#else
+#define BLAKE2_INLINE inline
+#endif
+
+static BLAKE2_INLINE uint32_t load32(const void *src) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  uint32_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = (const uint8_t *)src;
+  return ((uint32_t)(p[0]) << 0) |
+         ((uint32_t)(p[1]) << 8) |
+         ((uint32_t)(p[2]) << 16) |
+         ((uint32_t)(p[3]) << 24);
+#endif
+}
+
+static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = (uint8_t *)dst;
+  p[0] = (uint8_t)(w >> 0);
+  p[1] = (uint8_t)(w >> 8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+#endif
+}
+
+static const uint32_t blake2s_IV[8] =
+{
+  0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
+  0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
+};
+
+/* Some helper functions */
+static void blake2s_set_lastnode( blake2s_state *S )
+{
+  S->f[1] = (uint32_t)-1;
+}
+
+static int blake2s_is_lastblock( const blake2s_state *S )
+{
+  return S->f[0] != 0;
+}
+
+static void blake2s_set_lastblock( blake2s_state *S )
+{
+  if( S->last_node ) blake2s_set_lastnode( S );
+
+  S->f[0] = (uint32_t)-1;
+}
+
+static void blake2s_increment_counter( blake2s_state *S, const uint32_t inc )
+{
+  uint64_t t = ( ( uint64_t )S->t[1] << 32 ) | S->t[0];
+  t += inc;
+  S->t[0] = ( uint32_t )( t >> 0 );
+  S->t[1] = ( uint32_t )( t >> 32 );
+}
+
+/* init2 xors IV with input parameter block */
+#if 0
+void blake2s_init_param( blake2s_state *S, const blake2s_param *P )
+{
+  size_t i;
+  /*blake2s_init0( S ); */
+  const uint8_t * v = ( const uint8_t * )( blake2s_IV );
+  const uint8_t * p = ( const uint8_t * )( P );
+  uint8_t * h = ( uint8_t * )( S->h );
+  /* IV XOR ParamBlock */
+  memset( S, 0, sizeof( blake2s_state ) );
+
+  for( i = 0; i < BLAKE2S_OUTBYTES; ++i ) h[i] = v[i] ^ p[i];
+
+  S->outlen = P->digest_length;
+}
+
+/* Some sort of default parameter block initialization, for sequential blake2s */
+void blake2s_init( blake2s_state *S, size_t outlen )
+{
+  blake2s_param P[1];
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length = 0;
+  P->fanout = 1;
+  P->depth = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth = 0;
+  P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt, 0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  blake2s_init_param( S, P );
+}
+
+int blake2s_init_key( blake2s_state *S, size_t outlen, const void *key, size_t keylen )
+{
+  blake2s_param P[1];
+
+  /* Move interval verification here? */
+  if ( ( !outlen ) || ( outlen > BLAKE2S_OUTBYTES ) ) return -1;
+
+  if ( ( !key ) || ( !keylen ) || keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length = (uint8_t)keylen;
+  P->fanout = 1;
+  P->depth = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth = 0;
+  P->inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt, 0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  if( blake2s_init_param( S, P ) < 0 )
+    return -1;
+
+  {
+    uint8_t block[BLAKE2S_BLOCKBYTES];
+    memset( block, 0, BLAKE2S_BLOCKBYTES );
+    memcpy( block, key, keylen );
+    blake2s_update( S, block, BLAKE2S_BLOCKBYTES );
+    memzero_crypto( block, BLAKE2S_BLOCKBYTES ); /* Burn the key from stack */
+  }
+  return 0;
+}
+#endif
+
+void blake2s_compress_sse( blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES] )
+{
+  __m128i row1, row2, row3, row4;
+  __m128i buf1, buf2, buf3, buf4;
+#if defined(HAVE_SSE41)
+  __m128i t0, t1;
+#if !defined(HAVE_XOP)
+  __m128i t2;
+#endif
+#endif
+  __m128i ff0, ff1;
+#if defined(HAVE_SSSE3) && !defined(HAVE_XOP)
+  const __m128i r8 = _mm_set_epi8( 12, 15, 14, 13, 8, 11, 10, 9, 4, 7, 6, 5, 0, 3, 2, 1 );
+  const __m128i r16 = _mm_set_epi8( 13, 12, 15, 14, 9, 8, 11, 10, 5, 4, 7, 6, 1, 0, 3, 2 );
+#endif
+#if defined(HAVE_SSE41)
+  const __m128i m0 = LOADU( block + 00 );
+  const __m128i m1 = LOADU( block + 16 );
+  const __m128i m2 = LOADU( block + 32 );
+  const __m128i m3 = LOADU( block + 48 );
+#else
+  const uint32_t m0 = load32(block + 0 * sizeof(uint32_t));
+  const uint32_t m1 = load32(block + 1 * sizeof(uint32_t));
+  const uint32_t m2 = load32(block + 2 * sizeof(uint32_t));
+  const uint32_t m3 = load32(block + 3 * sizeof(uint32_t));
+  const uint32_t m4 = load32(block + 4 * sizeof(uint32_t));
+  const uint32_t m5 = load32(block + 5 * sizeof(uint32_t));
+  const uint32_t m6 = load32(block + 6 * sizeof(uint32_t));
+  const uint32_t m7 = load32(block + 7 * sizeof(uint32_t));
+  const uint32_t m8 = load32(block + 8 * sizeof(uint32_t));
+  const uint32_t m9 = load32(block + 9 * sizeof(uint32_t));
+  const uint32_t m10 = load32(block + 10 * sizeof(uint32_t));
+  const uint32_t m11 = load32(block + 11 * sizeof(uint32_t));
+  const uint32_t m12 = load32(block + 12 * sizeof(uint32_t));
+  const uint32_t m13 = load32(block + 13 * sizeof(uint32_t));
+  const uint32_t m14 = load32(block + 14 * sizeof(uint32_t));
+  const uint32_t m15 = load32(block + 15 * sizeof(uint32_t));
+#endif
+  row1 = ff0 = LOADU( &S->h[0] );
+  row2 = ff1 = LOADU( &S->h[4] );
+  row3 = _mm_loadu_si128( (__m128i const *)&blake2s_IV[0] );
+  row4 = _mm_xor_si128( _mm_loadu_si128( (__m128i const *)&blake2s_IV[4] ), LOADU( &S->t[0] ) );
+  ROUND( 0 );
+  ROUND( 1 );
+  ROUND( 2 );
+  ROUND( 3 );
+  ROUND( 4 );
+  ROUND( 5 );
+  ROUND( 6 );
+  ROUND( 7 );
+  ROUND( 8 );
+  ROUND( 9 );
+  STOREU( &S->h[0], _mm_xor_si128( ff0, _mm_xor_si128( row1, row3 ) ) );
+  STOREU( &S->h[4], _mm_xor_si128( ff1, _mm_xor_si128( row2, row4 ) ) );
+}
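One subtlety in blake2s_compress_sse: the single LOADU( &S->t[0] ) picks up t[0], t[1], f[0] and f[1] in one 16-byte load, which works only because those four words sit adjacent in blake2s_state; XORing them against IV[4..7] builds the whole bottom row of the state at once. The scalar equivalent, as it appears in blake2s.cpp (shown here only for comparison):

  uint32_t v12 = S->t[0] ^ blake2s_IV[4];
  uint32_t v13 = S->t[1] ^ blake2s_IV[5];
  uint32_t v14 = S->f[0] ^ blake2s_IV[6];
  uint32_t v15 = S->f[1] ^ blake2s_IV[7];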
+
+#if 0
+int blake2s_update( blake2s_state *S, const void *pin, size_t inlen )
+{
+  const unsigned char * in = (const unsigned char *)pin;
+  if( inlen > 0 )
+  {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
+    if( inlen > fill )
+    {
+      S->buflen = 0;
+      memcpy( S->buf + left, in, fill ); /* Fill buffer */
+      blake2s_increment_counter( S, BLAKE2S_BLOCKBYTES );
+      blake2s_compress( S, S->buf ); /* Compress */
+      in += fill; inlen -= fill;
+      while(inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress( S, in );
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy( S->buf + S->buflen, in, inlen );
+    S->buflen += inlen;
+  }
+  return 0;
+}
+
+int blake2s_final( blake2s_state *S, void *out, size_t outlen )
+{
+  uint8_t buffer[BLAKE2S_OUTBYTES] = {0};
+  size_t i;
+
+  if( out == NULL || outlen < S->outlen )
+    return -1;
+
+  if( blake2s_is_lastblock( S ) )
+    return -1;
+
+  blake2s_increment_counter( S, (uint32_t)S->buflen );
+  blake2s_set_lastblock( S );
+  memset( S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen ); /* Padding */
+  blake2s_compress( S, S->buf );
+
+  for( i = 0; i < 8; ++i ) /* Output full hash to temp buffer */
+    store32( buffer + sizeof( S->h[i] ) * i, S->h[i] );
+
+  memcpy( out, buffer, S->outlen );
+  memzero_crypto( buffer, sizeof(buffer) );
+  return 0;
+}
+
+/* inlen, at least, should be uint64_t. Others can be size_t. */
+int blake2s( void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen )
+{
+  blake2s_state S[1];
+
+  /* Verify parameters */
+  if ( NULL == in && inlen > 0 ) return -1;
+
+  if ( NULL == out ) return -1;
+
+  if ( NULL == key && keylen > 0) return -1;
+
+  if( !outlen || outlen > BLAKE2S_OUTBYTES ) return -1;
+
+  if( keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  if( keylen > 0 )
+  {
+    if( blake2s_init_key( S, outlen, key, keylen ) < 0 ) return -1;
+  }
+  else
+  {
+    if( blake2s_init( S, outlen ) < 0 ) return -1;
+  }
+
+  blake2s_update( S, ( const uint8_t * )in, inlen );
+  blake2s_final( S, out, outlen );
+  return 0;
+}
+#endif
+
+#if defined(SUPERCOP)
+int crypto_hash( unsigned char *out, unsigned char *in, unsigned long long inlen )
+{
+  blake2s( out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0 );
+  return 0;
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main( void )
+{
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for( i = 0; i < BLAKE2S_KEYBYTES; ++i )
+    key[i] = ( uint8_t )i;
+
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+    buf[i] = ( uint8_t )i;
+
+  /* Test simple API */
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+  {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s( hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES );
+
+    if( 0 != memcmp( hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES ) )
+    {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API. The streaming functions return void in this port,
+     so the reference code's error checks are dropped. */
+  for(step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+
+      blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES);
+
+      while (mlen >= step) {
+        blake2s_update(&S, p, step);
+        mlen -= step;
+        p += step;
+      }
+      blake2s_update(&S, p, mlen);
+      blake2s_final(&S, hash, BLAKE2S_OUTBYTES);
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts( "ok" );
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
diff --git a/crypto/chacha20_x64.asm b/crypto/chacha20_x64.asm
new file mode 100644
index 0000000..3b4b4db
--- /dev/null
+++ b/crypto/chacha20_x64.asm
@@ -0,0 +1,3011 @@
+default rel
+%define XMMWORD
+%define YMMWORD
+%define ZMMWORD
+section .text code align=64
+
+
+ALIGN 64
+$L$zero:
+  DD 0,0,0,0
+$L$one:
+  DD 1,0,0,0
+$L$inc:
+  DD 0,1,2,3
+$L$four:
+  DD 4,4,4,4
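; Annotation (editorial sketch, not emitted by the perlasm generator): the
; tables in this block drive the vector ChaCha20 paths. $L$rot16 and $L$rot24
; are pshufb/vpshufb masks that realize the quarter-round's rotate-left-16
; and rotate-left-8 steps as byte shuffles: per 4-byte little-endian lane,
; left-rotate by 16 is the byte order 2,3,0,1 and left-rotate by 8 is
; 3,0,1,2, matching the DB patterns below. $L$inc, $L$incy and $L$incz hold
; per-lane block-counter offsets for the 4-, 8- and 16-way interleaved
; loops, and $L$sigma is the ChaCha20 constant "expand 32-byte k" as bytes.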
+$L$incy: + DD 0,2,4,6,1,3,5,7 +$L$eight: + DD 8,8,8,8,8,8,8,8 +$L$rot16: +DB 0x2,0x3,0x0,0x1,0x6,0x7,0x4,0x5,0xa,0xb,0x8,0x9,0xe,0xf,0xc,0xd +$L$rot24: +DB 0x3,0x0,0x1,0x2,0x7,0x4,0x5,0x6,0xb,0x8,0x9,0xa,0xf,0xc,0xd,0xe +$L$sigma: +DB 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107 +DB 0 +ALIGN 64 +$L$zeroz: + DD 0,0,0,0,1,0,0,0,2,0,0,0,3,0,0,0 +$L$fourz: + DD 4,0,0,0,4,0,0,0,4,0,0,0,4,0,0,0 +$L$incz: + DD 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +$L$sixteen: + DD 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +ALIGN 64 +$L$twoy: + DD 2,0,0,0,2,0,0,0 + +global hchacha20_ssse3 + +ALIGN 32 +hchacha20_ssse3: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_hchacha20_ssse3: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$hchacha20_ssse3: + movdqa xmm0,XMMWORD[$L$sigma] + movdqu xmm1,XMMWORD[rdx] + movdqu xmm2,XMMWORD[16+rdx] + movdqu xmm3,XMMWORD[rsi] + movdqa xmm6,XMMWORD[$L$rot16] + movdqa xmm7,XMMWORD[$L$rot24] + mov r8,10 +ALIGN 32 +$L$oop_hssse3: + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,57 + pshufd xmm3,xmm3,147 + nop + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,147 + pshufd xmm3,xmm3,57 + dec r8 + jnz NEAR $L$oop_hssse3 + movdqu XMMWORD[rdi],xmm0 + movdqu XMMWORD[16+rdi],xmm3 + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_hchacha20_ssse3: +global chacha20_ssse3 + +ALIGN 32 +chacha20_ssse3: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_ssse3: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_ssse3: + mov r9,rsp + + cmp rdx,128 + ja NEAR $L$chacha20_4x + +$L$do_sse3_after_all: + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$ssse3_body: + movdqa xmm0,XMMWORD[$L$sigma] + movdqu xmm1,XMMWORD[rcx] + movdqu xmm2,XMMWORD[16+rcx] + movdqu xmm3,XMMWORD[r8] + movdqa xmm6,XMMWORD[$L$rot16] + movdqa xmm7,XMMWORD[$L$rot24] + + movdqa XMMWORD[rsp],xmm0 + movdqa XMMWORD[16+rsp],xmm1 + movdqa XMMWORD[32+rsp],xmm2 + movdqa XMMWORD[48+rsp],xmm3 + mov r8,10 + jmp NEAR $L$oop_ssse3 + +ALIGN 32 +$L$oop_outer_ssse3: + movdqa xmm3,XMMWORD[$L$one] + movdqa xmm0,XMMWORD[rsp] + movdqa xmm1,XMMWORD[16+rsp] + movdqa xmm2,XMMWORD[32+rsp] + paddd xmm3,XMMWORD[48+rsp] + mov r8,10 + movdqa XMMWORD[48+rsp],xmm3 + jmp NEAR $L$oop_ssse3 + +ALIGN 32 +$L$oop_ssse3: + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,57 + pshufd xmm3,xmm3,147 + nop + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm6 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa 
xmm4,xmm1 + psrld xmm1,20 + pslld xmm4,12 + por xmm1,xmm4 + paddd xmm0,xmm1 + pxor xmm3,xmm0 + pshufb xmm3,xmm7 + paddd xmm2,xmm3 + pxor xmm1,xmm2 + movdqa xmm4,xmm1 + psrld xmm1,25 + pslld xmm4,7 + por xmm1,xmm4 + pshufd xmm2,xmm2,78 + pshufd xmm1,xmm1,147 + pshufd xmm3,xmm3,57 + dec r8 + jnz NEAR $L$oop_ssse3 + paddd xmm0,XMMWORD[rsp] + paddd xmm1,XMMWORD[16+rsp] + paddd xmm2,XMMWORD[32+rsp] + paddd xmm3,XMMWORD[48+rsp] + + cmp rdx,64 + jb NEAR $L$tail_ssse3 + + movdqu xmm4,XMMWORD[rsi] + movdqu xmm5,XMMWORD[16+rsi] + pxor xmm0,xmm4 + movdqu xmm4,XMMWORD[32+rsi] + pxor xmm1,xmm5 + movdqu xmm5,XMMWORD[48+rsi] + lea rsi,[64+rsi] + pxor xmm2,xmm4 + pxor xmm3,xmm5 + + movdqu XMMWORD[rdi],xmm0 + movdqu XMMWORD[16+rdi],xmm1 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm3 + lea rdi,[64+rdi] + + sub rdx,64 + jnz NEAR $L$oop_outer_ssse3 + + jmp NEAR $L$done_ssse3 + +ALIGN 16 +$L$tail_ssse3: + movdqa XMMWORD[rsp],xmm0 + movdqa XMMWORD[16+rsp],xmm1 + movdqa XMMWORD[32+rsp],xmm2 + movdqa XMMWORD[48+rsp],xmm3 + xor r8,r8 + +$L$oop_tail_ssse3: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_ssse3 + +$L$done_ssse3: + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$ssse3_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_ssse3: +global chacha20_4x + +ALIGN 32 +chacha20_4x: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_4x: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_4x: + mov r9,rsp + + + + + + + + + + + + +$L$proceed4x: + sub rsp,0x140+168 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$4x_body: + movdqa xmm11,XMMWORD[$L$sigma] + movdqu xmm15,XMMWORD[rcx] + movdqu xmm7,XMMWORD[16+rcx] + movdqu xmm3,XMMWORD[r8] + lea rcx,[256+rsp] + lea r10,[$L$rot16] + lea r11,[$L$rot24] + + pshufd xmm8,xmm11,0x00 + pshufd xmm9,xmm11,0x55 + movdqa XMMWORD[64+rsp],xmm8 + pshufd xmm10,xmm11,0xaa + movdqa XMMWORD[80+rsp],xmm9 + pshufd xmm11,xmm11,0xff + movdqa XMMWORD[96+rsp],xmm10 + movdqa XMMWORD[112+rsp],xmm11 + + pshufd xmm12,xmm15,0x00 + pshufd xmm13,xmm15,0x55 + movdqa XMMWORD[(128-256)+rcx],xmm12 + pshufd xmm14,xmm15,0xaa + movdqa XMMWORD[(144-256)+rcx],xmm13 + pshufd xmm15,xmm15,0xff + movdqa XMMWORD[(160-256)+rcx],xmm14 + movdqa XMMWORD[(176-256)+rcx],xmm15 + + pshufd xmm4,xmm7,0x00 + pshufd xmm5,xmm7,0x55 + movdqa XMMWORD[(192-256)+rcx],xmm4 + pshufd xmm6,xmm7,0xaa + movdqa XMMWORD[(208-256)+rcx],xmm5 + pshufd xmm7,xmm7,0xff + movdqa XMMWORD[(224-256)+rcx],xmm6 + movdqa XMMWORD[(240-256)+rcx],xmm7 + + pshufd xmm0,xmm3,0x00 + pshufd xmm1,xmm3,0x55 + paddd xmm0,XMMWORD[$L$inc] + pshufd xmm2,xmm3,0xaa + movdqa XMMWORD[(272-256)+rcx],xmm1 + pshufd xmm3,xmm3,0xff + movdqa XMMWORD[(288-256)+rcx],xmm2 + movdqa XMMWORD[(304-256)+rcx],xmm3 + + jmp NEAR $L$oop_enter4x + +ALIGN 32 +$L$oop_outer4x: + movdqa xmm8,XMMWORD[64+rsp] + movdqa xmm9,XMMWORD[80+rsp] + movdqa xmm10,XMMWORD[96+rsp] + movdqa xmm11,XMMWORD[112+rsp] + movdqa xmm12,XMMWORD[((128-256))+rcx] + movdqa xmm13,XMMWORD[((144-256))+rcx] + movdqa 
xmm14,XMMWORD[((160-256))+rcx] + movdqa xmm15,XMMWORD[((176-256))+rcx] + movdqa xmm4,XMMWORD[((192-256))+rcx] + movdqa xmm5,XMMWORD[((208-256))+rcx] + movdqa xmm6,XMMWORD[((224-256))+rcx] + movdqa xmm7,XMMWORD[((240-256))+rcx] + movdqa xmm0,XMMWORD[((256-256))+rcx] + movdqa xmm1,XMMWORD[((272-256))+rcx] + movdqa xmm2,XMMWORD[((288-256))+rcx] + movdqa xmm3,XMMWORD[((304-256))+rcx] + paddd xmm0,XMMWORD[$L$four] + +$L$oop_enter4x: + movdqa XMMWORD[32+rsp],xmm6 + movdqa XMMWORD[48+rsp],xmm7 + movdqa xmm7,XMMWORD[r10] + mov eax,10 + movdqa XMMWORD[(256-256)+rcx],xmm0 + jmp NEAR $L$oop4x + +ALIGN 32 +$L$oop4x: + paddd xmm8,xmm12 + paddd xmm9,xmm13 + pxor xmm0,xmm8 + pxor xmm1,xmm9 + pshufb xmm0,xmm7 + pshufb xmm1,xmm7 + paddd xmm4,xmm0 + paddd xmm5,xmm1 + pxor xmm12,xmm4 + pxor xmm13,xmm5 + movdqa xmm6,xmm12 + pslld xmm12,12 + psrld xmm6,20 + movdqa xmm7,xmm13 + pslld xmm13,12 + por xmm12,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm13,xmm7 + paddd xmm8,xmm12 + paddd xmm9,xmm13 + pxor xmm0,xmm8 + pxor xmm1,xmm9 + pshufb xmm0,xmm6 + pshufb xmm1,xmm6 + paddd xmm4,xmm0 + paddd xmm5,xmm1 + pxor xmm12,xmm4 + pxor xmm13,xmm5 + movdqa xmm7,xmm12 + pslld xmm12,7 + psrld xmm7,25 + movdqa xmm6,xmm13 + pslld xmm13,7 + por xmm12,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm13,xmm6 + movdqa XMMWORD[rsp],xmm4 + movdqa XMMWORD[16+rsp],xmm5 + movdqa xmm4,XMMWORD[32+rsp] + movdqa xmm5,XMMWORD[48+rsp] + paddd xmm10,xmm14 + paddd xmm11,xmm15 + pxor xmm2,xmm10 + pxor xmm3,xmm11 + pshufb xmm2,xmm7 + pshufb xmm3,xmm7 + paddd xmm4,xmm2 + paddd xmm5,xmm3 + pxor xmm14,xmm4 + pxor xmm15,xmm5 + movdqa xmm6,xmm14 + pslld xmm14,12 + psrld xmm6,20 + movdqa xmm7,xmm15 + pslld xmm15,12 + por xmm14,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm15,xmm7 + paddd xmm10,xmm14 + paddd xmm11,xmm15 + pxor xmm2,xmm10 + pxor xmm3,xmm11 + pshufb xmm2,xmm6 + pshufb xmm3,xmm6 + paddd xmm4,xmm2 + paddd xmm5,xmm3 + pxor xmm14,xmm4 + pxor xmm15,xmm5 + movdqa xmm7,xmm14 + pslld xmm14,7 + psrld xmm7,25 + movdqa xmm6,xmm15 + pslld xmm15,7 + por xmm14,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm15,xmm6 + paddd xmm8,xmm13 + paddd xmm9,xmm14 + pxor xmm3,xmm8 + pxor xmm0,xmm9 + pshufb xmm3,xmm7 + pshufb xmm0,xmm7 + paddd xmm4,xmm3 + paddd xmm5,xmm0 + pxor xmm13,xmm4 + pxor xmm14,xmm5 + movdqa xmm6,xmm13 + pslld xmm13,12 + psrld xmm6,20 + movdqa xmm7,xmm14 + pslld xmm14,12 + por xmm13,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm14,xmm7 + paddd xmm8,xmm13 + paddd xmm9,xmm14 + pxor xmm3,xmm8 + pxor xmm0,xmm9 + pshufb xmm3,xmm6 + pshufb xmm0,xmm6 + paddd xmm4,xmm3 + paddd xmm5,xmm0 + pxor xmm13,xmm4 + pxor xmm14,xmm5 + movdqa xmm7,xmm13 + pslld xmm13,7 + psrld xmm7,25 + movdqa xmm6,xmm14 + pslld xmm14,7 + por xmm13,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm14,xmm6 + movdqa XMMWORD[32+rsp],xmm4 + movdqa XMMWORD[48+rsp],xmm5 + movdqa xmm4,XMMWORD[rsp] + movdqa xmm5,XMMWORD[16+rsp] + paddd xmm10,xmm15 + paddd xmm11,xmm12 + pxor xmm1,xmm10 + pxor xmm2,xmm11 + pshufb xmm1,xmm7 + pshufb xmm2,xmm7 + paddd xmm4,xmm1 + paddd xmm5,xmm2 + pxor xmm15,xmm4 + pxor xmm12,xmm5 + movdqa xmm6,xmm15 + pslld xmm15,12 + psrld xmm6,20 + movdqa xmm7,xmm12 + pslld xmm12,12 + por xmm15,xmm6 + psrld xmm7,20 + movdqa xmm6,XMMWORD[r11] + por xmm12,xmm7 + paddd xmm10,xmm15 + paddd xmm11,xmm12 + pxor xmm1,xmm10 + pxor xmm2,xmm11 + pshufb xmm1,xmm6 + pshufb xmm2,xmm6 + paddd xmm4,xmm1 + paddd xmm5,xmm2 + pxor xmm15,xmm4 + pxor xmm12,xmm5 + movdqa xmm7,xmm15 + pslld xmm15,7 + psrld xmm7,25 + movdqa xmm6,xmm12 + 
pslld xmm12,7 + por xmm15,xmm7 + psrld xmm6,25 + movdqa xmm7,XMMWORD[r10] + por xmm12,xmm6 + dec eax + jnz NEAR $L$oop4x + + paddd xmm8,XMMWORD[64+rsp] + paddd xmm9,XMMWORD[80+rsp] + paddd xmm10,XMMWORD[96+rsp] + paddd xmm11,XMMWORD[112+rsp] + + movdqa xmm6,xmm8 + punpckldq xmm8,xmm9 + movdqa xmm7,xmm10 + punpckldq xmm10,xmm11 + punpckhdq xmm6,xmm9 + punpckhdq xmm7,xmm11 + movdqa xmm9,xmm8 + punpcklqdq xmm8,xmm10 + movdqa xmm11,xmm6 + punpcklqdq xmm6,xmm7 + punpckhqdq xmm9,xmm10 + punpckhqdq xmm11,xmm7 + paddd xmm12,XMMWORD[((128-256))+rcx] + paddd xmm13,XMMWORD[((144-256))+rcx] + paddd xmm14,XMMWORD[((160-256))+rcx] + paddd xmm15,XMMWORD[((176-256))+rcx] + + movdqa XMMWORD[rsp],xmm8 + movdqa XMMWORD[16+rsp],xmm9 + movdqa xmm8,XMMWORD[32+rsp] + movdqa xmm9,XMMWORD[48+rsp] + + movdqa xmm10,xmm12 + punpckldq xmm12,xmm13 + movdqa xmm7,xmm14 + punpckldq xmm14,xmm15 + punpckhdq xmm10,xmm13 + punpckhdq xmm7,xmm15 + movdqa xmm13,xmm12 + punpcklqdq xmm12,xmm14 + movdqa xmm15,xmm10 + punpcklqdq xmm10,xmm7 + punpckhqdq xmm13,xmm14 + punpckhqdq xmm15,xmm7 + paddd xmm4,XMMWORD[((192-256))+rcx] + paddd xmm5,XMMWORD[((208-256))+rcx] + paddd xmm8,XMMWORD[((224-256))+rcx] + paddd xmm9,XMMWORD[((240-256))+rcx] + + movdqa XMMWORD[32+rsp],xmm6 + movdqa XMMWORD[48+rsp],xmm11 + + movdqa xmm14,xmm4 + punpckldq xmm4,xmm5 + movdqa xmm7,xmm8 + punpckldq xmm8,xmm9 + punpckhdq xmm14,xmm5 + punpckhdq xmm7,xmm9 + movdqa xmm5,xmm4 + punpcklqdq xmm4,xmm8 + movdqa xmm9,xmm14 + punpcklqdq xmm14,xmm7 + punpckhqdq xmm5,xmm8 + punpckhqdq xmm9,xmm7 + paddd xmm0,XMMWORD[((256-256))+rcx] + paddd xmm1,XMMWORD[((272-256))+rcx] + paddd xmm2,XMMWORD[((288-256))+rcx] + paddd xmm3,XMMWORD[((304-256))+rcx] + + movdqa xmm8,xmm0 + punpckldq xmm0,xmm1 + movdqa xmm7,xmm2 + punpckldq xmm2,xmm3 + punpckhdq xmm8,xmm1 + punpckhdq xmm7,xmm3 + movdqa xmm1,xmm0 + punpcklqdq xmm0,xmm2 + movdqa xmm3,xmm8 + punpcklqdq xmm8,xmm7 + punpckhqdq xmm1,xmm2 + punpckhqdq xmm3,xmm7 + cmp rdx,64*4 + jb NEAR $L$tail4x + + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + + movdqu XMMWORD[64+rdi],xmm6 + movdqu xmm6,XMMWORD[rsi] + movdqu XMMWORD[80+rdi],xmm11 + movdqu xmm11,XMMWORD[16+rsi] + movdqu XMMWORD[96+rdi],xmm2 + movdqu xmm2,XMMWORD[32+rsi] + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[32+rsp] + pxor xmm11,xmm10 + pxor xmm2,xmm14 + pxor xmm7,xmm8 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[48+rsp] + pxor xmm11,xmm15 + pxor xmm2,xmm9 + pxor xmm7,xmm3 + movdqu XMMWORD[64+rdi],xmm6 + movdqu XMMWORD[80+rdi],xmm11 + movdqu XMMWORD[96+rdi],xmm2 + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + + sub rdx,64*4 + jnz NEAR $L$oop_outer4x + + jmp NEAR $L$done4x + +$L$tail4x: + cmp rdx,192 + jae NEAR $L$192_or_more4x + cmp rdx,128 + jae NEAR $L$128_or_more4x + cmp rdx,64 + jae NEAR 
$L$64_or_more4x + + + xor r10,r10 + + movdqa XMMWORD[16+rsp],xmm12 + movdqa XMMWORD[32+rsp],xmm4 + movdqa XMMWORD[48+rsp],xmm0 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$64_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + movdqu XMMWORD[rdi],xmm6 + movdqu XMMWORD[16+rdi],xmm11 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[16+rsp] + lea rsi,[64+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm13 + lea rdi,[64+rdi] + movdqa XMMWORD[32+rsp],xmm5 + sub rdx,64 + movdqa XMMWORD[48+rsp],xmm1 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$128_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + movdqu XMMWORD[64+rdi],xmm6 + movdqu XMMWORD[80+rdi],xmm11 + movdqu XMMWORD[96+rdi],xmm2 + movdqu XMMWORD[112+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[32+rsp] + lea rsi,[128+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm10 + lea rdi,[128+rdi] + movdqa XMMWORD[32+rsp],xmm14 + sub rdx,128 + movdqa XMMWORD[48+rsp],xmm8 + jmp NEAR $L$oop_tail4x + +ALIGN 32 +$L$192_or_more4x: + movdqu xmm6,XMMWORD[rsi] + movdqu xmm11,XMMWORD[16+rsi] + movdqu xmm2,XMMWORD[32+rsi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[rsp] + pxor xmm11,xmm12 + pxor xmm2,xmm4 + pxor xmm7,xmm0 + + movdqu XMMWORD[rdi],xmm6 + movdqu xmm6,XMMWORD[64+rsi] + movdqu XMMWORD[16+rdi],xmm11 + movdqu xmm11,XMMWORD[80+rsi] + movdqu XMMWORD[32+rdi],xmm2 + movdqu xmm2,XMMWORD[96+rsi] + movdqu XMMWORD[48+rdi],xmm7 + movdqu xmm7,XMMWORD[112+rsi] + lea rsi,[128+rsi] + pxor xmm6,XMMWORD[16+rsp] + pxor xmm11,xmm13 + pxor xmm2,xmm5 + pxor xmm7,xmm1 + + movdqu XMMWORD[64+rdi],xmm6 + movdqu xmm6,XMMWORD[rsi] + movdqu XMMWORD[80+rdi],xmm11 + movdqu xmm11,XMMWORD[16+rsi] + movdqu XMMWORD[96+rdi],xmm2 + movdqu xmm2,XMMWORD[32+rsi] + movdqu XMMWORD[112+rdi],xmm7 + lea rdi,[128+rdi] + movdqu xmm7,XMMWORD[48+rsi] + pxor xmm6,XMMWORD[32+rsp] + pxor xmm11,xmm10 + pxor xmm2,xmm14 + pxor xmm7,xmm8 + movdqu XMMWORD[rdi],xmm6 + movdqu XMMWORD[16+rdi],xmm11 + movdqu XMMWORD[32+rdi],xmm2 + movdqu XMMWORD[48+rdi],xmm7 + je NEAR $L$done4x + + movdqa xmm6,XMMWORD[48+rsp] + lea rsi,[64+rsi] + xor r10,r10 + movdqa XMMWORD[rsp],xmm6 + movdqa XMMWORD[16+rsp],xmm15 + lea rdi,[64+rdi] + movdqa XMMWORD[32+rsp],xmm9 + sub rdx,192 + movdqa XMMWORD[48+rsp],xmm3 + +$L$oop_tail4x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail4x + +$L$done4x: + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$4x_epilogue: + mov rdi,QWORD[8+rsp] 
;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_4x: +global chacha20_avx2 + +ALIGN 32 +chacha20_avx2: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx2: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx2: + mov r9,rsp + + sub rsp,0x280+168 + and rsp,-32 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$8x_body: + vzeroupper + + vbroadcasti128 ymm11,XMMWORD[$L$sigma] + vbroadcasti128 ymm3,XMMWORD[rcx] + vbroadcasti128 ymm15,XMMWORD[16+rcx] + vbroadcasti128 ymm7,XMMWORD[r8] + lea rcx,[256+rsp] + lea rax,[512+rsp] + lea r10,[$L$rot16] + lea r11,[$L$rot24] + + vpshufd ymm8,ymm11,0x00 + vpshufd ymm9,ymm11,0x55 + vmovdqa YMMWORD[(128-256)+rcx],ymm8 + vpshufd ymm10,ymm11,0xaa + vmovdqa YMMWORD[(160-256)+rcx],ymm9 + vpshufd ymm11,ymm11,0xff + vmovdqa YMMWORD[(192-256)+rcx],ymm10 + vmovdqa YMMWORD[(224-256)+rcx],ymm11 + + vpshufd ymm0,ymm3,0x00 + vpshufd ymm1,ymm3,0x55 + vmovdqa YMMWORD[(256-256)+rcx],ymm0 + vpshufd ymm2,ymm3,0xaa + vmovdqa YMMWORD[(288-256)+rcx],ymm1 + vpshufd ymm3,ymm3,0xff + vmovdqa YMMWORD[(320-256)+rcx],ymm2 + vmovdqa YMMWORD[(352-256)+rcx],ymm3 + + vpshufd ymm12,ymm15,0x00 + vpshufd ymm13,ymm15,0x55 + vmovdqa YMMWORD[(384-512)+rax],ymm12 + vpshufd ymm14,ymm15,0xaa + vmovdqa YMMWORD[(416-512)+rax],ymm13 + vpshufd ymm15,ymm15,0xff + vmovdqa YMMWORD[(448-512)+rax],ymm14 + vmovdqa YMMWORD[(480-512)+rax],ymm15 + + vpshufd ymm4,ymm7,0x00 + vpshufd ymm5,ymm7,0x55 + vpaddd ymm4,ymm4,YMMWORD[$L$incy] + vpshufd ymm6,ymm7,0xaa + vmovdqa YMMWORD[(544-512)+rax],ymm5 + vpshufd ymm7,ymm7,0xff + vmovdqa YMMWORD[(576-512)+rax],ymm6 + vmovdqa YMMWORD[(608-512)+rax],ymm7 + + jmp NEAR $L$oop_enter8x + +ALIGN 32 +$L$oop_outer8x: + vmovdqa ymm8,YMMWORD[((128-256))+rcx] + vmovdqa ymm9,YMMWORD[((160-256))+rcx] + vmovdqa ymm10,YMMWORD[((192-256))+rcx] + vmovdqa ymm11,YMMWORD[((224-256))+rcx] + vmovdqa ymm0,YMMWORD[((256-256))+rcx] + vmovdqa ymm1,YMMWORD[((288-256))+rcx] + vmovdqa ymm2,YMMWORD[((320-256))+rcx] + vmovdqa ymm3,YMMWORD[((352-256))+rcx] + vmovdqa ymm12,YMMWORD[((384-512))+rax] + vmovdqa ymm13,YMMWORD[((416-512))+rax] + vmovdqa ymm14,YMMWORD[((448-512))+rax] + vmovdqa ymm15,YMMWORD[((480-512))+rax] + vmovdqa ymm4,YMMWORD[((512-512))+rax] + vmovdqa ymm5,YMMWORD[((544-512))+rax] + vmovdqa ymm6,YMMWORD[((576-512))+rax] + vmovdqa ymm7,YMMWORD[((608-512))+rax] + vpaddd ymm4,ymm4,YMMWORD[$L$eight] + +$L$oop_enter8x: + vmovdqa YMMWORD[64+rsp],ymm14 + vmovdqa YMMWORD[96+rsp],ymm15 + vbroadcasti128 ymm15,XMMWORD[r10] + vmovdqa YMMWORD[(512-512)+rax],ymm4 + mov eax,10 + jmp NEAR $L$oop8x + +ALIGN 32 +$L$oop8x: + vpaddd ymm8,ymm8,ymm0 + vpxor ymm4,ymm8,ymm4 + vpshufb ymm4,ymm4,ymm15 + vpaddd ymm9,ymm9,ymm1 + vpxor ymm5,ymm9,ymm5 + vpshufb ymm5,ymm5,ymm15 + vpaddd ymm12,ymm12,ymm4 + vpxor ymm0,ymm12,ymm0 + vpslld ymm14,ymm0,12 + vpsrld ymm0,ymm0,20 + vpor ymm0,ymm14,ymm0 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm5 + vpxor ymm1,ymm13,ymm1 + vpslld ymm15,ymm1,12 + vpsrld ymm1,ymm1,20 + vpor ymm1,ymm15,ymm1 + vpaddd ymm8,ymm8,ymm0 + vpxor ymm4,ymm8,ymm4 + vpshufb ymm4,ymm4,ymm14 + vpaddd ymm9,ymm9,ymm1 + vpxor ymm5,ymm9,ymm5 + vpshufb ymm5,ymm5,ymm14 + vpaddd 
ymm12,ymm12,ymm4 + vpxor ymm0,ymm12,ymm0 + vpslld ymm15,ymm0,7 + vpsrld ymm0,ymm0,25 + vpor ymm0,ymm15,ymm0 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm5 + vpxor ymm1,ymm13,ymm1 + vpslld ymm14,ymm1,7 + vpsrld ymm1,ymm1,25 + vpor ymm1,ymm14,ymm1 + vmovdqa YMMWORD[rsp],ymm12 + vmovdqa YMMWORD[32+rsp],ymm13 + vmovdqa ymm12,YMMWORD[64+rsp] + vmovdqa ymm13,YMMWORD[96+rsp] + vpaddd ymm10,ymm10,ymm2 + vpxor ymm6,ymm10,ymm6 + vpshufb ymm6,ymm6,ymm15 + vpaddd ymm11,ymm11,ymm3 + vpxor ymm7,ymm11,ymm7 + vpshufb ymm7,ymm7,ymm15 + vpaddd ymm12,ymm12,ymm6 + vpxor ymm2,ymm12,ymm2 + vpslld ymm14,ymm2,12 + vpsrld ymm2,ymm2,20 + vpor ymm2,ymm14,ymm2 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm7 + vpxor ymm3,ymm13,ymm3 + vpslld ymm15,ymm3,12 + vpsrld ymm3,ymm3,20 + vpor ymm3,ymm15,ymm3 + vpaddd ymm10,ymm10,ymm2 + vpxor ymm6,ymm10,ymm6 + vpshufb ymm6,ymm6,ymm14 + vpaddd ymm11,ymm11,ymm3 + vpxor ymm7,ymm11,ymm7 + vpshufb ymm7,ymm7,ymm14 + vpaddd ymm12,ymm12,ymm6 + vpxor ymm2,ymm12,ymm2 + vpslld ymm15,ymm2,7 + vpsrld ymm2,ymm2,25 + vpor ymm2,ymm15,ymm2 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm7 + vpxor ymm3,ymm13,ymm3 + vpslld ymm14,ymm3,7 + vpsrld ymm3,ymm3,25 + vpor ymm3,ymm14,ymm3 + vpaddd ymm8,ymm8,ymm1 + vpxor ymm7,ymm8,ymm7 + vpshufb ymm7,ymm7,ymm15 + vpaddd ymm9,ymm9,ymm2 + vpxor ymm4,ymm9,ymm4 + vpshufb ymm4,ymm4,ymm15 + vpaddd ymm12,ymm12,ymm7 + vpxor ymm1,ymm12,ymm1 + vpslld ymm14,ymm1,12 + vpsrld ymm1,ymm1,20 + vpor ymm1,ymm14,ymm1 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm4 + vpxor ymm2,ymm13,ymm2 + vpslld ymm15,ymm2,12 + vpsrld ymm2,ymm2,20 + vpor ymm2,ymm15,ymm2 + vpaddd ymm8,ymm8,ymm1 + vpxor ymm7,ymm8,ymm7 + vpshufb ymm7,ymm7,ymm14 + vpaddd ymm9,ymm9,ymm2 + vpxor ymm4,ymm9,ymm4 + vpshufb ymm4,ymm4,ymm14 + vpaddd ymm12,ymm12,ymm7 + vpxor ymm1,ymm12,ymm1 + vpslld ymm15,ymm1,7 + vpsrld ymm1,ymm1,25 + vpor ymm1,ymm15,ymm1 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm4 + vpxor ymm2,ymm13,ymm2 + vpslld ymm14,ymm2,7 + vpsrld ymm2,ymm2,25 + vpor ymm2,ymm14,ymm2 + vmovdqa YMMWORD[64+rsp],ymm12 + vmovdqa YMMWORD[96+rsp],ymm13 + vmovdqa ymm12,YMMWORD[rsp] + vmovdqa ymm13,YMMWORD[32+rsp] + vpaddd ymm10,ymm10,ymm3 + vpxor ymm5,ymm10,ymm5 + vpshufb ymm5,ymm5,ymm15 + vpaddd ymm11,ymm11,ymm0 + vpxor ymm6,ymm11,ymm6 + vpshufb ymm6,ymm6,ymm15 + vpaddd ymm12,ymm12,ymm5 + vpxor ymm3,ymm12,ymm3 + vpslld ymm14,ymm3,12 + vpsrld ymm3,ymm3,20 + vpor ymm3,ymm14,ymm3 + vbroadcasti128 ymm14,XMMWORD[r11] + vpaddd ymm13,ymm13,ymm6 + vpxor ymm0,ymm13,ymm0 + vpslld ymm15,ymm0,12 + vpsrld ymm0,ymm0,20 + vpor ymm0,ymm15,ymm0 + vpaddd ymm10,ymm10,ymm3 + vpxor ymm5,ymm10,ymm5 + vpshufb ymm5,ymm5,ymm14 + vpaddd ymm11,ymm11,ymm0 + vpxor ymm6,ymm11,ymm6 + vpshufb ymm6,ymm6,ymm14 + vpaddd ymm12,ymm12,ymm5 + vpxor ymm3,ymm12,ymm3 + vpslld ymm15,ymm3,7 + vpsrld ymm3,ymm3,25 + vpor ymm3,ymm15,ymm3 + vbroadcasti128 ymm15,XMMWORD[r10] + vpaddd ymm13,ymm13,ymm6 + vpxor ymm0,ymm13,ymm0 + vpslld ymm14,ymm0,7 + vpsrld ymm0,ymm0,25 + vpor ymm0,ymm14,ymm0 + dec eax + jnz NEAR $L$oop8x + + lea rax,[512+rsp] + vpaddd ymm8,ymm8,YMMWORD[((128-256))+rcx] + vpaddd ymm9,ymm9,YMMWORD[((160-256))+rcx] + vpaddd ymm10,ymm10,YMMWORD[((192-256))+rcx] + vpaddd ymm11,ymm11,YMMWORD[((224-256))+rcx] + + vpunpckldq ymm14,ymm8,ymm9 + vpunpckldq ymm15,ymm10,ymm11 + vpunpckhdq ymm8,ymm8,ymm9 + vpunpckhdq ymm10,ymm10,ymm11 + vpunpcklqdq ymm9,ymm14,ymm15 + vpunpckhqdq ymm14,ymm14,ymm15 + vpunpcklqdq ymm11,ymm8,ymm10 + vpunpckhqdq ymm8,ymm8,ymm10 + vpaddd 
ymm0,ymm0,YMMWORD[((256-256))+rcx] + vpaddd ymm1,ymm1,YMMWORD[((288-256))+rcx] + vpaddd ymm2,ymm2,YMMWORD[((320-256))+rcx] + vpaddd ymm3,ymm3,YMMWORD[((352-256))+rcx] + + vpunpckldq ymm10,ymm0,ymm1 + vpunpckldq ymm15,ymm2,ymm3 + vpunpckhdq ymm0,ymm0,ymm1 + vpunpckhdq ymm2,ymm2,ymm3 + vpunpcklqdq ymm1,ymm10,ymm15 + vpunpckhqdq ymm10,ymm10,ymm15 + vpunpcklqdq ymm3,ymm0,ymm2 + vpunpckhqdq ymm0,ymm0,ymm2 + vperm2i128 ymm15,ymm9,ymm1,0x20 + vperm2i128 ymm1,ymm9,ymm1,0x31 + vperm2i128 ymm9,ymm14,ymm10,0x20 + vperm2i128 ymm10,ymm14,ymm10,0x31 + vperm2i128 ymm14,ymm11,ymm3,0x20 + vperm2i128 ymm3,ymm11,ymm3,0x31 + vperm2i128 ymm11,ymm8,ymm0,0x20 + vperm2i128 ymm0,ymm8,ymm0,0x31 + vmovdqa YMMWORD[rsp],ymm15 + vmovdqa YMMWORD[32+rsp],ymm9 + vmovdqa ymm15,YMMWORD[64+rsp] + vmovdqa ymm9,YMMWORD[96+rsp] + + vpaddd ymm12,ymm12,YMMWORD[((384-512))+rax] + vpaddd ymm13,ymm13,YMMWORD[((416-512))+rax] + vpaddd ymm15,ymm15,YMMWORD[((448-512))+rax] + vpaddd ymm9,ymm9,YMMWORD[((480-512))+rax] + + vpunpckldq ymm2,ymm12,ymm13 + vpunpckldq ymm8,ymm15,ymm9 + vpunpckhdq ymm12,ymm12,ymm13 + vpunpckhdq ymm15,ymm15,ymm9 + vpunpcklqdq ymm13,ymm2,ymm8 + vpunpckhqdq ymm2,ymm2,ymm8 + vpunpcklqdq ymm9,ymm12,ymm15 + vpunpckhqdq ymm12,ymm12,ymm15 + vpaddd ymm4,ymm4,YMMWORD[((512-512))+rax] + vpaddd ymm5,ymm5,YMMWORD[((544-512))+rax] + vpaddd ymm6,ymm6,YMMWORD[((576-512))+rax] + vpaddd ymm7,ymm7,YMMWORD[((608-512))+rax] + + vpunpckldq ymm15,ymm4,ymm5 + vpunpckldq ymm8,ymm6,ymm7 + vpunpckhdq ymm4,ymm4,ymm5 + vpunpckhdq ymm6,ymm6,ymm7 + vpunpcklqdq ymm5,ymm15,ymm8 + vpunpckhqdq ymm15,ymm15,ymm8 + vpunpcklqdq ymm7,ymm4,ymm6 + vpunpckhqdq ymm4,ymm4,ymm6 + vperm2i128 ymm8,ymm13,ymm5,0x20 + vperm2i128 ymm5,ymm13,ymm5,0x31 + vperm2i128 ymm13,ymm2,ymm15,0x20 + vperm2i128 ymm15,ymm2,ymm15,0x31 + vperm2i128 ymm2,ymm9,ymm7,0x20 + vperm2i128 ymm7,ymm9,ymm7,0x31 + vperm2i128 ymm9,ymm12,ymm4,0x20 + vperm2i128 ymm4,ymm12,ymm4,0x31 + vmovdqa ymm6,YMMWORD[rsp] + vmovdqa ymm12,YMMWORD[32+rsp] + + cmp rdx,64*8 + jb NEAR $L$tail8x + + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + lea rdi,[128+rdi] + + vpxor ymm12,ymm12,YMMWORD[rsi] + vpxor ymm13,ymm13,YMMWORD[32+rsi] + vpxor ymm10,ymm10,YMMWORD[64+rsi] + vpxor ymm15,ymm15,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm12 + vmovdqu YMMWORD[32+rdi],ymm13 + vmovdqu YMMWORD[64+rdi],ymm10 + vmovdqu YMMWORD[96+rdi],ymm15 + lea rdi,[128+rdi] + + vpxor ymm14,ymm14,YMMWORD[rsi] + vpxor ymm2,ymm2,YMMWORD[32+rsi] + vpxor ymm3,ymm3,YMMWORD[64+rsi] + vpxor ymm7,ymm7,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm14 + vmovdqu YMMWORD[32+rdi],ymm2 + vmovdqu YMMWORD[64+rdi],ymm3 + vmovdqu YMMWORD[96+rdi],ymm7 + lea rdi,[128+rdi] + + vpxor ymm11,ymm11,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vpxor ymm0,ymm0,YMMWORD[64+rsi] + vpxor ymm4,ymm4,YMMWORD[96+rsi] + lea rsi,[128+rsi] + vmovdqu YMMWORD[rdi],ymm11 + vmovdqu YMMWORD[32+rdi],ymm9 + vmovdqu YMMWORD[64+rdi],ymm0 + vmovdqu YMMWORD[96+rdi],ymm4 + lea rdi,[128+rdi] + + sub rdx,64*8 + jnz NEAR $L$oop_outer8x + + jmp NEAR $L$done8x + +$L$tail8x: + cmp rdx,448 + jae NEAR $L$448_or_more8x + cmp rdx,384 + jae NEAR $L$384_or_more8x + cmp rdx,320 + jae NEAR $L$320_or_more8x + cmp rdx,256 + jae NEAR $L$256_or_more8x + cmp rdx,192 + jae NEAR $L$192_or_more8x + cmp rdx,128 + jae NEAR $L$128_or_more8x + cmp rdx,64 
+ jae NEAR $L$64_or_more8x + + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm6 + vmovdqa YMMWORD[32+rsp],ymm8 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$64_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + je NEAR $L$done8x + + lea rsi,[64+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm1 + lea rdi,[64+rdi] + sub rdx,64 + vmovdqa YMMWORD[32+rsp],ymm5 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$128_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + je NEAR $L$done8x + + lea rsi,[128+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm12 + lea rdi,[128+rdi] + sub rdx,128 + vmovdqa YMMWORD[32+rsp],ymm13 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$192_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + je NEAR $L$done8x + + lea rsi,[192+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm10 + lea rdi,[192+rdi] + sub rdx,192 + vmovdqa YMMWORD[32+rsp],ymm15 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$256_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + je NEAR $L$done8x + + lea rsi,[256+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm14 + lea rdi,[256+rdi] + sub rdx,256 + vmovdqa YMMWORD[32+rsp],ymm2 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$320_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + je NEAR $L$done8x + + lea rsi,[320+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm3 + lea rdi,[320+rdi] + sub rdx,320 + vmovdqa YMMWORD[32+rsp],ymm7 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$384_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vpxor ymm3,ymm3,YMMWORD[320+rsi] + 
vpxor ymm7,ymm7,YMMWORD[352+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + vmovdqu YMMWORD[320+rdi],ymm3 + vmovdqu YMMWORD[352+rdi],ymm7 + je NEAR $L$done8x + + lea rsi,[384+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm11 + lea rdi,[384+rdi] + sub rdx,384 + vmovdqa YMMWORD[32+rsp],ymm9 + jmp NEAR $L$oop_tail8x + +ALIGN 32 +$L$448_or_more8x: + vpxor ymm6,ymm6,YMMWORD[rsi] + vpxor ymm8,ymm8,YMMWORD[32+rsi] + vpxor ymm1,ymm1,YMMWORD[64+rsi] + vpxor ymm5,ymm5,YMMWORD[96+rsi] + vpxor ymm12,ymm12,YMMWORD[128+rsi] + vpxor ymm13,ymm13,YMMWORD[160+rsi] + vpxor ymm10,ymm10,YMMWORD[192+rsi] + vpxor ymm15,ymm15,YMMWORD[224+rsi] + vpxor ymm14,ymm14,YMMWORD[256+rsi] + vpxor ymm2,ymm2,YMMWORD[288+rsi] + vpxor ymm3,ymm3,YMMWORD[320+rsi] + vpxor ymm7,ymm7,YMMWORD[352+rsi] + vpxor ymm11,ymm11,YMMWORD[384+rsi] + vpxor ymm9,ymm9,YMMWORD[416+rsi] + vmovdqu YMMWORD[rdi],ymm6 + vmovdqu YMMWORD[32+rdi],ymm8 + vmovdqu YMMWORD[64+rdi],ymm1 + vmovdqu YMMWORD[96+rdi],ymm5 + vmovdqu YMMWORD[128+rdi],ymm12 + vmovdqu YMMWORD[160+rdi],ymm13 + vmovdqu YMMWORD[192+rdi],ymm10 + vmovdqu YMMWORD[224+rdi],ymm15 + vmovdqu YMMWORD[256+rdi],ymm14 + vmovdqu YMMWORD[288+rdi],ymm2 + vmovdqu YMMWORD[320+rdi],ymm3 + vmovdqu YMMWORD[352+rdi],ymm7 + vmovdqu YMMWORD[384+rdi],ymm11 + vmovdqu YMMWORD[416+rdi],ymm9 + je NEAR $L$done8x + + lea rsi,[448+rsi] + xor r10,r10 + vmovdqa YMMWORD[rsp],ymm0 + lea rdi,[448+rdi] + sub rdx,448 + vmovdqa YMMWORD[32+rsp],ymm4 + +$L$oop_tail8x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail8x + +$L$done8x: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$8x_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx2: +global chacha20_avx512 + +ALIGN 32 +chacha20_avx512: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx512: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx512: + mov r9,rsp + + cmp rdx,512 + ja NEAR $L$chacha20_16x + + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$avx512_body: + vbroadcasti32x4 zmm0,ZMMWORD[$L$sigma] + vbroadcasti32x4 zmm1,ZMMWORD[rcx] + vbroadcasti32x4 zmm2,ZMMWORD[16+rcx] + vbroadcasti32x4 zmm3,ZMMWORD[r8] + + vmovdqa32 zmm16,zmm0 + vmovdqa32 zmm17,zmm1 + vmovdqa32 zmm18,zmm2 + vpaddd zmm3,zmm3,ZMMWORD[$L$zeroz] + vmovdqa32 zmm20,ZMMWORD[$L$fourz] + mov r8,10 + vmovdqa32 zmm19,zmm3 + jmp NEAR $L$oop_avx512 + +ALIGN 16 +$L$oop_outer_avx512: + vmovdqa32 zmm0,zmm16 + vmovdqa32 zmm1,zmm17 + vmovdqa32 zmm2,zmm18 + vpaddd zmm3,zmm19,zmm20 + mov r8,10 + vmovdqa32 zmm19,zmm3 + jmp NEAR $L$oop_avx512 + +ALIGN 32 +$L$oop_avx512: + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,16 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,12 + vpaddd 
zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,8 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,7 + vpshufd zmm2,zmm2,78 + vpshufd zmm1,zmm1,57 + vpshufd zmm3,zmm3,147 + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,16 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,12 + vpaddd zmm0,zmm0,zmm1 + vpxord zmm3,zmm3,zmm0 + vprold zmm3,zmm3,8 + vpaddd zmm2,zmm2,zmm3 + vpxord zmm1,zmm1,zmm2 + vprold zmm1,zmm1,7 + vpshufd zmm2,zmm2,78 + vpshufd zmm1,zmm1,147 + vpshufd zmm3,zmm3,57 + dec r8 + jnz NEAR $L$oop_avx512 + vpaddd zmm0,zmm0,zmm16 + vpaddd zmm1,zmm1,zmm17 + vpaddd zmm2,zmm2,zmm18 + vpaddd zmm3,zmm3,zmm19 + + sub rdx,64 + jb NEAR $L$tail64_avx512 + + vpxor xmm4,xmm0,XMMWORD[rsi] + vpxor xmm5,xmm1,XMMWORD[16+rsi] + vpxor xmm6,xmm2,XMMWORD[32+rsi] + vpxor xmm7,xmm3,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,1 + vextracti32x4 xmm5,zmm1,1 + vextracti32x4 xmm6,zmm2,1 + vextracti32x4 xmm7,zmm3,1 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,2 + vextracti32x4 xmm5,zmm1,2 + vextracti32x4 xmm6,zmm2,2 + vextracti32x4 xmm7,zmm3,2 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512 + + vextracti32x4 xmm4,zmm0,3 + vextracti32x4 xmm5,zmm1,3 + vextracti32x4 xmm6,zmm2,3 + vextracti32x4 xmm7,zmm3,3 + + sub rdx,64 + jb NEAR $L$tail_avx512 + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jnz NEAR $L$oop_outer_avx512 + + jmp NEAR $L$done_avx512 + +ALIGN 16 +$L$tail64_avx512: + vmovdqa XMMWORD[rsp],xmm0 + vmovdqa XMMWORD[16+rsp],xmm1 + vmovdqa XMMWORD[32+rsp],xmm2 + vmovdqa XMMWORD[48+rsp],xmm3 + add rdx,64 + jmp NEAR $L$oop_tail_avx512 + +ALIGN 16 +$L$tail_avx512: + vmovdqa XMMWORD[rsp],xmm4 + vmovdqa XMMWORD[16+rsp],xmm5 + vmovdqa XMMWORD[32+rsp],xmm6 + vmovdqa XMMWORD[48+rsp],xmm7 + add rdx,64 + +$L$oop_tail_avx512: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_avx512 + + vmovdqu32 ZMMWORD[rsp],zmm16 + +$L$done_avx512: + vzeroall + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$avx512_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx512: +global chacha20_avx512vl + +ALIGN 32 +chacha20_avx512vl: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_avx512vl: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 
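+ ; Win64 passes the first four arguments in rcx, rdx, r8 and r9 with the fifth on the stack; these moves re-stage them into rdi, rsi, rdx, rcx and r8, the System V order the generated function body expects.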
+ mov r8,QWORD[40+rsp] + + + +$L$chacha20_avx512vl: + mov r9,rsp + + cmp rdx,128 + ja NEAR $L$chacha20_8xvl + + sub rsp,64+40 + movaps XMMWORD[(-40)+r9],xmm6 + movaps XMMWORD[(-24)+r9],xmm7 +$L$avx512vl_body: + vbroadcasti128 ymm0,XMMWORD[$L$sigma] + vbroadcasti128 ymm1,XMMWORD[rcx] + vbroadcasti128 ymm2,XMMWORD[16+rcx] + vbroadcasti128 ymm3,XMMWORD[r8] + + vmovdqa32 ymm16,ymm0 + vmovdqa32 ymm17,ymm1 + vmovdqa32 ymm18,ymm2 + vpaddd ymm3,ymm3,YMMWORD[$L$zeroz] + vmovdqa32 ymm20,YMMWORD[$L$twoy] + mov r8,10 + vmovdqa32 ymm19,ymm3 + jmp NEAR $L$oop_avx512vl + +ALIGN 16 +$L$oop_outer_avx512vl: + vmovdqa32 ymm2,ymm18 + vpaddd ymm3,ymm19,ymm20 + mov r8,10 + vmovdqa32 ymm19,ymm3 + jmp NEAR $L$oop_avx512vl + +ALIGN 32 +$L$oop_avx512vl: + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,16 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,12 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,8 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,7 + vpshufd ymm2,ymm2,78 + vpshufd ymm1,ymm1,57 + vpshufd ymm3,ymm3,147 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,16 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,12 + vpaddd ymm0,ymm0,ymm1 + vpxor ymm3,ymm3,ymm0 + vprold ymm3,ymm3,8 + vpaddd ymm2,ymm2,ymm3 + vpxor ymm1,ymm1,ymm2 + vprold ymm1,ymm1,7 + vpshufd ymm2,ymm2,78 + vpshufd ymm1,ymm1,147 + vpshufd ymm3,ymm3,57 + dec r8 + jnz NEAR $L$oop_avx512vl + vpaddd ymm0,ymm0,ymm16 + vpaddd ymm1,ymm1,ymm17 + vpaddd ymm2,ymm2,ymm18 + vpaddd ymm3,ymm3,ymm19 + + sub rdx,64 + jb NEAR $L$tail64_avx512vl + + vpxor xmm4,xmm0,XMMWORD[rsi] + vpxor xmm5,xmm1,XMMWORD[16+rsi] + vpxor xmm6,xmm2,XMMWORD[32+rsi] + vpxor xmm7,xmm3,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + jz NEAR $L$done_avx512vl + + vextracti128 xmm4,ymm0,1 + vextracti128 xmm5,ymm1,1 + vextracti128 xmm6,ymm2,1 + vextracti128 xmm7,ymm3,1 + + sub rdx,64 + jb NEAR $L$tail_avx512vl + + vpxor xmm4,xmm4,XMMWORD[rsi] + vpxor xmm5,xmm5,XMMWORD[16+rsi] + vpxor xmm6,xmm6,XMMWORD[32+rsi] + vpxor xmm7,xmm7,XMMWORD[48+rsi] + lea rsi,[64+rsi] + + vmovdqu XMMWORD[rdi],xmm4 + vmovdqu XMMWORD[16+rdi],xmm5 + vmovdqu XMMWORD[32+rdi],xmm6 + vmovdqu XMMWORD[48+rdi],xmm7 + lea rdi,[64+rdi] + + vmovdqa32 ymm0,ymm16 + vmovdqa32 ymm1,ymm17 + jnz NEAR $L$oop_outer_avx512vl + + jmp NEAR $L$done_avx512vl + +ALIGN 16 +$L$tail64_avx512vl: + vmovdqa XMMWORD[rsp],xmm0 + vmovdqa XMMWORD[16+rsp],xmm1 + vmovdqa XMMWORD[32+rsp],xmm2 + vmovdqa XMMWORD[48+rsp],xmm3 + add rdx,64 + jmp NEAR $L$oop_tail_avx512vl + +ALIGN 16 +$L$tail_avx512vl: + vmovdqa XMMWORD[rsp],xmm4 + vmovdqa XMMWORD[16+rsp],xmm5 + vmovdqa XMMWORD[32+rsp],xmm6 + vmovdqa XMMWORD[48+rsp],xmm7 + add rdx,64 + +$L$oop_tail_avx512vl: + movzx eax,BYTE[r8*1+rsi] + movzx ecx,BYTE[r8*1+rsp] + lea r8,[1+r8] + xor eax,ecx + mov BYTE[((-1))+r8*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail_avx512vl + + vmovdqu32 YMMWORD[rsp],ymm16 + vmovdqu32 YMMWORD[32+rsp],ymm16 + +$L$done_avx512vl: + vzeroall + movaps xmm6,XMMWORD[((-40))+r9] + movaps xmm7,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$avx512vl_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_avx512vl: +global chacha20_16x + +ALIGN 32 +chacha20_16x: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_16x: + mov rdi,rcx + mov rsi,rdx + mov 
rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_16x: + mov r9,rsp + + sub rsp,64+168 + and rsp,-64 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$16x_body: + vzeroupper + + lea r10,[$L$sigma] + vbroadcasti32x4 zmm3,ZMMWORD[r10] + vbroadcasti32x4 zmm7,ZMMWORD[rcx] + vbroadcasti32x4 zmm11,ZMMWORD[16+rcx] + vbroadcasti32x4 zmm15,ZMMWORD[r8] + + vpshufd zmm0,zmm3,0x00 + vpshufd zmm1,zmm3,0x55 + vpshufd zmm2,zmm3,0xaa + vpshufd zmm3,zmm3,0xff + vmovdqa64 zmm16,zmm0 + vmovdqa64 zmm17,zmm1 + vmovdqa64 zmm18,zmm2 + vmovdqa64 zmm19,zmm3 + + vpshufd zmm4,zmm7,0x00 + vpshufd zmm5,zmm7,0x55 + vpshufd zmm6,zmm7,0xaa + vpshufd zmm7,zmm7,0xff + vmovdqa64 zmm20,zmm4 + vmovdqa64 zmm21,zmm5 + vmovdqa64 zmm22,zmm6 + vmovdqa64 zmm23,zmm7 + + vpshufd zmm8,zmm11,0x00 + vpshufd zmm9,zmm11,0x55 + vpshufd zmm10,zmm11,0xaa + vpshufd zmm11,zmm11,0xff + vmovdqa64 zmm24,zmm8 + vmovdqa64 zmm25,zmm9 + vmovdqa64 zmm26,zmm10 + vmovdqa64 zmm27,zmm11 + + vpshufd zmm12,zmm15,0x00 + vpshufd zmm13,zmm15,0x55 + vpshufd zmm14,zmm15,0xaa + vpshufd zmm15,zmm15,0xff + vpaddd zmm12,zmm12,ZMMWORD[$L$incz] + vmovdqa64 zmm28,zmm12 + vmovdqa64 zmm29,zmm13 + vmovdqa64 zmm30,zmm14 + vmovdqa64 zmm31,zmm15 + + mov eax,10 + jmp NEAR $L$oop16x + +ALIGN 32 +$L$oop_outer16x: + vpbroadcastd zmm0,DWORD[r10] + vpbroadcastd zmm1,DWORD[4+r10] + vpbroadcastd zmm2,DWORD[8+r10] + vpbroadcastd zmm3,DWORD[12+r10] + vpaddd zmm28,zmm28,ZMMWORD[$L$sixteen] + vmovdqa64 zmm4,zmm20 + vmovdqa64 zmm5,zmm21 + vmovdqa64 zmm6,zmm22 + vmovdqa64 zmm7,zmm23 + vmovdqa64 zmm8,zmm24 + vmovdqa64 zmm9,zmm25 + vmovdqa64 zmm10,zmm26 + vmovdqa64 zmm11,zmm27 + vmovdqa64 zmm12,zmm28 + vmovdqa64 zmm13,zmm29 + vmovdqa64 zmm14,zmm30 + vmovdqa64 zmm15,zmm31 + + vmovdqa64 zmm16,zmm0 + vmovdqa64 zmm17,zmm1 + vmovdqa64 zmm18,zmm2 + vmovdqa64 zmm19,zmm3 + + mov eax,10 + jmp NEAR $L$oop16x + +ALIGN 32 +$L$oop16x: + vpaddd zmm0,zmm0,zmm4 + vpaddd zmm1,zmm1,zmm5 + vpaddd zmm2,zmm2,zmm6 + vpaddd zmm3,zmm3,zmm7 + vpxord zmm12,zmm12,zmm0 + vpxord zmm13,zmm13,zmm1 + vpxord zmm14,zmm14,zmm2 + vpxord zmm15,zmm15,zmm3 + vprold zmm12,zmm12,16 + vprold zmm13,zmm13,16 + vprold zmm14,zmm14,16 + vprold zmm15,zmm15,16 + vpaddd zmm8,zmm8,zmm12 + vpaddd zmm9,zmm9,zmm13 + vpaddd zmm10,zmm10,zmm14 + vpaddd zmm11,zmm11,zmm15 + vpxord zmm4,zmm4,zmm8 + vpxord zmm5,zmm5,zmm9 + vpxord zmm6,zmm6,zmm10 + vpxord zmm7,zmm7,zmm11 + vprold zmm4,zmm4,12 + vprold zmm5,zmm5,12 + vprold zmm6,zmm6,12 + vprold zmm7,zmm7,12 + vpaddd zmm0,zmm0,zmm4 + vpaddd zmm1,zmm1,zmm5 + vpaddd zmm2,zmm2,zmm6 + vpaddd zmm3,zmm3,zmm7 + vpxord zmm12,zmm12,zmm0 + vpxord zmm13,zmm13,zmm1 + vpxord zmm14,zmm14,zmm2 + vpxord zmm15,zmm15,zmm3 + vprold zmm12,zmm12,8 + vprold zmm13,zmm13,8 + vprold zmm14,zmm14,8 + vprold zmm15,zmm15,8 + vpaddd zmm8,zmm8,zmm12 + vpaddd zmm9,zmm9,zmm13 + vpaddd zmm10,zmm10,zmm14 + vpaddd zmm11,zmm11,zmm15 + vpxord zmm4,zmm4,zmm8 + vpxord zmm5,zmm5,zmm9 + vpxord zmm6,zmm6,zmm10 + vpxord zmm7,zmm7,zmm11 + vprold zmm4,zmm4,7 + vprold zmm5,zmm5,7 + vprold zmm6,zmm6,7 + vprold zmm7,zmm7,7 + vpaddd zmm0,zmm0,zmm5 + vpaddd zmm1,zmm1,zmm6 + vpaddd zmm2,zmm2,zmm7 + vpaddd zmm3,zmm3,zmm4 + vpxord zmm15,zmm15,zmm0 + vpxord zmm12,zmm12,zmm1 + vpxord zmm13,zmm13,zmm2 + vpxord zmm14,zmm14,zmm3 + vprold zmm15,zmm15,16 + vprold 
zmm12,zmm12,16 + vprold zmm13,zmm13,16 + vprold zmm14,zmm14,16 + vpaddd zmm10,zmm10,zmm15 + vpaddd zmm11,zmm11,zmm12 + vpaddd zmm8,zmm8,zmm13 + vpaddd zmm9,zmm9,zmm14 + vpxord zmm5,zmm5,zmm10 + vpxord zmm6,zmm6,zmm11 + vpxord zmm7,zmm7,zmm8 + vpxord zmm4,zmm4,zmm9 + vprold zmm5,zmm5,12 + vprold zmm6,zmm6,12 + vprold zmm7,zmm7,12 + vprold zmm4,zmm4,12 + vpaddd zmm0,zmm0,zmm5 + vpaddd zmm1,zmm1,zmm6 + vpaddd zmm2,zmm2,zmm7 + vpaddd zmm3,zmm3,zmm4 + vpxord zmm15,zmm15,zmm0 + vpxord zmm12,zmm12,zmm1 + vpxord zmm13,zmm13,zmm2 + vpxord zmm14,zmm14,zmm3 + vprold zmm15,zmm15,8 + vprold zmm12,zmm12,8 + vprold zmm13,zmm13,8 + vprold zmm14,zmm14,8 + vpaddd zmm10,zmm10,zmm15 + vpaddd zmm11,zmm11,zmm12 + vpaddd zmm8,zmm8,zmm13 + vpaddd zmm9,zmm9,zmm14 + vpxord zmm5,zmm5,zmm10 + vpxord zmm6,zmm6,zmm11 + vpxord zmm7,zmm7,zmm8 + vpxord zmm4,zmm4,zmm9 + vprold zmm5,zmm5,7 + vprold zmm6,zmm6,7 + vprold zmm7,zmm7,7 + vprold zmm4,zmm4,7 + dec eax + jnz NEAR $L$oop16x + + vpaddd zmm0,zmm0,zmm16 + vpaddd zmm1,zmm1,zmm17 + vpaddd zmm2,zmm2,zmm18 + vpaddd zmm3,zmm3,zmm19 + + vpunpckldq zmm18,zmm0,zmm1 + vpunpckldq zmm19,zmm2,zmm3 + vpunpckhdq zmm0,zmm0,zmm1 + vpunpckhdq zmm2,zmm2,zmm3 + vpunpcklqdq zmm1,zmm18,zmm19 + vpunpckhqdq zmm18,zmm18,zmm19 + vpunpcklqdq zmm3,zmm0,zmm2 + vpunpckhqdq zmm0,zmm0,zmm2 + vpaddd zmm4,zmm4,zmm20 + vpaddd zmm5,zmm5,zmm21 + vpaddd zmm6,zmm6,zmm22 + vpaddd zmm7,zmm7,zmm23 + + vpunpckldq zmm2,zmm4,zmm5 + vpunpckldq zmm19,zmm6,zmm7 + vpunpckhdq zmm4,zmm4,zmm5 + vpunpckhdq zmm6,zmm6,zmm7 + vpunpcklqdq zmm5,zmm2,zmm19 + vpunpckhqdq zmm2,zmm2,zmm19 + vpunpcklqdq zmm7,zmm4,zmm6 + vpunpckhqdq zmm4,zmm4,zmm6 + vshufi32x4 zmm19,zmm1,zmm5,0x44 + vshufi32x4 zmm5,zmm1,zmm5,0xee + vshufi32x4 zmm1,zmm18,zmm2,0x44 + vshufi32x4 zmm2,zmm18,zmm2,0xee + vshufi32x4 zmm18,zmm3,zmm7,0x44 + vshufi32x4 zmm7,zmm3,zmm7,0xee + vshufi32x4 zmm3,zmm0,zmm4,0x44 + vshufi32x4 zmm4,zmm0,zmm4,0xee + vpaddd zmm8,zmm8,zmm24 + vpaddd zmm9,zmm9,zmm25 + vpaddd zmm10,zmm10,zmm26 + vpaddd zmm11,zmm11,zmm27 + + vpunpckldq zmm6,zmm8,zmm9 + vpunpckldq zmm0,zmm10,zmm11 + vpunpckhdq zmm8,zmm8,zmm9 + vpunpckhdq zmm10,zmm10,zmm11 + vpunpcklqdq zmm9,zmm6,zmm0 + vpunpckhqdq zmm6,zmm6,zmm0 + vpunpcklqdq zmm11,zmm8,zmm10 + vpunpckhqdq zmm8,zmm8,zmm10 + vpaddd zmm12,zmm12,zmm28 + vpaddd zmm13,zmm13,zmm29 + vpaddd zmm14,zmm14,zmm30 + vpaddd zmm15,zmm15,zmm31 + + vpunpckldq zmm10,zmm12,zmm13 + vpunpckldq zmm0,zmm14,zmm15 + vpunpckhdq zmm12,zmm12,zmm13 + vpunpckhdq zmm14,zmm14,zmm15 + vpunpcklqdq zmm13,zmm10,zmm0 + vpunpckhqdq zmm10,zmm10,zmm0 + vpunpcklqdq zmm15,zmm12,zmm14 + vpunpckhqdq zmm12,zmm12,zmm14 + vshufi32x4 zmm0,zmm9,zmm13,0x44 + vshufi32x4 zmm13,zmm9,zmm13,0xee + vshufi32x4 zmm9,zmm6,zmm10,0x44 + vshufi32x4 zmm10,zmm6,zmm10,0xee + vshufi32x4 zmm6,zmm11,zmm15,0x44 + vshufi32x4 zmm15,zmm11,zmm15,0xee + vshufi32x4 zmm11,zmm8,zmm12,0x44 + vshufi32x4 zmm12,zmm8,zmm12,0xee + vshufi32x4 zmm16,zmm19,zmm0,0x88 + vshufi32x4 zmm19,zmm19,zmm0,0xdd + vshufi32x4 zmm0,zmm5,zmm13,0x88 + vshufi32x4 zmm13,zmm5,zmm13,0xdd + vshufi32x4 zmm17,zmm1,zmm9,0x88 + vshufi32x4 zmm1,zmm1,zmm9,0xdd + vshufi32x4 zmm9,zmm2,zmm10,0x88 + vshufi32x4 zmm10,zmm2,zmm10,0xdd + vshufi32x4 zmm14,zmm18,zmm6,0x88 + vshufi32x4 zmm18,zmm18,zmm6,0xdd + vshufi32x4 zmm6,zmm7,zmm15,0x88 + vshufi32x4 zmm15,zmm7,zmm15,0xdd + vshufi32x4 zmm8,zmm3,zmm11,0x88 + vshufi32x4 zmm3,zmm3,zmm11,0xdd + vshufi32x4 zmm11,zmm4,zmm12,0x88 + vshufi32x4 zmm12,zmm4,zmm12,0xdd + cmp rdx,64*16 + jb NEAR $L$tail16x + + vpxord zmm16,zmm16,ZMMWORD[rsi] + vpxord zmm17,zmm17,ZMMWORD[64+rsi] + vpxord 
zmm14,zmm14,ZMMWORD[128+rsi] + vpxord zmm8,zmm8,ZMMWORD[192+rsi] + vmovdqu32 ZMMWORD[rdi],zmm16 + vmovdqu32 ZMMWORD[64+rdi],zmm17 + vmovdqu32 ZMMWORD[128+rdi],zmm14 + vmovdqu32 ZMMWORD[192+rdi],zmm8 + + vpxord zmm19,zmm19,ZMMWORD[256+rsi] + vpxord zmm1,zmm1,ZMMWORD[320+rsi] + vpxord zmm18,zmm18,ZMMWORD[384+rsi] + vpxord zmm3,zmm3,ZMMWORD[448+rsi] + vmovdqu32 ZMMWORD[256+rdi],zmm19 + vmovdqu32 ZMMWORD[320+rdi],zmm1 + vmovdqu32 ZMMWORD[384+rdi],zmm18 + vmovdqu32 ZMMWORD[448+rdi],zmm3 + + vpxord zmm0,zmm0,ZMMWORD[512+rsi] + vpxord zmm9,zmm9,ZMMWORD[576+rsi] + vpxord zmm6,zmm6,ZMMWORD[640+rsi] + vpxord zmm11,zmm11,ZMMWORD[704+rsi] + vmovdqu32 ZMMWORD[512+rdi],zmm0 + vmovdqu32 ZMMWORD[576+rdi],zmm9 + vmovdqu32 ZMMWORD[640+rdi],zmm6 + vmovdqu32 ZMMWORD[704+rdi],zmm11 + + vpxord zmm13,zmm13,ZMMWORD[768+rsi] + vpxord zmm10,zmm10,ZMMWORD[832+rsi] + vpxord zmm15,zmm15,ZMMWORD[896+rsi] + vpxord zmm12,zmm12,ZMMWORD[960+rsi] + lea rsi,[1024+rsi] + vmovdqu32 ZMMWORD[768+rdi],zmm13 + vmovdqu32 ZMMWORD[832+rdi],zmm10 + vmovdqu32 ZMMWORD[896+rdi],zmm15 + vmovdqu32 ZMMWORD[960+rdi],zmm12 + lea rdi,[1024+rdi] + + sub rdx,64*16 + jnz NEAR $L$oop_outer16x + + jmp NEAR $L$done16x + +ALIGN 32 +$L$tail16x: + xor r10,r10 + sub rdi,rsi + cmp rdx,64*1 + jb NEAR $L$ess_than_64_16x + vpxord zmm16,zmm16,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm16 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm17 + lea rsi,[64+rsi] + + cmp rdx,64*2 + jb NEAR $L$ess_than_64_16x + vpxord zmm17,zmm17,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm17 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm14 + lea rsi,[64+rsi] + + cmp rdx,64*3 + jb NEAR $L$ess_than_64_16x + vpxord zmm14,zmm14,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm14 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm8 + lea rsi,[64+rsi] + + cmp rdx,64*4 + jb NEAR $L$ess_than_64_16x + vpxord zmm8,zmm8,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm8 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm19 + lea rsi,[64+rsi] + + cmp rdx,64*5 + jb NEAR $L$ess_than_64_16x + vpxord zmm19,zmm19,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm19 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm1 + lea rsi,[64+rsi] + + cmp rdx,64*6 + jb NEAR $L$ess_than_64_16x + vpxord zmm1,zmm1,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm1 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm18 + lea rsi,[64+rsi] + + cmp rdx,64*7 + jb NEAR $L$ess_than_64_16x + vpxord zmm18,zmm18,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm18 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm3 + lea rsi,[64+rsi] + + cmp rdx,64*8 + jb NEAR $L$ess_than_64_16x + vpxord zmm3,zmm3,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm3 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm0 + lea rsi,[64+rsi] + + cmp rdx,64*9 + jb NEAR $L$ess_than_64_16x + vpxord zmm0,zmm0,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm0 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm9 + lea rsi,[64+rsi] + + cmp rdx,64*10 + jb NEAR $L$ess_than_64_16x + vpxord zmm9,zmm9,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm9 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm6 + lea rsi,[64+rsi] + + cmp rdx,64*11 + jb NEAR $L$ess_than_64_16x + vpxord zmm6,zmm6,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm6 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm11 + lea rsi,[64+rsi] + + cmp rdx,64*12 + jb NEAR $L$ess_than_64_16x + vpxord zmm11,zmm11,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm11 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm13 + lea rsi,[64+rsi] + + cmp rdx,64*13 + jb NEAR $L$ess_than_64_16x + vpxord zmm13,zmm13,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm13 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm10 + lea 
rsi,[64+rsi] + + cmp rdx,64*14 + jb NEAR $L$ess_than_64_16x + vpxord zmm10,zmm10,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm10 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm15 + lea rsi,[64+rsi] + + cmp rdx,64*15 + jb NEAR $L$ess_than_64_16x + vpxord zmm15,zmm15,ZMMWORD[rsi] + vmovdqu32 ZMMWORD[rsi*1+rdi],zmm15 + je NEAR $L$done16x + vmovdqa32 zmm16,zmm12 + lea rsi,[64+rsi] + +$L$ess_than_64_16x: + vmovdqa32 ZMMWORD[rsp],zmm16 + lea rdi,[rsi*1+rdi] + and rdx,63 + +$L$oop_tail16x: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail16x + + vpxord zmm16,zmm16,zmm16 + vmovdqa32 ZMMWORD[rsp],zmm16 + +$L$done16x: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$16x_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_16x: +global chacha20_8xvl + +ALIGN 32 +chacha20_8xvl: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_chacha20_8xvl: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + mov r8,QWORD[40+rsp] + + + +$L$chacha20_8xvl: + mov r9,rsp + + sub rsp,64+168 + and rsp,-64 + movaps XMMWORD[(-168)+r9],xmm6 + movaps XMMWORD[(-152)+r9],xmm7 + movaps XMMWORD[(-136)+r9],xmm8 + movaps XMMWORD[(-120)+r9],xmm9 + movaps XMMWORD[(-104)+r9],xmm10 + movaps XMMWORD[(-88)+r9],xmm11 + movaps XMMWORD[(-72)+r9],xmm12 + movaps XMMWORD[(-56)+r9],xmm13 + movaps XMMWORD[(-40)+r9],xmm14 + movaps XMMWORD[(-24)+r9],xmm15 +$L$8xvl_body: + vzeroupper + + lea r10,[$L$sigma] + vbroadcasti128 ymm3,XMMWORD[r10] + vbroadcasti128 ymm7,XMMWORD[rcx] + vbroadcasti128 ymm11,XMMWORD[16+rcx] + vbroadcasti128 ymm15,XMMWORD[r8] + + vpshufd ymm0,ymm3,0x00 + vpshufd ymm1,ymm3,0x55 + vpshufd ymm2,ymm3,0xaa + vpshufd ymm3,ymm3,0xff + vmovdqa64 ymm16,ymm0 + vmovdqa64 ymm17,ymm1 + vmovdqa64 ymm18,ymm2 + vmovdqa64 ymm19,ymm3 + + vpshufd ymm4,ymm7,0x00 + vpshufd ymm5,ymm7,0x55 + vpshufd ymm6,ymm7,0xaa + vpshufd ymm7,ymm7,0xff + vmovdqa64 ymm20,ymm4 + vmovdqa64 ymm21,ymm5 + vmovdqa64 ymm22,ymm6 + vmovdqa64 ymm23,ymm7 + + vpshufd ymm8,ymm11,0x00 + vpshufd ymm9,ymm11,0x55 + vpshufd ymm10,ymm11,0xaa + vpshufd ymm11,ymm11,0xff + vmovdqa64 ymm24,ymm8 + vmovdqa64 ymm25,ymm9 + vmovdqa64 ymm26,ymm10 + vmovdqa64 ymm27,ymm11 + + vpshufd ymm12,ymm15,0x00 + vpshufd ymm13,ymm15,0x55 + vpshufd ymm14,ymm15,0xaa + vpshufd ymm15,ymm15,0xff + vpaddd ymm12,ymm12,YMMWORD[$L$incy] + vmovdqa64 ymm28,ymm12 + vmovdqa64 ymm29,ymm13 + vmovdqa64 ymm30,ymm14 + vmovdqa64 ymm31,ymm15 + + mov eax,10 + jmp NEAR $L$oop8xvl + +ALIGN 32 +$L$oop_outer8xvl: + + + vpbroadcastd ymm2,DWORD[8+r10] + vpbroadcastd ymm3,DWORD[12+r10] + vpaddd ymm28,ymm28,YMMWORD[$L$eight] + vmovdqa64 ymm4,ymm20 + vmovdqa64 ymm5,ymm21 + vmovdqa64 ymm6,ymm22 + vmovdqa64 ymm7,ymm23 + vmovdqa64 ymm8,ymm24 + vmovdqa64 ymm9,ymm25 + vmovdqa64 ymm10,ymm26 + vmovdqa64 ymm11,ymm27 + vmovdqa64 ymm12,ymm28 + vmovdqa64 ymm13,ymm29 + vmovdqa64 ymm14,ymm30 + vmovdqa64 ymm15,ymm31 + + vmovdqa64 ymm16,ymm0 + vmovdqa64 ymm17,ymm1 + vmovdqa64 ymm18,ymm2 + vmovdqa64 ymm19,ymm3 + + mov eax,10 + jmp NEAR $L$oop8xvl + +ALIGN 32 +$L$oop8xvl: + vpaddd ymm0,ymm0,ymm4 + vpaddd ymm1,ymm1,ymm5 
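+ ; One pass of $L$oop8xvl is a full ChaCha20 double round on eight blocks: four column quarter-rounds followed by four diagonal ones. In scalar form each quarter-round is a += b; d ^= a; d <<<= 16; c += d; b ^= c; b <<<= 12; a += b; d ^= a; d <<<= 8; c += d; b ^= c; b <<<= 7; AVX512VL's vprold performs each rotate in a single instruction.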
+ vpaddd ymm2,ymm2,ymm6 + vpaddd ymm3,ymm3,ymm7 + vpxor ymm12,ymm12,ymm0 + vpxor ymm13,ymm13,ymm1 + vpxor ymm14,ymm14,ymm2 + vpxor ymm15,ymm15,ymm3 + vprold ymm12,ymm12,16 + vprold ymm13,ymm13,16 + vprold ymm14,ymm14,16 + vprold ymm15,ymm15,16 + vpaddd ymm8,ymm8,ymm12 + vpaddd ymm9,ymm9,ymm13 + vpaddd ymm10,ymm10,ymm14 + vpaddd ymm11,ymm11,ymm15 + vpxor ymm4,ymm4,ymm8 + vpxor ymm5,ymm5,ymm9 + vpxor ymm6,ymm6,ymm10 + vpxor ymm7,ymm7,ymm11 + vprold ymm4,ymm4,12 + vprold ymm5,ymm5,12 + vprold ymm6,ymm6,12 + vprold ymm7,ymm7,12 + vpaddd ymm0,ymm0,ymm4 + vpaddd ymm1,ymm1,ymm5 + vpaddd ymm2,ymm2,ymm6 + vpaddd ymm3,ymm3,ymm7 + vpxor ymm12,ymm12,ymm0 + vpxor ymm13,ymm13,ymm1 + vpxor ymm14,ymm14,ymm2 + vpxor ymm15,ymm15,ymm3 + vprold ymm12,ymm12,8 + vprold ymm13,ymm13,8 + vprold ymm14,ymm14,8 + vprold ymm15,ymm15,8 + vpaddd ymm8,ymm8,ymm12 + vpaddd ymm9,ymm9,ymm13 + vpaddd ymm10,ymm10,ymm14 + vpaddd ymm11,ymm11,ymm15 + vpxor ymm4,ymm4,ymm8 + vpxor ymm5,ymm5,ymm9 + vpxor ymm6,ymm6,ymm10 + vpxor ymm7,ymm7,ymm11 + vprold ymm4,ymm4,7 + vprold ymm5,ymm5,7 + vprold ymm6,ymm6,7 + vprold ymm7,ymm7,7 + vpaddd ymm0,ymm0,ymm5 + vpaddd ymm1,ymm1,ymm6 + vpaddd ymm2,ymm2,ymm7 + vpaddd ymm3,ymm3,ymm4 + vpxor ymm15,ymm15,ymm0 + vpxor ymm12,ymm12,ymm1 + vpxor ymm13,ymm13,ymm2 + vpxor ymm14,ymm14,ymm3 + vprold ymm15,ymm15,16 + vprold ymm12,ymm12,16 + vprold ymm13,ymm13,16 + vprold ymm14,ymm14,16 + vpaddd ymm10,ymm10,ymm15 + vpaddd ymm11,ymm11,ymm12 + vpaddd ymm8,ymm8,ymm13 + vpaddd ymm9,ymm9,ymm14 + vpxor ymm5,ymm5,ymm10 + vpxor ymm6,ymm6,ymm11 + vpxor ymm7,ymm7,ymm8 + vpxor ymm4,ymm4,ymm9 + vprold ymm5,ymm5,12 + vprold ymm6,ymm6,12 + vprold ymm7,ymm7,12 + vprold ymm4,ymm4,12 + vpaddd ymm0,ymm0,ymm5 + vpaddd ymm1,ymm1,ymm6 + vpaddd ymm2,ymm2,ymm7 + vpaddd ymm3,ymm3,ymm4 + vpxor ymm15,ymm15,ymm0 + vpxor ymm12,ymm12,ymm1 + vpxor ymm13,ymm13,ymm2 + vpxor ymm14,ymm14,ymm3 + vprold ymm15,ymm15,8 + vprold ymm12,ymm12,8 + vprold ymm13,ymm13,8 + vprold ymm14,ymm14,8 + vpaddd ymm10,ymm10,ymm15 + vpaddd ymm11,ymm11,ymm12 + vpaddd ymm8,ymm8,ymm13 + vpaddd ymm9,ymm9,ymm14 + vpxor ymm5,ymm5,ymm10 + vpxor ymm6,ymm6,ymm11 + vpxor ymm7,ymm7,ymm8 + vpxor ymm4,ymm4,ymm9 + vprold ymm5,ymm5,7 + vprold ymm6,ymm6,7 + vprold ymm7,ymm7,7 + vprold ymm4,ymm4,7 + dec eax + jnz NEAR $L$oop8xvl + + vpaddd ymm0,ymm0,ymm16 + vpaddd ymm1,ymm1,ymm17 + vpaddd ymm2,ymm2,ymm18 + vpaddd ymm3,ymm3,ymm19 + + vpunpckldq ymm18,ymm0,ymm1 + vpunpckldq ymm19,ymm2,ymm3 + vpunpckhdq ymm0,ymm0,ymm1 + vpunpckhdq ymm2,ymm2,ymm3 + vpunpcklqdq ymm1,ymm18,ymm19 + vpunpckhqdq ymm18,ymm18,ymm19 + vpunpcklqdq ymm3,ymm0,ymm2 + vpunpckhqdq ymm0,ymm0,ymm2 + vpaddd ymm4,ymm4,ymm20 + vpaddd ymm5,ymm5,ymm21 + vpaddd ymm6,ymm6,ymm22 + vpaddd ymm7,ymm7,ymm23 + + vpunpckldq ymm2,ymm4,ymm5 + vpunpckldq ymm19,ymm6,ymm7 + vpunpckhdq ymm4,ymm4,ymm5 + vpunpckhdq ymm6,ymm6,ymm7 + vpunpcklqdq ymm5,ymm2,ymm19 + vpunpckhqdq ymm2,ymm2,ymm19 + vpunpcklqdq ymm7,ymm4,ymm6 + vpunpckhqdq ymm4,ymm4,ymm6 + vshufi32x4 ymm19,ymm1,ymm5,0 + vshufi32x4 ymm5,ymm1,ymm5,3 + vshufi32x4 ymm1,ymm18,ymm2,0 + vshufi32x4 ymm2,ymm18,ymm2,3 + vshufi32x4 ymm18,ymm3,ymm7,0 + vshufi32x4 ymm7,ymm3,ymm7,3 + vshufi32x4 ymm3,ymm0,ymm4,0 + vshufi32x4 ymm4,ymm0,ymm4,3 + vpaddd ymm8,ymm8,ymm24 + vpaddd ymm9,ymm9,ymm25 + vpaddd ymm10,ymm10,ymm26 + vpaddd ymm11,ymm11,ymm27 + + vpunpckldq ymm6,ymm8,ymm9 + vpunpckldq ymm0,ymm10,ymm11 + vpunpckhdq ymm8,ymm8,ymm9 + vpunpckhdq ymm10,ymm10,ymm11 + vpunpcklqdq ymm9,ymm6,ymm0 + vpunpckhqdq ymm6,ymm6,ymm0 + vpunpcklqdq ymm11,ymm8,ymm10 + vpunpckhqdq ymm8,ymm8,ymm10 + vpaddd 
ymm12,ymm12,ymm28 + vpaddd ymm13,ymm13,ymm29 + vpaddd ymm14,ymm14,ymm30 + vpaddd ymm15,ymm15,ymm31 + + vpunpckldq ymm10,ymm12,ymm13 + vpunpckldq ymm0,ymm14,ymm15 + vpunpckhdq ymm12,ymm12,ymm13 + vpunpckhdq ymm14,ymm14,ymm15 + vpunpcklqdq ymm13,ymm10,ymm0 + vpunpckhqdq ymm10,ymm10,ymm0 + vpunpcklqdq ymm15,ymm12,ymm14 + vpunpckhqdq ymm12,ymm12,ymm14 + vperm2i128 ymm0,ymm9,ymm13,0x20 + vperm2i128 ymm13,ymm9,ymm13,0x31 + vperm2i128 ymm9,ymm6,ymm10,0x20 + vperm2i128 ymm10,ymm6,ymm10,0x31 + vperm2i128 ymm6,ymm11,ymm15,0x20 + vperm2i128 ymm15,ymm11,ymm15,0x31 + vperm2i128 ymm11,ymm8,ymm12,0x20 + vperm2i128 ymm12,ymm8,ymm12,0x31 + cmp rdx,64*8 + jb NEAR $L$tail8xvl + + mov eax,0x80 + vpxord ymm19,ymm19,YMMWORD[rsi] + vpxor ymm0,ymm0,YMMWORD[32+rsi] + vpxor ymm5,ymm5,YMMWORD[64+rsi] + vpxor ymm13,ymm13,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu32 YMMWORD[rdi],ymm19 + vmovdqu YMMWORD[32+rdi],ymm0 + vmovdqu YMMWORD[64+rdi],ymm5 + vmovdqu YMMWORD[96+rdi],ymm13 + lea rdi,[rax*1+rdi] + + vpxor ymm1,ymm1,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vpxor ymm2,ymm2,YMMWORD[64+rsi] + vpxor ymm10,ymm10,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu YMMWORD[rdi],ymm1 + vmovdqu YMMWORD[32+rdi],ymm9 + vmovdqu YMMWORD[64+rdi],ymm2 + vmovdqu YMMWORD[96+rdi],ymm10 + lea rdi,[rax*1+rdi] + + vpxord ymm18,ymm18,YMMWORD[rsi] + vpxor ymm6,ymm6,YMMWORD[32+rsi] + vpxor ymm7,ymm7,YMMWORD[64+rsi] + vpxor ymm15,ymm15,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu32 YMMWORD[rdi],ymm18 + vmovdqu YMMWORD[32+rdi],ymm6 + vmovdqu YMMWORD[64+rdi],ymm7 + vmovdqu YMMWORD[96+rdi],ymm15 + lea rdi,[rax*1+rdi] + + vpxor ymm3,ymm3,YMMWORD[rsi] + vpxor ymm11,ymm11,YMMWORD[32+rsi] + vpxor ymm4,ymm4,YMMWORD[64+rsi] + vpxor ymm12,ymm12,YMMWORD[96+rsi] + lea rsi,[rax*1+rsi] + vmovdqu YMMWORD[rdi],ymm3 + vmovdqu YMMWORD[32+rdi],ymm11 + vmovdqu YMMWORD[64+rdi],ymm4 + vmovdqu YMMWORD[96+rdi],ymm12 + lea rdi,[rax*1+rdi] + + vpbroadcastd ymm0,DWORD[r10] + vpbroadcastd ymm1,DWORD[4+r10] + + sub rdx,64*8 + jnz NEAR $L$oop_outer8xvl + + jmp NEAR $L$done8xvl + +ALIGN 32 +$L$tail8xvl: + vmovdqa64 ymm8,ymm19 + xor r10,r10 + sub rdi,rsi + cmp rdx,64*1 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm8,ymm8,YMMWORD[rsi] + vpxor ymm0,ymm0,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm8 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm0 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm5 + vmovdqa ymm0,ymm13 + lea rsi,[64+rsi] + + cmp rdx,64*2 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm5,ymm5,YMMWORD[rsi] + vpxor ymm13,ymm13,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm5 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm13 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm1 + vmovdqa ymm0,ymm9 + lea rsi,[64+rsi] + + cmp rdx,64*3 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm1,ymm1,YMMWORD[rsi] + vpxor ymm9,ymm9,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm1 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm9 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm2 + vmovdqa ymm0,ymm10 + lea rsi,[64+rsi] + + cmp rdx,64*4 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm2,ymm2,YMMWORD[rsi] + vpxor ymm10,ymm10,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm2 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm10 + je NEAR $L$done8xvl + vmovdqa32 ymm8,ymm18 + vmovdqa ymm0,ymm6 + lea rsi,[64+rsi] + + cmp rdx,64*5 + jb NEAR $L$ess_than_64_8xvl + vpxord ymm18,ymm18,YMMWORD[rsi] + vpxor ymm6,ymm6,YMMWORD[32+rsi] + vmovdqu32 YMMWORD[rsi*1+rdi],ymm18 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm6 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm7 + vmovdqa ymm0,ymm15 + lea rsi,[64+rsi] + + cmp rdx,64*6 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm7,ymm7,YMMWORD[rsi] + vpxor 
ymm15,ymm15,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm7 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm15 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm3 + vmovdqa ymm0,ymm11 + lea rsi,[64+rsi] + + cmp rdx,64*7 + jb NEAR $L$ess_than_64_8xvl + vpxor ymm3,ymm3,YMMWORD[rsi] + vpxor ymm11,ymm11,YMMWORD[32+rsi] + vmovdqu YMMWORD[rsi*1+rdi],ymm3 + vmovdqu YMMWORD[32+rsi*1+rdi],ymm11 + je NEAR $L$done8xvl + vmovdqa ymm8,ymm4 + vmovdqa ymm0,ymm12 + lea rsi,[64+rsi] + +$L$ess_than_64_8xvl: + vmovdqa YMMWORD[rsp],ymm8 + vmovdqa YMMWORD[32+rsp],ymm0 + lea rdi,[rsi*1+rdi] + and rdx,63 + +$L$oop_tail8xvl: + movzx eax,BYTE[r10*1+rsi] + movzx ecx,BYTE[r10*1+rsp] + lea r10,[1+r10] + xor eax,ecx + mov BYTE[((-1))+r10*1+rdi],al + dec rdx + jnz NEAR $L$oop_tail8xvl + + vpxor ymm8,ymm8,ymm8 + vmovdqa YMMWORD[rsp],ymm8 + vmovdqa YMMWORD[32+rsp],ymm8 + +$L$done8xvl: + vzeroall + movaps xmm6,XMMWORD[((-168))+r9] + movaps xmm7,XMMWORD[((-152))+r9] + movaps xmm8,XMMWORD[((-136))+r9] + movaps xmm9,XMMWORD[((-120))+r9] + movaps xmm10,XMMWORD[((-104))+r9] + movaps xmm11,XMMWORD[((-88))+r9] + movaps xmm12,XMMWORD[((-72))+r9] + movaps xmm13,XMMWORD[((-56))+r9] + movaps xmm14,XMMWORD[((-40))+r9] + movaps xmm15,XMMWORD[((-24))+r9] + lea rsp,[r9] + +$L$8xvl_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_chacha20_8xvl: +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +ssse3_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[192+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rsi,[((-40))+rax] + lea rdi,[512+r8] + mov ecx,4 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + DB 0F3h,0C3h ;repret + + + +ALIGN 16 +full_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[192+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rsi,[((-168))+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + + jmp NEAR $L$common_seh_tail + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_chacha20_ssse3 wrt ..imagebase + DD $L$SEH_end_chacha20_ssse3 wrt ..imagebase + DD $L$SEH_info_chacha20_ssse3 wrt ..imagebase + + DD $L$SEH_begin_chacha20_4x wrt ..imagebase + DD $L$SEH_end_chacha20_4x wrt ..imagebase + DD $L$SEH_info_chacha20_4x wrt ..imagebase + DD $L$SEH_begin_chacha20_avx2 wrt ..imagebase + DD $L$SEH_end_chacha20_avx2 wrt ..imagebase + DD $L$SEH_info_chacha20_avx2 wrt 
..imagebase + DD $L$SEH_begin_chacha20_avx512 wrt ..imagebase + DD $L$SEH_end_chacha20_avx512 wrt ..imagebase + DD $L$SEH_info_chacha20_avx512 wrt ..imagebase + + DD $L$SEH_begin_chacha20_avx512vl wrt ..imagebase + DD $L$SEH_end_chacha20_avx512vl wrt ..imagebase + DD $L$SEH_info_chacha20_avx512vl wrt ..imagebase + + DD $L$SEH_begin_chacha20_16x wrt ..imagebase + DD $L$SEH_end_chacha20_16x wrt ..imagebase + DD $L$SEH_info_chacha20_16x wrt ..imagebase + + DD $L$SEH_begin_chacha20_8xvl wrt ..imagebase + DD $L$SEH_end_chacha20_8xvl wrt ..imagebase + DD $L$SEH_info_chacha20_8xvl wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_chacha20_ssse3: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$ssse3_body wrt ..imagebase,$L$ssse3_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_4x: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$4x_body wrt ..imagebase,$L$4x_epilogue wrt ..imagebase +$L$SEH_info_chacha20_avx2: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$8x_body wrt ..imagebase,$L$8x_epilogue wrt ..imagebase +$L$SEH_info_chacha20_avx512: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$avx512_body wrt ..imagebase,$L$avx512_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_avx512vl: +DB 9,0,0,0 + DD ssse3_handler wrt ..imagebase + DD $L$avx512vl_body wrt ..imagebase,$L$avx512vl_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_16x: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$16x_body wrt ..imagebase,$L$16x_epilogue wrt ..imagebase + +$L$SEH_info_chacha20_8xvl: +DB 9,0,0,0 + DD full_handler wrt ..imagebase + DD $L$8xvl_body wrt ..imagebase,$L$8xvl_epilogue wrt ..imagebase diff --git a/crypto/chacha20_x64_gas.s b/crypto/chacha20_x64_gas.s new file mode 100644 index 0000000..0aaf4ba --- /dev/null +++ b/crypto/chacha20_x64_gas.s @@ -0,0 +1,2623 @@ +.text + +.align 64 +.Lzero: +.long 0,0,0,0 +.Lone: +.long 1,0,0,0 +.Linc: +.long 0,1,2,3 +.Lfour: +.long 4,4,4,4 +.Lincy: +.long 0,2,4,6,1,3,5,7 +.Leight: +.long 8,8,8,8,8,8,8,8 +.Lrot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.Lrot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +.Lsigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.align 64 +.Lzeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.Lfourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.Lincz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lsixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.align 64 +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 + +.global hchacha20_ssse3 +.type hchacha20_ssse3,@function +.align 32 +hchacha20_ssse3: +.cfi_startproc +.Lhchacha20_ssse3: + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + movq $10,%r8 +.align 32 +.Loop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa 
%xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret +.cfi_endproc +.size hchacha20_ssse3,.-hchacha20_ssse3 +.global chacha20_ssse3 +.type chacha20_ssse3,@function +.align 32 +chacha20_ssse3: +.cfi_startproc +.Lchacha20_ssse3: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $128,%rdx + ja .Lchacha20_4x + +.Ldo_sse3_after_all: + subq $64+8,%rsp + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp .Loop_ssse3 + +.align 32 +.Loop_outer_ssse3: + movdqa .Lone(%rip),%xmm3 + movdqa 0(%rsp),%xmm0 + movdqa 16(%rsp),%xmm1 + movdqa 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + movq $10,%r8 + movdqa %xmm3,48(%rsp) + jmp .Loop_ssse3 + +.align 32 +.Loop_ssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_ssse3 + paddd 0(%rsp),%xmm0 + paddd 16(%rsp),%xmm1 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + + cmpq $64,%rdx + jb .Ltail_ssse3 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm0 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm1 + movdqu 48(%rsi),%xmm5 + leaq 64(%rsi),%rsi + pxor %xmm4,%xmm2 + pxor %xmm5,%xmm3 + + movdqu %xmm0,0(%rdi) + movdqu %xmm1,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rdx + jnz .Loop_outer_ssse3 + + jmp .Ldone_ssse3 + +.align 16 +.Ltail_ssse3: + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + xorq %r8,%r8 + +.Loop_tail_ssse3: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_ssse3 + +.Ldone_ssse3: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lssse3_epilogue: + ret +.cfi_endproc +.size chacha20_ssse3,.-chacha20_ssse3 +.global chacha20_4x +.type chacha20_4x,@function +.align 32 +chacha20_4x: +.cfi_startproc +.Lchacha20_4x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + + + + + + + + + + + +.Lproceed4x: + subq $0x140+8,%rsp + movdqa .Lsigma(%rip),%xmm11 + movdqu (%rcx),%xmm15 + movdqu 16(%rcx),%xmm7 + movdqu (%r8),%xmm3 + leaq 256(%rsp),%rcx + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + pshufd $0x00,%xmm11,%xmm8 + pshufd $0x55,%xmm11,%xmm9 + movdqa %xmm8,64(%rsp) + pshufd $0xaa,%xmm11,%xmm10 + movdqa %xmm9,80(%rsp) + pshufd $0xff,%xmm11,%xmm11 + movdqa %xmm10,96(%rsp) + movdqa %xmm11,112(%rsp) + + pshufd $0x00,%xmm15,%xmm12 + pshufd $0x55,%xmm15,%xmm13 + movdqa 
%xmm12,128-256(%rcx) + pshufd $0xaa,%xmm15,%xmm14 + movdqa %xmm13,144-256(%rcx) + pshufd $0xff,%xmm15,%xmm15 + movdqa %xmm14,160-256(%rcx) + movdqa %xmm15,176-256(%rcx) + + pshufd $0x00,%xmm7,%xmm4 + pshufd $0x55,%xmm7,%xmm5 + movdqa %xmm4,192-256(%rcx) + pshufd $0xaa,%xmm7,%xmm6 + movdqa %xmm5,208-256(%rcx) + pshufd $0xff,%xmm7,%xmm7 + movdqa %xmm6,224-256(%rcx) + movdqa %xmm7,240-256(%rcx) + + pshufd $0x00,%xmm3,%xmm0 + pshufd $0x55,%xmm3,%xmm1 + paddd .Linc(%rip),%xmm0 + pshufd $0xaa,%xmm3,%xmm2 + movdqa %xmm1,272-256(%rcx) + pshufd $0xff,%xmm3,%xmm3 + movdqa %xmm2,288-256(%rcx) + movdqa %xmm3,304-256(%rcx) + + jmp .Loop_enter4x + +.align 32 +.Loop_outer4x: + movdqa 64(%rsp),%xmm8 + movdqa 80(%rsp),%xmm9 + movdqa 96(%rsp),%xmm10 + movdqa 112(%rsp),%xmm11 + movdqa 128-256(%rcx),%xmm12 + movdqa 144-256(%rcx),%xmm13 + movdqa 160-256(%rcx),%xmm14 + movdqa 176-256(%rcx),%xmm15 + movdqa 192-256(%rcx),%xmm4 + movdqa 208-256(%rcx),%xmm5 + movdqa 224-256(%rcx),%xmm6 + movdqa 240-256(%rcx),%xmm7 + movdqa 256-256(%rcx),%xmm0 + movdqa 272-256(%rcx),%xmm1 + movdqa 288-256(%rcx),%xmm2 + movdqa 304-256(%rcx),%xmm3 + paddd .Lfour(%rip),%xmm0 + +.Loop_enter4x: + movdqa %xmm6,32(%rsp) + movdqa %xmm7,48(%rsp) + movdqa (%r10),%xmm7 + movl $10,%eax + movdqa %xmm0,256-256(%rcx) + jmp .Loop4x + +.align 32 +.Loop4x: + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm7,%xmm0 + pshufb %xmm7,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm6 + pslld $12,%xmm12 + psrld $20,%xmm6 + movdqa %xmm13,%xmm7 + pslld $12,%xmm13 + por %xmm6,%xmm12 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm13 + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm6,%xmm0 + pshufb %xmm6,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm7 + pslld $7,%xmm12 + psrld $25,%xmm7 + movdqa %xmm13,%xmm6 + pslld $7,%xmm13 + por %xmm7,%xmm12 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm13 + movdqa %xmm4,0(%rsp) + movdqa %xmm5,16(%rsp) + movdqa 32(%rsp),%xmm4 + movdqa 48(%rsp),%xmm5 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm7,%xmm2 + pshufb %xmm7,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm6 + pslld $12,%xmm14 + psrld $20,%xmm6 + movdqa %xmm15,%xmm7 + pslld $12,%xmm15 + por %xmm6,%xmm14 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm15 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm6,%xmm2 + pshufb %xmm6,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm7 + pslld $7,%xmm14 + psrld $25,%xmm7 + movdqa %xmm15,%xmm6 + pslld $7,%xmm15 + por %xmm7,%xmm14 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm15 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm7,%xmm3 + pshufb %xmm7,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm6 + pslld $12,%xmm13 + psrld $20,%xmm6 + movdqa %xmm14,%xmm7 + pslld $12,%xmm14 + por %xmm6,%xmm13 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm14 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm6,%xmm3 + pshufb %xmm6,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm7 + pslld 
$7,%xmm13 + psrld $25,%xmm7 + movdqa %xmm14,%xmm6 + pslld $7,%xmm14 + por %xmm7,%xmm13 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm14 + movdqa %xmm4,32(%rsp) + movdqa %xmm5,48(%rsp) + movdqa 0(%rsp),%xmm4 + movdqa 16(%rsp),%xmm5 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm7,%xmm1 + pshufb %xmm7,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm6 + pslld $12,%xmm15 + psrld $20,%xmm6 + movdqa %xmm12,%xmm7 + pslld $12,%xmm12 + por %xmm6,%xmm15 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm12 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm6,%xmm1 + pshufb %xmm6,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm7 + pslld $7,%xmm15 + psrld $25,%xmm7 + movdqa %xmm12,%xmm6 + pslld $7,%xmm12 + por %xmm7,%xmm15 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm12 + decl %eax + jnz .Loop4x + + paddd 64(%rsp),%xmm8 + paddd 80(%rsp),%xmm9 + paddd 96(%rsp),%xmm10 + paddd 112(%rsp),%xmm11 + + movdqa %xmm8,%xmm6 + punpckldq %xmm9,%xmm8 + movdqa %xmm10,%xmm7 + punpckldq %xmm11,%xmm10 + punpckhdq %xmm9,%xmm6 + punpckhdq %xmm11,%xmm7 + movdqa %xmm8,%xmm9 + punpcklqdq %xmm10,%xmm8 + movdqa %xmm6,%xmm11 + punpcklqdq %xmm7,%xmm6 + punpckhqdq %xmm10,%xmm9 + punpckhqdq %xmm7,%xmm11 + paddd 128-256(%rcx),%xmm12 + paddd 144-256(%rcx),%xmm13 + paddd 160-256(%rcx),%xmm14 + paddd 176-256(%rcx),%xmm15 + + movdqa %xmm8,0(%rsp) + movdqa %xmm9,16(%rsp) + movdqa 32(%rsp),%xmm8 + movdqa 48(%rsp),%xmm9 + + movdqa %xmm12,%xmm10 + punpckldq %xmm13,%xmm12 + movdqa %xmm14,%xmm7 + punpckldq %xmm15,%xmm14 + punpckhdq %xmm13,%xmm10 + punpckhdq %xmm15,%xmm7 + movdqa %xmm12,%xmm13 + punpcklqdq %xmm14,%xmm12 + movdqa %xmm10,%xmm15 + punpcklqdq %xmm7,%xmm10 + punpckhqdq %xmm14,%xmm13 + punpckhqdq %xmm7,%xmm15 + paddd 192-256(%rcx),%xmm4 + paddd 208-256(%rcx),%xmm5 + paddd 224-256(%rcx),%xmm8 + paddd 240-256(%rcx),%xmm9 + + movdqa %xmm6,32(%rsp) + movdqa %xmm11,48(%rsp) + + movdqa %xmm4,%xmm14 + punpckldq %xmm5,%xmm4 + movdqa %xmm8,%xmm7 + punpckldq %xmm9,%xmm8 + punpckhdq %xmm5,%xmm14 + punpckhdq %xmm9,%xmm7 + movdqa %xmm4,%xmm5 + punpcklqdq %xmm8,%xmm4 + movdqa %xmm14,%xmm9 + punpcklqdq %xmm7,%xmm14 + punpckhqdq %xmm8,%xmm5 + punpckhqdq %xmm7,%xmm9 + paddd 256-256(%rcx),%xmm0 + paddd 272-256(%rcx),%xmm1 + paddd 288-256(%rcx),%xmm2 + paddd 304-256(%rcx),%xmm3 + + movdqa %xmm0,%xmm8 + punpckldq %xmm1,%xmm0 + movdqa %xmm2,%xmm7 + punpckldq %xmm3,%xmm2 + punpckhdq %xmm1,%xmm8 + punpckhdq %xmm3,%xmm7 + movdqa %xmm0,%xmm1 + punpcklqdq %xmm2,%xmm0 + movdqa %xmm8,%xmm3 + punpcklqdq %xmm7,%xmm8 + punpckhqdq %xmm2,%xmm1 + punpckhqdq %xmm7,%xmm3 + cmpq $256,%rdx + jb .Ltail4x + + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor 
%xmm14,%xmm2 + pxor %xmm8,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 48(%rsp),%xmm6 + pxor %xmm15,%xmm11 + pxor %xmm9,%xmm2 + pxor %xmm3,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz .Loop_outer4x + + jmp .Ldone4x + +.Ltail4x: + cmpq $192,%rdx + jae .L192_or_more4x + cmpq $128,%rdx + jae .L128_or_more4x + cmpq $64,%rdx + jae .L64_or_more4x + + + xorq %r10,%r10 + + movdqa %xmm12,16(%rsp) + movdqa %xmm4,32(%rsp) + movdqa %xmm0,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L64_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 16(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm13,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm5,32(%rsp) + subq $64,%rdx + movdqa %xmm1,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L128_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + je .Ldone4x + + movdqa 32(%rsp),%xmm6 + leaq 128(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm10,16(%rsp) + leaq 128(%rdi),%rdi + movdqa %xmm14,32(%rsp) + subq $128,%rdx + movdqa %xmm8,48(%rsp) + jmp .Loop_tail4x + +.align 32 +.L192_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je .Ldone4x + + movdqa 48(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm15,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm9,32(%rsp) + subq $192,%rdx + movdqa %xmm3,48(%rsp) + +.Loop_tail4x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail4x + +.Ldone4x: + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4x_epilogue: + ret +.cfi_endproc +.size chacha20_4x,.-chacha20_4x +.global chacha20_avx2 +.type 
chacha20_avx2,@function +.align 32 +chacha20_avx2: +.cfi_startproc +.Lchacha20_avx2: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $0x280+8,%rsp + andq $-32,%rsp + vzeroupper + + vbroadcasti128 .Lsigma(%rip),%ymm11 + vbroadcasti128 (%rcx),%ymm3 + vbroadcasti128 16(%rcx),%ymm15 + vbroadcasti128 (%r8),%ymm7 + leaq 256(%rsp),%rcx + leaq 512(%rsp),%rax + leaq .Lrot16(%rip),%r10 + leaq .Lrot24(%rip),%r11 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vmovdqa %ymm8,128-256(%rcx) + vpshufd $0xaa,%ymm11,%ymm10 + vmovdqa %ymm9,160-256(%rcx) + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa %ymm10,192-256(%rcx) + vmovdqa %ymm11,224-256(%rcx) + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vmovdqa %ymm0,256-256(%rcx) + vpshufd $0xaa,%ymm3,%ymm2 + vmovdqa %ymm1,288-256(%rcx) + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa %ymm2,320-256(%rcx) + vmovdqa %ymm3,352-256(%rcx) + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vmovdqa %ymm12,384-512(%rax) + vpshufd $0xaa,%ymm15,%ymm14 + vmovdqa %ymm13,416-512(%rax) + vpshufd $0xff,%ymm15,%ymm15 + vmovdqa %ymm14,448-512(%rax) + vmovdqa %ymm15,480-512(%rax) + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpaddd .Lincy(%rip),%ymm4,%ymm4 + vpshufd $0xaa,%ymm7,%ymm6 + vmovdqa %ymm5,544-512(%rax) + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa %ymm6,576-512(%rax) + vmovdqa %ymm7,608-512(%rax) + + jmp .Loop_enter8x + +.align 32 +.Loop_outer8x: + vmovdqa 128-256(%rcx),%ymm8 + vmovdqa 160-256(%rcx),%ymm9 + vmovdqa 192-256(%rcx),%ymm10 + vmovdqa 224-256(%rcx),%ymm11 + vmovdqa 256-256(%rcx),%ymm0 + vmovdqa 288-256(%rcx),%ymm1 + vmovdqa 320-256(%rcx),%ymm2 + vmovdqa 352-256(%rcx),%ymm3 + vmovdqa 384-512(%rax),%ymm12 + vmovdqa 416-512(%rax),%ymm13 + vmovdqa 448-512(%rax),%ymm14 + vmovdqa 480-512(%rax),%ymm15 + vmovdqa 512-512(%rax),%ymm4 + vmovdqa 544-512(%rax),%ymm5 + vmovdqa 576-512(%rax),%ymm6 + vmovdqa 608-512(%rax),%ymm7 + vpaddd .Leight(%rip),%ymm4,%ymm4 + +.Loop_enter8x: + vmovdqa %ymm14,64(%rsp) + vmovdqa %ymm15,96(%rsp) + vbroadcasti128 (%r10),%ymm15 + vmovdqa %ymm4,512-512(%rax) + movl $10,%eax + jmp .Loop8x + +.align 32 +.Loop8x: + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $12,%ymm0,%ymm14 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $12,%ymm1,%ymm15 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $7,%ymm0,%ymm15 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $7,%ymm1,%ymm14 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vmovdqa %ymm12,0(%rsp) + vmovdqa %ymm13,32(%rsp) + vmovdqa 64(%rsp),%ymm12 + vmovdqa 96(%rsp),%ymm13 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $12,%ymm2,%ymm14 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld 
$12,%ymm3,%ymm15 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $7,%ymm2,%ymm15 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $7,%ymm3,%ymm14 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $12,%ymm1,%ymm14 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $12,%ymm2,%ymm15 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $7,%ymm1,%ymm15 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $7,%ymm2,%ymm14 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vmovdqa %ymm12,64(%rsp) + vmovdqa %ymm13,96(%rsp) + vmovdqa 0(%rsp),%ymm12 + vmovdqa 32(%rsp),%ymm13 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $12,%ymm3,%ymm14 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $12,%ymm0,%ymm15 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $7,%ymm3,%ymm15 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $7,%ymm0,%ymm14 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + decl %eax + jnz .Loop8x + + leaq 512(%rsp),%rax + vpaddd 128-256(%rcx),%ymm8,%ymm8 + vpaddd 160-256(%rcx),%ymm9,%ymm9 + vpaddd 192-256(%rcx),%ymm10,%ymm10 + vpaddd 224-256(%rcx),%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm14 + vpunpckldq %ymm11,%ymm10,%ymm15 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm15,%ymm14,%ymm9 + vpunpckhqdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd 256-256(%rcx),%ymm0,%ymm0 + vpaddd 288-256(%rcx),%ymm1,%ymm1 + vpaddd 320-256(%rcx),%ymm2,%ymm2 + vpaddd 352-256(%rcx),%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm10 + vpunpckldq %ymm3,%ymm2,%ymm15 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm10,%ymm1 + vpunpckhqdq %ymm15,%ymm10,%ymm10 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vperm2i128 $0x20,%ymm1,%ymm9,%ymm15 + vperm2i128 $0x31,%ymm1,%ymm9,%ymm1 + vperm2i128 $0x20,%ymm10,%ymm14,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm14,%ymm10 + vperm2i128 $0x20,%ymm3,%ymm11,%ymm14 + 
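# This vpunpck/vperm2i128 ladder transposes the state from word-sliced order (each ymm holding the same state word for eight blocks) back into byte order, yielding contiguous 64-byte keystream blocks for the XOR with the input. +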
vperm2i128 $0x31,%ymm3,%ymm11,%ymm3 + vperm2i128 $0x20,%ymm0,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm0,%ymm8,%ymm0 + vmovdqa %ymm15,0(%rsp) + vmovdqa %ymm9,32(%rsp) + vmovdqa 64(%rsp),%ymm15 + vmovdqa 96(%rsp),%ymm9 + + vpaddd 384-512(%rax),%ymm12,%ymm12 + vpaddd 416-512(%rax),%ymm13,%ymm13 + vpaddd 448-512(%rax),%ymm15,%ymm15 + vpaddd 480-512(%rax),%ymm9,%ymm9 + + vpunpckldq %ymm13,%ymm12,%ymm2 + vpunpckldq %ymm9,%ymm15,%ymm8 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm9,%ymm15,%ymm15 + vpunpcklqdq %ymm8,%ymm2,%ymm13 + vpunpckhqdq %ymm8,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm12,%ymm9 + vpunpckhqdq %ymm15,%ymm12,%ymm12 + vpaddd 512-512(%rax),%ymm4,%ymm4 + vpaddd 544-512(%rax),%ymm5,%ymm5 + vpaddd 576-512(%rax),%ymm6,%ymm6 + vpaddd 608-512(%rax),%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm15 + vpunpckldq %ymm7,%ymm6,%ymm8 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm8,%ymm15,%ymm5 + vpunpckhqdq %ymm8,%ymm15,%ymm15 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vperm2i128 $0x20,%ymm5,%ymm13,%ymm8 + vperm2i128 $0x31,%ymm5,%ymm13,%ymm5 + vperm2i128 $0x20,%ymm15,%ymm2,%ymm13 + vperm2i128 $0x31,%ymm15,%ymm2,%ymm15 + vperm2i128 $0x20,%ymm7,%ymm9,%ymm2 + vperm2i128 $0x31,%ymm7,%ymm9,%ymm7 + vperm2i128 $0x20,%ymm4,%ymm12,%ymm9 + vperm2i128 $0x31,%ymm4,%ymm12,%ymm4 + vmovdqa 0(%rsp),%ymm6 + vmovdqa 32(%rsp),%ymm12 + + cmpq $512,%rdx + jb .Ltail8x + + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + leaq 128(%rsi),%rsi + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm12,%ymm12 + vpxor 32(%rsi),%ymm13,%ymm13 + vpxor 64(%rsi),%ymm10,%ymm10 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq 128(%rsi),%rsi + vmovdqu %ymm12,0(%rdi) + vmovdqu %ymm13,32(%rdi) + vmovdqu %ymm10,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm14,%ymm14 + vpxor 32(%rsi),%ymm2,%ymm2 + vpxor 64(%rsi),%ymm3,%ymm3 + vpxor 96(%rsi),%ymm7,%ymm7 + leaq 128(%rsi),%rsi + vmovdqu %ymm14,0(%rdi) + vmovdqu %ymm2,32(%rdi) + vmovdqu %ymm3,64(%rdi) + vmovdqu %ymm7,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm11,%ymm11 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm0,%ymm0 + vpxor 96(%rsi),%ymm4,%ymm4 + leaq 128(%rsi),%rsi + vmovdqu %ymm11,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm0,64(%rdi) + vmovdqu %ymm4,96(%rdi) + leaq 128(%rdi),%rdi + + subq $512,%rdx + jnz .Loop_outer8x + + jmp .Ldone8x + +.Ltail8x: + cmpq $448,%rdx + jae .L448_or_more8x + cmpq $384,%rdx + jae .L384_or_more8x + cmpq $320,%rdx + jae .L320_or_more8x + cmpq $256,%rdx + jae .L256_or_more8x + cmpq $192,%rdx + jae .L192_or_more8x + cmpq $128,%rdx + jae .L128_or_more8x + cmpq $64,%rdx + jae .L64_or_more8x + + xorq %r10,%r10 + vmovdqa %ymm6,0(%rsp) + vmovdqa %ymm8,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L64_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + je .Ldone8x + + leaq 64(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm1,0(%rsp) + leaq 64(%rdi),%rdi + subq $64,%rdx + vmovdqa %ymm5,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L128_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + je .Ldone8x + + leaq 128(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm12,0(%rsp) + leaq 128(%rdi),%rdi + subq 
$128,%rdx + vmovdqa %ymm13,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L192_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + je .Ldone8x + + leaq 192(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm10,0(%rsp) + leaq 192(%rdi),%rdi + subq $192,%rdx + vmovdqa %ymm15,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L256_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + je .Ldone8x + + leaq 256(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm14,0(%rsp) + leaq 256(%rdi),%rdi + subq $256,%rdx + vmovdqa %ymm2,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L320_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + je .Ldone8x + + leaq 320(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm3,0(%rsp) + leaq 320(%rdi),%rdi + subq $320,%rdx + vmovdqa %ymm7,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L384_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + je .Ldone8x + + leaq 384(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm11,0(%rsp) + leaq 384(%rdi),%rdi + subq $384,%rdx + vmovdqa %ymm9,32(%rsp) + jmp .Loop_tail8x + +.align 32 +.L448_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vpxor 384(%rsi),%ymm11,%ymm11 + vpxor 416(%rsi),%ymm9,%ymm9 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + 
vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + vmovdqu %ymm11,384(%rdi) + vmovdqu %ymm9,416(%rdi) + je .Ldone8x + + leaq 448(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm0,0(%rsp) + leaq 448(%rdi),%rdi + subq $448,%rdx + vmovdqa %ymm4,32(%rsp) + +.Loop_tail8x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail8x + +.Ldone8x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L8x_epilogue: + ret +.cfi_endproc +.size chacha20_avx2,.-chacha20_avx2 +.global chacha20_avx512 +.type chacha20_avx512,@function +.align 32 +chacha20_avx512: +.cfi_startproc +.Lchacha20_avx512: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $512,%rdx + ja .Lchacha20_16x + + subq $64+8,%rsp + vbroadcasti32x4 .Lsigma(%rip),%zmm0 + vbroadcasti32x4 (%rcx),%zmm1 + vbroadcasti32x4 16(%rcx),%zmm2 + vbroadcasti32x4 (%r8),%zmm3 + + vmovdqa32 %zmm0,%zmm16 + vmovdqa32 %zmm1,%zmm17 + vmovdqa32 %zmm2,%zmm18 + vpaddd .Lzeroz(%rip),%zmm3,%zmm3 + vmovdqa32 .Lfourz(%rip),%zmm20 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 16 +.Loop_outer_avx512: + vmovdqa32 %zmm16,%zmm0 + vmovdqa32 %zmm17,%zmm1 + vmovdqa32 %zmm18,%zmm2 + vpaddd %zmm20,%zmm19,%zmm3 + movq $10,%r8 + vmovdqa32 %zmm3,%zmm19 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $57,%zmm1,%zmm1 + vpshufd $147,%zmm3,%zmm3 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $16,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $12,%zmm1,%zmm1 + vpaddd %zmm1,%zmm0,%zmm0 + vpxord %zmm0,%zmm3,%zmm3 + vprold $8,%zmm3,%zmm3 + vpaddd %zmm3,%zmm2,%zmm2 + vpxord %zmm2,%zmm1,%zmm1 + vprold $7,%zmm1,%zmm1 + vpshufd $78,%zmm2,%zmm2 + vpshufd $147,%zmm1,%zmm1 + vpshufd $57,%zmm3,%zmm3 + decq %r8 + jnz .Loop_avx512 + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + subq $64,%rdx + jb .Ltail64_avx512 + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $1,%zmm0,%xmm4 + vextracti32x4 $1,%zmm1,%xmm5 + vextracti32x4 $1,%zmm2,%xmm6 + vextracti32x4 $1,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $2,%zmm0,%xmm4 + vextracti32x4 $2,%zmm1,%xmm5 + vextracti32x4 $2,%zmm2,%xmm6 + vextracti32x4 $2,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512 + + vextracti32x4 $3,%zmm0,%xmm4 + vextracti32x4 
$3,%zmm1,%xmm5 + vextracti32x4 $3,%zmm2,%xmm6 + vextracti32x4 $3,%zmm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512 + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jnz .Loop_outer_avx512 + + jmp .Ldone_avx512 + +.align 16 +.Ltail64_avx512: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512 + +.align 16 +.Ltail_avx512: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512 + + vmovdqu32 %zmm16,0(%rsp) + +.Ldone_avx512: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512_epilogue: + ret +.cfi_endproc +.size chacha20_avx512,.-chacha20_avx512 +.global chacha20_avx512vl +.type chacha20_avx512vl,@function +.align 32 +chacha20_avx512vl: +.cfi_startproc +.Lchacha20_avx512vl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + cmpq $128,%rdx + ja .Lchacha20_8xvl + + subq $64+8,%rsp + vbroadcasti128 .Lsigma(%rip),%ymm0 + vbroadcasti128 (%rcx),%ymm1 + vbroadcasti128 16(%rcx),%ymm2 + vbroadcasti128 (%r8),%ymm3 + + vmovdqa32 %ymm0,%ymm16 + vmovdqa32 %ymm1,%ymm17 + vmovdqa32 %ymm2,%ymm18 + vpaddd .Lzeroz(%rip),%ymm3,%ymm3 + vmovdqa32 .Ltwoy(%rip),%ymm20 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 %ymm18,%ymm2 + vpaddd %ymm20,%ymm19,%ymm3 + movq $10,%r8 + vmovdqa32 %ymm3,%ymm19 + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $57,%ymm1,%ymm1 + vpshufd $147,%ymm3,%ymm3 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $16,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $12,%ymm1,%ymm1 + vpaddd %ymm1,%ymm0,%ymm0 + vpxor %ymm0,%ymm3,%ymm3 + vprold $8,%ymm3,%ymm3 + vpaddd %ymm3,%ymm2,%ymm2 + vpxor %ymm2,%ymm1,%ymm1 + vprold $7,%ymm1,%ymm1 + vpshufd $78,%ymm2,%ymm2 + vpshufd $147,%ymm1,%ymm1 + vpshufd $57,%ymm3,%ymm3 + decq %r8 + jnz .Loop_avx512vl + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + subq $64,%rdx + jb .Ltail64_avx512vl + + vpxor 0(%rsi),%xmm0,%xmm4 + vpxor 16(%rsi),%xmm1,%xmm5 + vpxor 32(%rsi),%xmm2,%xmm6 + vpxor 48(%rsi),%xmm3,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + jz .Ldone_avx512vl + + vextracti128 $1,%ymm0,%xmm4 + vextracti128 $1,%ymm1,%xmm5 + vextracti128 $1,%ymm2,%xmm6 + vextracti128 $1,%ymm3,%xmm7 + + subq $64,%rdx + jb .Ltail_avx512vl + + vpxor 0(%rsi),%xmm4,%xmm4 + vpxor 16(%rsi),%xmm5,%xmm5 + vpxor 32(%rsi),%xmm6,%xmm6 + vpxor 48(%rsi),%xmm7,%xmm7 + leaq 64(%rsi),%rsi + + vmovdqu %xmm4,0(%rdi) + vmovdqu %xmm5,16(%rdi) + vmovdqu %xmm6,32(%rdi) + vmovdqu %xmm7,48(%rdi) + leaq 64(%rdi),%rdi + + vmovdqa32 %ymm16,%ymm0 + vmovdqa32 
%ymm17,%ymm1 + jnz .Loop_outer_avx512vl + + jmp .Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %xmm0,0(%rsp) + vmovdqa %xmm1,16(%rsp) + vmovdqa %xmm2,32(%rsp) + vmovdqa %xmm3,48(%rsp) + addq $64,%rdx + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa %xmm4,0(%rsp) + vmovdqa %xmm5,16(%rsp) + vmovdqa %xmm6,32(%rsp) + vmovdqa %xmm7,48(%rsp) + addq $64,%rdx + +.Loop_tail_avx512vl: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz .Loop_tail_avx512vl + + vmovdqu32 %ymm16,0(%rsp) + vmovdqu32 %ymm16,32(%rsp) + +.Ldone_avx512vl: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512vl_epilogue: + ret +.cfi_endproc +.size chacha20_avx512vl,.-chacha20_avx512vl +.global chacha20_16x +.type chacha20_16x,@function +.align 32 +chacha20_16x: +.cfi_startproc +.Lchacha20_16x: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti32x4 (%r10),%zmm3 + vbroadcasti32x4 (%rcx),%zmm7 + vbroadcasti32x4 16(%rcx),%zmm11 + vbroadcasti32x4 (%r8),%zmm15 + + vpshufd $0x00,%zmm3,%zmm0 + vpshufd $0x55,%zmm3,%zmm1 + vpshufd $0xaa,%zmm3,%zmm2 + vpshufd $0xff,%zmm3,%zmm3 + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + vpshufd $0x00,%zmm7,%zmm4 + vpshufd $0x55,%zmm7,%zmm5 + vpshufd $0xaa,%zmm7,%zmm6 + vpshufd $0xff,%zmm7,%zmm7 + vmovdqa64 %zmm4,%zmm20 + vmovdqa64 %zmm5,%zmm21 + vmovdqa64 %zmm6,%zmm22 + vmovdqa64 %zmm7,%zmm23 + + vpshufd $0x00,%zmm11,%zmm8 + vpshufd $0x55,%zmm11,%zmm9 + vpshufd $0xaa,%zmm11,%zmm10 + vpshufd $0xff,%zmm11,%zmm11 + vmovdqa64 %zmm8,%zmm24 + vmovdqa64 %zmm9,%zmm25 + vmovdqa64 %zmm10,%zmm26 + vmovdqa64 %zmm11,%zmm27 + + vpshufd $0x00,%zmm15,%zmm12 + vpshufd $0x55,%zmm15,%zmm13 + vpshufd $0xaa,%zmm15,%zmm14 + vpshufd $0xff,%zmm15,%zmm15 + vpaddd .Lincz(%rip),%zmm12,%zmm12 + vmovdqa64 %zmm12,%zmm28 + vmovdqa64 %zmm13,%zmm29 + vmovdqa64 %zmm14,%zmm30 + vmovdqa64 %zmm15,%zmm31 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop_outer16x: + vpbroadcastd 0(%r10),%zmm0 + vpbroadcastd 4(%r10),%zmm1 + vpbroadcastd 8(%r10),%zmm2 + vpbroadcastd 12(%r10),%zmm3 + vpaddd .Lsixteen(%rip),%zmm28,%zmm28 + vmovdqa64 %zmm20,%zmm4 + vmovdqa64 %zmm21,%zmm5 + vmovdqa64 %zmm22,%zmm6 + vmovdqa64 %zmm23,%zmm7 + vmovdqa64 %zmm24,%zmm8 + vmovdqa64 %zmm25,%zmm9 + vmovdqa64 %zmm26,%zmm10 + vmovdqa64 %zmm27,%zmm11 + vmovdqa64 %zmm28,%zmm12 + vmovdqa64 %zmm29,%zmm13 + vmovdqa64 %zmm30,%zmm14 + vmovdqa64 %zmm31,%zmm15 + + vmovdqa64 %zmm0,%zmm16 + vmovdqa64 %zmm1,%zmm17 + vmovdqa64 %zmm2,%zmm18 + vmovdqa64 %zmm3,%zmm19 + + movl $10,%eax + jmp .Loop16x + +.align 32 +.Loop16x: + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord %zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold $16,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vpaddd %zmm4,%zmm0,%zmm0 + vpaddd %zmm5,%zmm1,%zmm1 + vpaddd %zmm6,%zmm2,%zmm2 + vpaddd %zmm7,%zmm3,%zmm3 + vpxord %zmm0,%zmm12,%zmm12 + vpxord 
%zmm1,%zmm13,%zmm13 + vpxord %zmm2,%zmm14,%zmm14 + vpxord %zmm3,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vpaddd %zmm12,%zmm8,%zmm8 + vpaddd %zmm13,%zmm9,%zmm9 + vpaddd %zmm14,%zmm10,%zmm10 + vpaddd %zmm15,%zmm11,%zmm11 + vpxord %zmm8,%zmm4,%zmm4 + vpxord %zmm9,%zmm5,%zmm5 + vpxord %zmm10,%zmm6,%zmm6 + vpxord %zmm11,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $16,%zmm15,%zmm15 + vprold $16,%zmm12,%zmm12 + vprold $16,%zmm13,%zmm13 + vprold $16,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $12,%zmm5,%zmm5 + vprold $12,%zmm6,%zmm6 + vprold $12,%zmm7,%zmm7 + vprold $12,%zmm4,%zmm4 + vpaddd %zmm5,%zmm0,%zmm0 + vpaddd %zmm6,%zmm1,%zmm1 + vpaddd %zmm7,%zmm2,%zmm2 + vpaddd %zmm4,%zmm3,%zmm3 + vpxord %zmm0,%zmm15,%zmm15 + vpxord %zmm1,%zmm12,%zmm12 + vpxord %zmm2,%zmm13,%zmm13 + vpxord %zmm3,%zmm14,%zmm14 + vprold $8,%zmm15,%zmm15 + vprold $8,%zmm12,%zmm12 + vprold $8,%zmm13,%zmm13 + vprold $8,%zmm14,%zmm14 + vpaddd %zmm15,%zmm10,%zmm10 + vpaddd %zmm12,%zmm11,%zmm11 + vpaddd %zmm13,%zmm8,%zmm8 + vpaddd %zmm14,%zmm9,%zmm9 + vpxord %zmm10,%zmm5,%zmm5 + vpxord %zmm11,%zmm6,%zmm6 + vpxord %zmm8,%zmm7,%zmm7 + vpxord %zmm9,%zmm4,%zmm4 + vprold $7,%zmm5,%zmm5 + vprold $7,%zmm6,%zmm6 + vprold $7,%zmm7,%zmm7 + vprold $7,%zmm4,%zmm4 + decl %eax + jnz .Loop16x + + vpaddd %zmm16,%zmm0,%zmm0 + vpaddd %zmm17,%zmm1,%zmm1 + vpaddd %zmm18,%zmm2,%zmm2 + vpaddd %zmm19,%zmm3,%zmm3 + + vpunpckldq %zmm1,%zmm0,%zmm18 + vpunpckldq %zmm3,%zmm2,%zmm19 + vpunpckhdq %zmm1,%zmm0,%zmm0 + vpunpckhdq %zmm3,%zmm2,%zmm2 + vpunpcklqdq %zmm19,%zmm18,%zmm1 + vpunpckhqdq %zmm19,%zmm18,%zmm18 + vpunpcklqdq %zmm2,%zmm0,%zmm3 + vpunpckhqdq %zmm2,%zmm0,%zmm0 + vpaddd %zmm20,%zmm4,%zmm4 + vpaddd %zmm21,%zmm5,%zmm5 + vpaddd %zmm22,%zmm6,%zmm6 + vpaddd %zmm23,%zmm7,%zmm7 + + vpunpckldq %zmm5,%zmm4,%zmm2 + vpunpckldq %zmm7,%zmm6,%zmm19 + vpunpckhdq %zmm5,%zmm4,%zmm4 + vpunpckhdq %zmm7,%zmm6,%zmm6 + vpunpcklqdq %zmm19,%zmm2,%zmm5 + vpunpckhqdq %zmm19,%zmm2,%zmm2 + vpunpcklqdq %zmm6,%zmm4,%zmm7 + vpunpckhqdq %zmm6,%zmm4,%zmm4 + vshufi32x4 $0x44,%zmm5,%zmm1,%zmm19 + vshufi32x4 $0xee,%zmm5,%zmm1,%zmm5 + vshufi32x4 $0x44,%zmm2,%zmm18,%zmm1 + vshufi32x4 $0xee,%zmm2,%zmm18,%zmm2 + vshufi32x4 $0x44,%zmm7,%zmm3,%zmm18 + vshufi32x4 $0xee,%zmm7,%zmm3,%zmm7 + vshufi32x4 $0x44,%zmm4,%zmm0,%zmm3 + vshufi32x4 $0xee,%zmm4,%zmm0,%zmm4 + vpaddd %zmm24,%zmm8,%zmm8 + vpaddd %zmm25,%zmm9,%zmm9 + vpaddd %zmm26,%zmm10,%zmm10 + vpaddd %zmm27,%zmm11,%zmm11 + + vpunpckldq %zmm9,%zmm8,%zmm6 + vpunpckldq %zmm11,%zmm10,%zmm0 + vpunpckhdq %zmm9,%zmm8,%zmm8 + vpunpckhdq %zmm11,%zmm10,%zmm10 + vpunpcklqdq %zmm0,%zmm6,%zmm9 + vpunpckhqdq %zmm0,%zmm6,%zmm6 + vpunpcklqdq %zmm10,%zmm8,%zmm11 + vpunpckhqdq %zmm10,%zmm8,%zmm8 + vpaddd %zmm28,%zmm12,%zmm12 + vpaddd %zmm29,%zmm13,%zmm13 + vpaddd %zmm30,%zmm14,%zmm14 + vpaddd %zmm31,%zmm15,%zmm15 + + vpunpckldq %zmm13,%zmm12,%zmm10 + vpunpckldq %zmm15,%zmm14,%zmm0 + vpunpckhdq %zmm13,%zmm12,%zmm12 + vpunpckhdq %zmm15,%zmm14,%zmm14 + vpunpcklqdq %zmm0,%zmm10,%zmm13 + 
vpunpckhqdq %zmm0,%zmm10,%zmm10 + vpunpcklqdq %zmm14,%zmm12,%zmm15 + vpunpckhqdq %zmm14,%zmm12,%zmm12 + vshufi32x4 $0x44,%zmm13,%zmm9,%zmm0 + vshufi32x4 $0xee,%zmm13,%zmm9,%zmm13 + vshufi32x4 $0x44,%zmm10,%zmm6,%zmm9 + vshufi32x4 $0xee,%zmm10,%zmm6,%zmm10 + vshufi32x4 $0x44,%zmm15,%zmm11,%zmm6 + vshufi32x4 $0xee,%zmm15,%zmm11,%zmm15 + vshufi32x4 $0x44,%zmm12,%zmm8,%zmm11 + vshufi32x4 $0xee,%zmm12,%zmm8,%zmm12 + vshufi32x4 $0x88,%zmm0,%zmm19,%zmm16 + vshufi32x4 $0xdd,%zmm0,%zmm19,%zmm19 + vshufi32x4 $0x88,%zmm13,%zmm5,%zmm0 + vshufi32x4 $0xdd,%zmm13,%zmm5,%zmm13 + vshufi32x4 $0x88,%zmm9,%zmm1,%zmm17 + vshufi32x4 $0xdd,%zmm9,%zmm1,%zmm1 + vshufi32x4 $0x88,%zmm10,%zmm2,%zmm9 + vshufi32x4 $0xdd,%zmm10,%zmm2,%zmm10 + vshufi32x4 $0x88,%zmm6,%zmm18,%zmm14 + vshufi32x4 $0xdd,%zmm6,%zmm18,%zmm18 + vshufi32x4 $0x88,%zmm15,%zmm7,%zmm6 + vshufi32x4 $0xdd,%zmm15,%zmm7,%zmm15 + vshufi32x4 $0x88,%zmm11,%zmm3,%zmm8 + vshufi32x4 $0xdd,%zmm11,%zmm3,%zmm3 + vshufi32x4 $0x88,%zmm12,%zmm4,%zmm11 + vshufi32x4 $0xdd,%zmm12,%zmm4,%zmm12 + cmpq $1024,%rdx + jb .Ltail16x + + vpxord 0(%rsi),%zmm16,%zmm16 + vpxord 64(%rsi),%zmm17,%zmm17 + vpxord 128(%rsi),%zmm14,%zmm14 + vpxord 192(%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm16,0(%rdi) + vmovdqu32 %zmm17,64(%rdi) + vmovdqu32 %zmm14,128(%rdi) + vmovdqu32 %zmm8,192(%rdi) + + vpxord 256(%rsi),%zmm19,%zmm19 + vpxord 320(%rsi),%zmm1,%zmm1 + vpxord 384(%rsi),%zmm18,%zmm18 + vpxord 448(%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm19,256(%rdi) + vmovdqu32 %zmm1,320(%rdi) + vmovdqu32 %zmm18,384(%rdi) + vmovdqu32 %zmm3,448(%rdi) + + vpxord 512(%rsi),%zmm0,%zmm0 + vpxord 576(%rsi),%zmm9,%zmm9 + vpxord 640(%rsi),%zmm6,%zmm6 + vpxord 704(%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm0,512(%rdi) + vmovdqu32 %zmm9,576(%rdi) + vmovdqu32 %zmm6,640(%rdi) + vmovdqu32 %zmm11,704(%rdi) + + vpxord 768(%rsi),%zmm13,%zmm13 + vpxord 832(%rsi),%zmm10,%zmm10 + vpxord 896(%rsi),%zmm15,%zmm15 + vpxord 960(%rsi),%zmm12,%zmm12 + leaq 1024(%rsi),%rsi + vmovdqu32 %zmm13,768(%rdi) + vmovdqu32 %zmm10,832(%rdi) + vmovdqu32 %zmm15,896(%rdi) + vmovdqu32 %zmm12,960(%rdi) + leaq 1024(%rdi),%rdi + + subq $1024,%rdx + jnz .Loop_outer16x + + jmp .Ldone16x + +.align 32 +.Ltail16x: + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm16,%zmm16 + vmovdqu32 %zmm16,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm17,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm17,%zmm17 + vmovdqu32 %zmm17,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm14,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm14,%zmm14 + vmovdqu32 %zmm14,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm8,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm8,%zmm8 + vmovdqu32 %zmm8,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm19,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm19,%zmm19 + vmovdqu32 %zmm19,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm1,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm1,%zmm1 + vmovdqu32 %zmm1,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm18,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm18,%zmm18 + vmovdqu32 %zmm18,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm3,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $512,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm3,%zmm3 + vmovdqu32 %zmm3,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm0,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $576,%rdx + jb 
.Less_than_64_16x + vpxord (%rsi),%zmm0,%zmm0 + vmovdqu32 %zmm0,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm9,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $640,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm9,%zmm9 + vmovdqu32 %zmm9,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm6,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $704,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm6,%zmm6 + vmovdqu32 %zmm6,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm11,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $768,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm11,%zmm11 + vmovdqu32 %zmm11,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm13,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $832,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm13,%zmm13 + vmovdqu32 %zmm13,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm10,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $896,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm10,%zmm10 + vmovdqu32 %zmm10,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm15,%zmm16 + leaq 64(%rsi),%rsi + + cmpq $960,%rdx + jb .Less_than_64_16x + vpxord (%rsi),%zmm15,%zmm15 + vmovdqu32 %zmm15,(%rdi,%rsi,1) + je .Ldone16x + vmovdqa32 %zmm12,%zmm16 + leaq 64(%rsi),%rsi + +.Less_than_64_16x: + vmovdqa32 %zmm16,0(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail16x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail16x + + vpxord %zmm16,%zmm16,%zmm16 + vmovdqa32 %zmm16,0(%rsp) + +.Ldone16x: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L16x_epilogue: + ret +.cfi_endproc +.size chacha20_16x,.-chacha20_16x +.global chacha20_8xvl +.type chacha20_8xvl,@function +.align 32 +chacha20_8xvl: +.cfi_startproc +.Lchacha20_8xvl: + movq %rsp,%r9 +.cfi_def_cfa_register %r9 + subq $64+8,%rsp + andq $-64,%rsp + vzeroupper + + leaq .Lsigma(%rip),%r10 + vbroadcasti128 (%r10),%ymm3 + vbroadcasti128 (%rcx),%ymm7 + vbroadcasti128 16(%rcx),%ymm11 + vbroadcasti128 (%r8),%ymm15 + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vpshufd $0xaa,%ymm3,%ymm2 + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpshufd $0xaa,%ymm7,%ymm6 + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa64 %ymm4,%ymm20 + vmovdqa64 %ymm5,%ymm21 + vmovdqa64 %ymm6,%ymm22 + vmovdqa64 %ymm7,%ymm23 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vpshufd $0xaa,%ymm11,%ymm10 + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa64 %ymm8,%ymm24 + vmovdqa64 %ymm9,%ymm25 + vmovdqa64 %ymm10,%ymm26 + vmovdqa64 %ymm11,%ymm27 + + vpshufd $0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vpshufd $0xaa,%ymm15,%ymm14 + vpshufd $0xff,%ymm15,%ymm15 + vpaddd .Lincy(%rip),%ymm12,%ymm12 + vmovdqa64 %ymm12,%ymm28 + vmovdqa64 %ymm13,%ymm29 + vmovdqa64 %ymm14,%ymm30 + vmovdqa64 %ymm15,%ymm31 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop_outer8xvl: + + + vpbroadcastd 8(%r10),%ymm2 + vpbroadcastd 12(%r10),%ymm3 + vpaddd .Leight(%rip),%ymm28,%ymm28 + vmovdqa64 %ymm20,%ymm4 + vmovdqa64 %ymm21,%ymm5 + vmovdqa64 %ymm22,%ymm6 + vmovdqa64 %ymm23,%ymm7 + vmovdqa64 %ymm24,%ymm8 + vmovdqa64 %ymm25,%ymm9 + vmovdqa64 %ymm26,%ymm10 + vmovdqa64 %ymm27,%ymm11 + vmovdqa64 %ymm28,%ymm12 + vmovdqa64 %ymm29,%ymm13 + vmovdqa64 %ymm30,%ymm14 + vmovdqa64 %ymm31,%ymm15 + + vmovdqa64 %ymm0,%ymm16 + vmovdqa64 %ymm1,%ymm17 + vmovdqa64 %ymm2,%ymm18 + vmovdqa64 %ymm3,%ymm19 + + movl $10,%eax + jmp .Loop8xvl + +.align 32 +.Loop8xvl: + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + 
vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vpaddd %ymm4,%ymm0,%ymm0 + vpaddd %ymm5,%ymm1,%ymm1 + vpaddd %ymm6,%ymm2,%ymm2 + vpaddd %ymm7,%ymm3,%ymm3 + vpxor %ymm0,%ymm12,%ymm12 + vpxor %ymm1,%ymm13,%ymm13 + vpxor %ymm2,%ymm14,%ymm14 + vpxor %ymm3,%ymm15,%ymm15 + vprold $8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vpaddd %ymm12,%ymm8,%ymm8 + vpaddd %ymm13,%ymm9,%ymm9 + vpaddd %ymm14,%ymm10,%ymm10 + vpaddd %ymm15,%ymm11,%ymm11 + vpxor %ymm8,%ymm4,%ymm4 + vpxor %ymm9,%ymm5,%ymm5 + vpxor %ymm10,%ymm6,%ymm6 + vpxor %ymm11,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $16,%ymm15,%ymm15 + vprold $16,%ymm12,%ymm12 + vprold $16,%ymm13,%ymm13 + vprold $16,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $12,%ymm5,%ymm5 + vprold $12,%ymm6,%ymm6 + vprold $12,%ymm7,%ymm7 + vprold $12,%ymm4,%ymm4 + vpaddd %ymm5,%ymm0,%ymm0 + vpaddd %ymm6,%ymm1,%ymm1 + vpaddd %ymm7,%ymm2,%ymm2 + vpaddd %ymm4,%ymm3,%ymm3 + vpxor %ymm0,%ymm15,%ymm15 + vpxor %ymm1,%ymm12,%ymm12 + vpxor %ymm2,%ymm13,%ymm13 + vpxor %ymm3,%ymm14,%ymm14 + vprold $8,%ymm15,%ymm15 + vprold $8,%ymm12,%ymm12 + vprold $8,%ymm13,%ymm13 + vprold $8,%ymm14,%ymm14 + vpaddd %ymm15,%ymm10,%ymm10 + vpaddd %ymm12,%ymm11,%ymm11 + vpaddd %ymm13,%ymm8,%ymm8 + vpaddd %ymm14,%ymm9,%ymm9 + vpxor %ymm10,%ymm5,%ymm5 + vpxor %ymm11,%ymm6,%ymm6 + vpxor %ymm8,%ymm7,%ymm7 + vpxor %ymm9,%ymm4,%ymm4 + vprold $7,%ymm5,%ymm5 + vprold $7,%ymm6,%ymm6 + vprold $7,%ymm7,%ymm7 + vprold $7,%ymm4,%ymm4 + decl %eax + jnz .Loop8xvl + + vpaddd %ymm16,%ymm0,%ymm0 + vpaddd %ymm17,%ymm1,%ymm1 + vpaddd %ymm18,%ymm2,%ymm2 + vpaddd %ymm19,%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm18 + vpunpckldq %ymm3,%ymm2,%ymm19 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm19,%ymm18,%ymm1 + vpunpckhqdq %ymm19,%ymm18,%ymm18 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vpaddd %ymm20,%ymm4,%ymm4 + vpaddd %ymm21,%ymm5,%ymm5 + vpaddd %ymm22,%ymm6,%ymm6 + vpaddd %ymm23,%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm2 + vpunpckldq %ymm7,%ymm6,%ymm19 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + vpunpcklqdq %ymm19,%ymm2,%ymm5 + vpunpckhqdq %ymm19,%ymm2,%ymm2 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vshufi32x4 $0,%ymm5,%ymm1,%ymm19 + vshufi32x4 $3,%ymm5,%ymm1,%ymm5 + vshufi32x4 $0,%ymm2,%ymm18,%ymm1 + vshufi32x4 $3,%ymm2,%ymm18,%ymm2 + vshufi32x4 $0,%ymm7,%ymm3,%ymm18 + vshufi32x4 $3,%ymm7,%ymm3,%ymm7 + vshufi32x4 $0,%ymm4,%ymm0,%ymm3 + vshufi32x4 
$3,%ymm4,%ymm0,%ymm4 + vpaddd %ymm24,%ymm8,%ymm8 + vpaddd %ymm25,%ymm9,%ymm9 + vpaddd %ymm26,%ymm10,%ymm10 + vpaddd %ymm27,%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm6 + vpunpckldq %ymm11,%ymm10,%ymm0 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm0,%ymm6,%ymm9 + vpunpckhqdq %ymm0,%ymm6,%ymm6 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd %ymm28,%ymm12,%ymm12 + vpaddd %ymm29,%ymm13,%ymm13 + vpaddd %ymm30,%ymm14,%ymm14 + vpaddd %ymm31,%ymm15,%ymm15 + + vpunpckldq %ymm13,%ymm12,%ymm10 + vpunpckldq %ymm15,%ymm14,%ymm0 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm0,%ymm10,%ymm13 + vpunpckhqdq %ymm0,%ymm10,%ymm10 + vpunpcklqdq %ymm14,%ymm12,%ymm15 + vpunpckhqdq %ymm14,%ymm12,%ymm12 + vperm2i128 $0x20,%ymm13,%ymm9,%ymm0 + vperm2i128 $0x31,%ymm13,%ymm9,%ymm13 + vperm2i128 $0x20,%ymm10,%ymm6,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm6,%ymm10 + vperm2i128 $0x20,%ymm15,%ymm11,%ymm6 + vperm2i128 $0x31,%ymm15,%ymm11,%ymm15 + vperm2i128 $0x20,%ymm12,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm12,%ymm8,%ymm12 + cmpq $512,%rdx + jb .Ltail8xvl + + movl $0x80,%eax + vpxord 0(%rsi),%ymm19,%ymm19 + vpxor 32(%rsi),%ymm0,%ymm0 + vpxor 64(%rsi),%ymm5,%ymm5 + vpxor 96(%rsi),%ymm13,%ymm13 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm19,0(%rdi) + vmovdqu %ymm0,32(%rdi) + vmovdqu %ymm5,64(%rdi) + vmovdqu %ymm13,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm2,%ymm2 + vpxor 96(%rsi),%ymm10,%ymm10 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm1,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm2,64(%rdi) + vmovdqu %ymm10,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + vpxor 64(%rsi),%ymm7,%ymm7 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq (%rsi,%rax,1),%rsi + vmovdqu32 %ymm18,0(%rdi) + vmovdqu %ymm6,32(%rdi) + vmovdqu %ymm7,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vpxor 64(%rsi),%ymm4,%ymm4 + vpxor 96(%rsi),%ymm12,%ymm12 + leaq (%rsi,%rax,1),%rsi + vmovdqu %ymm3,0(%rdi) + vmovdqu %ymm11,32(%rdi) + vmovdqu %ymm4,64(%rdi) + vmovdqu %ymm12,96(%rdi) + leaq (%rdi,%rax,1),%rdi + + vpbroadcastd 0(%r10),%ymm0 + vpbroadcastd 4(%r10),%ymm1 + + subq $512,%rdx + jnz .Loop_outer8xvl + + jmp .Ldone8xvl + +.align 32 +.Ltail8xvl: + vmovdqa64 %ymm19,%ymm8 + xorq %r10,%r10 + subq %rsi,%rdi + cmpq $64,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm8,%ymm8 + vpxor 32(%rsi),%ymm0,%ymm0 + vmovdqu %ymm8,0(%rdi,%rsi,1) + vmovdqu %ymm0,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm5,%ymm8 + vmovdqa %ymm13,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $128,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm5,%ymm5 + vpxor 32(%rsi),%ymm13,%ymm13 + vmovdqu %ymm5,0(%rdi,%rsi,1) + vmovdqu %ymm13,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm1,%ymm8 + vmovdqa %ymm9,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $192,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm1,%ymm1 + vpxor 32(%rsi),%ymm9,%ymm9 + vmovdqu %ymm1,0(%rdi,%rsi,1) + vmovdqu %ymm9,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm2,%ymm8 + vmovdqa %ymm10,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $256,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm2,%ymm2 + vpxor 32(%rsi),%ymm10,%ymm10 + vmovdqu %ymm2,0(%rdi,%rsi,1) + vmovdqu %ymm10,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa32 %ymm18,%ymm8 + vmovdqa %ymm6,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $320,%rdx + jb .Less_than_64_8xvl + vpxord 0(%rsi),%ymm18,%ymm18 + vpxor 32(%rsi),%ymm6,%ymm6 + 
vmovdqu32 %ymm18,0(%rdi,%rsi,1) + vmovdqu %ymm6,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm7,%ymm8 + vmovdqa %ymm15,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $384,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm7,%ymm7 + vpxor 32(%rsi),%ymm15,%ymm15 + vmovdqu %ymm7,0(%rdi,%rsi,1) + vmovdqu %ymm15,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm3,%ymm8 + vmovdqa %ymm11,%ymm0 + leaq 64(%rsi),%rsi + + cmpq $448,%rdx + jb .Less_than_64_8xvl + vpxor 0(%rsi),%ymm3,%ymm3 + vpxor 32(%rsi),%ymm11,%ymm11 + vmovdqu %ymm3,0(%rdi,%rsi,1) + vmovdqu %ymm11,32(%rdi,%rsi,1) + je .Ldone8xvl + vmovdqa %ymm4,%ymm8 + vmovdqa %ymm12,%ymm0 + leaq 64(%rsi),%rsi + +.Less_than_64_8xvl: + vmovdqa %ymm8,0(%rsp) + vmovdqa %ymm0,32(%rsp) + leaq (%rdi,%rsi,1),%rdi + andq $63,%rdx + +.Loop_tail8xvl: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz .Loop_tail8xvl + + vpxor %ymm8,%ymm8,%ymm8 + vmovdqa %ymm8,0(%rsp) + vmovdqa %ymm8,32(%rsp) + +.Ldone8xvl: + vzeroall + leaq (%r9),%rsp +.cfi_def_cfa_register %rsp +.L8xvl_epilogue: + ret +.cfi_endproc +.size chacha20_8xvl,.-chacha20_8xvl diff --git a/crypto/chacha20_x64_gas_macosx.s b/crypto/chacha20_x64_gas_macosx.s new file mode 100644 index 0000000..37b9175 --- /dev/null +++ b/crypto/chacha20_x64_gas_macosx.s @@ -0,0 +1,1388 @@ +.text + +.p2align 6 +L$zero: +.long 0,0,0,0 +L$one: +.long 1,0,0,0 +L$inc: +.long 0,1,2,3 +L$four: +.long 4,4,4,4 +L$incy: +.long 0,2,4,6,1,3,5,7 +L$eight: +.long 8,8,8,8,8,8,8,8 +L$rot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +L$rot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +L$sigma: +.byte 101,120,112,97,110,100,32,51,50,45,98,121,116,101,32,107,0 +.p2align 6 +L$zeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +L$fourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +L$incz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +L$sixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.p2align 6 +L$twoy: +.long 2,0,0,0, 2,0,0,0 + +.global _hchacha20_ssse3 + +.p2align 5 +_hchacha20_ssse3: + +L$hchacha20_ssse3: + movdqa L$sigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa L$rot16(%rip),%xmm6 + movdqa L$rot24(%rip),%xmm7 + movq $10,%r8 +.p2align 5 +L$oop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz L$oop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret + + +.global _chacha20_ssse3 + +.p2align 5 +_chacha20_ssse3: + +L$chacha20_ssse3: + movq %rsp,%r9 + + cmpq $128,%rdx + ja L$chacha20_4x + +L$do_sse3_after_all: + subq $64+8,%rsp + movdqa L$sigma(%rip),%xmm0 + movdqu (%rcx),%xmm1 + movdqu 16(%rcx),%xmm2 + movdqu (%r8),%xmm3 + movdqa 
L$rot16(%rip),%xmm6 + movdqa L$rot24(%rip),%xmm7 + + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + movq $10,%r8 + jmp L$oop_ssse3 + +.p2align 5 +L$oop_outer_ssse3: + movdqa L$one(%rip),%xmm3 + movdqa 0(%rsp),%xmm0 + movdqa 16(%rsp),%xmm1 + movdqa 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + movq $10,%r8 + movdqa %xmm3,48(%rsp) + jmp L$oop_ssse3 + +.p2align 5 +L$oop_ssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $57,%xmm1,%xmm1 + pshufd $147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $20,%xmm1 + pslld $12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld $25,%xmm1 + pslld $7,%xmm4 + por %xmm4,%xmm1 + pshufd $78,%xmm2,%xmm2 + pshufd $147,%xmm1,%xmm1 + pshufd $57,%xmm3,%xmm3 + decq %r8 + jnz L$oop_ssse3 + paddd 0(%rsp),%xmm0 + paddd 16(%rsp),%xmm1 + paddd 32(%rsp),%xmm2 + paddd 48(%rsp),%xmm3 + + cmpq $64,%rdx + jb L$tail_ssse3 + + movdqu 0(%rsi),%xmm4 + movdqu 16(%rsi),%xmm5 + pxor %xmm4,%xmm0 + movdqu 32(%rsi),%xmm4 + pxor %xmm5,%xmm1 + movdqu 48(%rsi),%xmm5 + leaq 64(%rsi),%rsi + pxor %xmm4,%xmm2 + pxor %xmm5,%xmm3 + + movdqu %xmm0,0(%rdi) + movdqu %xmm1,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm3,48(%rdi) + leaq 64(%rdi),%rdi + + subq $64,%rdx + jnz L$oop_outer_ssse3 + + jmp L$done_ssse3 + +.p2align 4 +L$tail_ssse3: + movdqa %xmm0,0(%rsp) + movdqa %xmm1,16(%rsp) + movdqa %xmm2,32(%rsp) + movdqa %xmm3,48(%rsp) + xorq %r8,%r8 + +L$oop_tail_ssse3: + movzbl (%rsi,%r8,1),%eax + movzbl (%rsp,%r8,1),%ecx + leaq 1(%r8),%r8 + xorl %ecx,%eax + movb %al,-1(%rdi,%r8,1) + decq %rdx + jnz L$oop_tail_ssse3 + +L$done_ssse3: + leaq (%r9),%rsp + +L$ssse3_epilogue: + ret + + +.global _chacha20_4x + +.p2align 5 +_chacha20_4x: + +L$chacha20_4x: + movq %rsp,%r9 + + + + + + + + + + + + +L$proceed4x: + subq $0x140+8,%rsp + movdqa L$sigma(%rip),%xmm11 + movdqu (%rcx),%xmm15 + movdqu 16(%rcx),%xmm7 + movdqu (%r8),%xmm3 + leaq 256(%rsp),%rcx + leaq L$rot16(%rip),%r10 + leaq L$rot24(%rip),%r11 + + pshufd $0x00,%xmm11,%xmm8 + pshufd $0x55,%xmm11,%xmm9 + movdqa %xmm8,64(%rsp) + pshufd $0xaa,%xmm11,%xmm10 + movdqa %xmm9,80(%rsp) + pshufd $0xff,%xmm11,%xmm11 + movdqa %xmm10,96(%rsp) + movdqa %xmm11,112(%rsp) + + pshufd $0x00,%xmm15,%xmm12 + pshufd $0x55,%xmm15,%xmm13 + movdqa %xmm12,128-256(%rcx) + pshufd $0xaa,%xmm15,%xmm14 + movdqa %xmm13,144-256(%rcx) + pshufd $0xff,%xmm15,%xmm15 + movdqa %xmm14,160-256(%rcx) + movdqa %xmm15,176-256(%rcx) + + pshufd $0x00,%xmm7,%xmm4 + pshufd $0x55,%xmm7,%xmm5 + movdqa %xmm4,192-256(%rcx) + pshufd $0xaa,%xmm7,%xmm6 + movdqa %xmm5,208-256(%rcx) + pshufd $0xff,%xmm7,%xmm7 + movdqa %xmm6,224-256(%rcx) + movdqa %xmm7,240-256(%rcx) + + pshufd $0x00,%xmm3,%xmm0 + pshufd $0x55,%xmm3,%xmm1 + paddd L$inc(%rip),%xmm0 + pshufd $0xaa,%xmm3,%xmm2 + movdqa %xmm1,272-256(%rcx) + pshufd $0xff,%xmm3,%xmm3 + movdqa %xmm2,288-256(%rcx) + movdqa %xmm3,304-256(%rcx) + + jmp L$oop_enter4x + +.p2align 5 +L$oop_outer4x: + movdqa 64(%rsp),%xmm8 + movdqa 80(%rsp),%xmm9 + movdqa 96(%rsp),%xmm10 + movdqa 112(%rsp),%xmm11 + 
movdqa 128-256(%rcx),%xmm12 + movdqa 144-256(%rcx),%xmm13 + movdqa 160-256(%rcx),%xmm14 + movdqa 176-256(%rcx),%xmm15 + movdqa 192-256(%rcx),%xmm4 + movdqa 208-256(%rcx),%xmm5 + movdqa 224-256(%rcx),%xmm6 + movdqa 240-256(%rcx),%xmm7 + movdqa 256-256(%rcx),%xmm0 + movdqa 272-256(%rcx),%xmm1 + movdqa 288-256(%rcx),%xmm2 + movdqa 304-256(%rcx),%xmm3 + paddd L$four(%rip),%xmm0 + +L$oop_enter4x: + movdqa %xmm6,32(%rsp) + movdqa %xmm7,48(%rsp) + movdqa (%r10),%xmm7 + movl $10,%eax + movdqa %xmm0,256-256(%rcx) + jmp L$oop4x + +.p2align 5 +L$oop4x: + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm7,%xmm0 + pshufb %xmm7,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm6 + pslld $12,%xmm12 + psrld $20,%xmm6 + movdqa %xmm13,%xmm7 + pslld $12,%xmm13 + por %xmm6,%xmm12 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm13 + paddd %xmm12,%xmm8 + paddd %xmm13,%xmm9 + pxor %xmm8,%xmm0 + pxor %xmm9,%xmm1 + pshufb %xmm6,%xmm0 + pshufb %xmm6,%xmm1 + paddd %xmm0,%xmm4 + paddd %xmm1,%xmm5 + pxor %xmm4,%xmm12 + pxor %xmm5,%xmm13 + movdqa %xmm12,%xmm7 + pslld $7,%xmm12 + psrld $25,%xmm7 + movdqa %xmm13,%xmm6 + pslld $7,%xmm13 + por %xmm7,%xmm12 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm13 + movdqa %xmm4,0(%rsp) + movdqa %xmm5,16(%rsp) + movdqa 32(%rsp),%xmm4 + movdqa 48(%rsp),%xmm5 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm7,%xmm2 + pshufb %xmm7,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm6 + pslld $12,%xmm14 + psrld $20,%xmm6 + movdqa %xmm15,%xmm7 + pslld $12,%xmm15 + por %xmm6,%xmm14 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm15 + paddd %xmm14,%xmm10 + paddd %xmm15,%xmm11 + pxor %xmm10,%xmm2 + pxor %xmm11,%xmm3 + pshufb %xmm6,%xmm2 + pshufb %xmm6,%xmm3 + paddd %xmm2,%xmm4 + paddd %xmm3,%xmm5 + pxor %xmm4,%xmm14 + pxor %xmm5,%xmm15 + movdqa %xmm14,%xmm7 + pslld $7,%xmm14 + psrld $25,%xmm7 + movdqa %xmm15,%xmm6 + pslld $7,%xmm15 + por %xmm7,%xmm14 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm15 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm7,%xmm3 + pshufb %xmm7,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm6 + pslld $12,%xmm13 + psrld $20,%xmm6 + movdqa %xmm14,%xmm7 + pslld $12,%xmm14 + por %xmm6,%xmm13 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm14 + paddd %xmm13,%xmm8 + paddd %xmm14,%xmm9 + pxor %xmm8,%xmm3 + pxor %xmm9,%xmm0 + pshufb %xmm6,%xmm3 + pshufb %xmm6,%xmm0 + paddd %xmm3,%xmm4 + paddd %xmm0,%xmm5 + pxor %xmm4,%xmm13 + pxor %xmm5,%xmm14 + movdqa %xmm13,%xmm7 + pslld $7,%xmm13 + psrld $25,%xmm7 + movdqa %xmm14,%xmm6 + pslld $7,%xmm14 + por %xmm7,%xmm13 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm14 + movdqa %xmm4,32(%rsp) + movdqa %xmm5,48(%rsp) + movdqa 0(%rsp),%xmm4 + movdqa 16(%rsp),%xmm5 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm7,%xmm1 + pshufb %xmm7,%xmm2 + paddd %xmm1,%xmm4 + paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm6 + pslld $12,%xmm15 + psrld $20,%xmm6 + movdqa %xmm12,%xmm7 + pslld $12,%xmm12 + por %xmm6,%xmm15 + psrld $20,%xmm7 + movdqa (%r11),%xmm6 + por %xmm7,%xmm12 + paddd %xmm15,%xmm10 + paddd %xmm12,%xmm11 + pxor %xmm10,%xmm1 + pxor %xmm11,%xmm2 + pshufb %xmm6,%xmm1 + pshufb %xmm6,%xmm2 + paddd %xmm1,%xmm4 + 
paddd %xmm2,%xmm5 + pxor %xmm4,%xmm15 + pxor %xmm5,%xmm12 + movdqa %xmm15,%xmm7 + pslld $7,%xmm15 + psrld $25,%xmm7 + movdqa %xmm12,%xmm6 + pslld $7,%xmm12 + por %xmm7,%xmm15 + psrld $25,%xmm6 + movdqa (%r10),%xmm7 + por %xmm6,%xmm12 + decl %eax + jnz L$oop4x + + paddd 64(%rsp),%xmm8 + paddd 80(%rsp),%xmm9 + paddd 96(%rsp),%xmm10 + paddd 112(%rsp),%xmm11 + + movdqa %xmm8,%xmm6 + punpckldq %xmm9,%xmm8 + movdqa %xmm10,%xmm7 + punpckldq %xmm11,%xmm10 + punpckhdq %xmm9,%xmm6 + punpckhdq %xmm11,%xmm7 + movdqa %xmm8,%xmm9 + punpcklqdq %xmm10,%xmm8 + movdqa %xmm6,%xmm11 + punpcklqdq %xmm7,%xmm6 + punpckhqdq %xmm10,%xmm9 + punpckhqdq %xmm7,%xmm11 + paddd 128-256(%rcx),%xmm12 + paddd 144-256(%rcx),%xmm13 + paddd 160-256(%rcx),%xmm14 + paddd 176-256(%rcx),%xmm15 + + movdqa %xmm8,0(%rsp) + movdqa %xmm9,16(%rsp) + movdqa 32(%rsp),%xmm8 + movdqa 48(%rsp),%xmm9 + + movdqa %xmm12,%xmm10 + punpckldq %xmm13,%xmm12 + movdqa %xmm14,%xmm7 + punpckldq %xmm15,%xmm14 + punpckhdq %xmm13,%xmm10 + punpckhdq %xmm15,%xmm7 + movdqa %xmm12,%xmm13 + punpcklqdq %xmm14,%xmm12 + movdqa %xmm10,%xmm15 + punpcklqdq %xmm7,%xmm10 + punpckhqdq %xmm14,%xmm13 + punpckhqdq %xmm7,%xmm15 + paddd 192-256(%rcx),%xmm4 + paddd 208-256(%rcx),%xmm5 + paddd 224-256(%rcx),%xmm8 + paddd 240-256(%rcx),%xmm9 + + movdqa %xmm6,32(%rsp) + movdqa %xmm11,48(%rsp) + + movdqa %xmm4,%xmm14 + punpckldq %xmm5,%xmm4 + movdqa %xmm8,%xmm7 + punpckldq %xmm9,%xmm8 + punpckhdq %xmm5,%xmm14 + punpckhdq %xmm9,%xmm7 + movdqa %xmm4,%xmm5 + punpcklqdq %xmm8,%xmm4 + movdqa %xmm14,%xmm9 + punpcklqdq %xmm7,%xmm14 + punpckhqdq %xmm8,%xmm5 + punpckhqdq %xmm7,%xmm9 + paddd 256-256(%rcx),%xmm0 + paddd 272-256(%rcx),%xmm1 + paddd 288-256(%rcx),%xmm2 + paddd 304-256(%rcx),%xmm3 + + movdqa %xmm0,%xmm8 + punpckldq %xmm1,%xmm0 + movdqa %xmm2,%xmm7 + punpckldq %xmm3,%xmm2 + punpckhdq %xmm1,%xmm8 + punpckhdq %xmm3,%xmm7 + movdqa %xmm0,%xmm1 + punpcklqdq %xmm2,%xmm0 + movdqa %xmm8,%xmm3 + punpcklqdq %xmm7,%xmm8 + punpckhqdq %xmm2,%xmm1 + punpckhqdq %xmm7,%xmm3 + cmpq $256,%rdx + jb L$tail4x + + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 48(%rsp),%xmm6 + pxor %xmm15,%xmm11 + pxor %xmm9,%xmm2 + pxor %xmm3,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + + subq $256,%rdx + jnz L$oop_outer4x + + jmp L$done4x + +L$tail4x: + cmpq $192,%rdx + jae L$192_or_more4x + cmpq $128,%rdx + jae L$128_or_more4x + cmpq $64,%rdx + jae L$64_or_more4x + + + xorq %r10,%r10 + + movdqa %xmm12,16(%rsp) + movdqa %xmm4,32(%rsp) + movdqa %xmm0,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 
+L$64_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je L$done4x + + movdqa 16(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm13,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm5,32(%rsp) + subq $64,%rdx + movdqa %xmm1,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 +L$128_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + movdqu %xmm6,64(%rdi) + movdqu %xmm11,80(%rdi) + movdqu %xmm2,96(%rdi) + movdqu %xmm7,112(%rdi) + je L$done4x + + movdqa 32(%rsp),%xmm6 + leaq 128(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm10,16(%rsp) + leaq 128(%rdi),%rdi + movdqa %xmm14,32(%rsp) + subq $128,%rdx + movdqa %xmm8,48(%rsp) + jmp L$oop_tail4x + +.p2align 5 +L$192_or_more4x: + movdqu 0(%rsi),%xmm6 + movdqu 16(%rsi),%xmm11 + movdqu 32(%rsi),%xmm2 + movdqu 48(%rsi),%xmm7 + pxor 0(%rsp),%xmm6 + pxor %xmm12,%xmm11 + pxor %xmm4,%xmm2 + pxor %xmm0,%xmm7 + + movdqu %xmm6,0(%rdi) + movdqu 64(%rsi),%xmm6 + movdqu %xmm11,16(%rdi) + movdqu 80(%rsi),%xmm11 + movdqu %xmm2,32(%rdi) + movdqu 96(%rsi),%xmm2 + movdqu %xmm7,48(%rdi) + movdqu 112(%rsi),%xmm7 + leaq 128(%rsi),%rsi + pxor 16(%rsp),%xmm6 + pxor %xmm13,%xmm11 + pxor %xmm5,%xmm2 + pxor %xmm1,%xmm7 + + movdqu %xmm6,64(%rdi) + movdqu 0(%rsi),%xmm6 + movdqu %xmm11,80(%rdi) + movdqu 16(%rsi),%xmm11 + movdqu %xmm2,96(%rdi) + movdqu 32(%rsi),%xmm2 + movdqu %xmm7,112(%rdi) + leaq 128(%rdi),%rdi + movdqu 48(%rsi),%xmm7 + pxor 32(%rsp),%xmm6 + pxor %xmm10,%xmm11 + pxor %xmm14,%xmm2 + pxor %xmm8,%xmm7 + movdqu %xmm6,0(%rdi) + movdqu %xmm11,16(%rdi) + movdqu %xmm2,32(%rdi) + movdqu %xmm7,48(%rdi) + je L$done4x + + movdqa 48(%rsp),%xmm6 + leaq 64(%rsi),%rsi + xorq %r10,%r10 + movdqa %xmm6,0(%rsp) + movdqa %xmm15,16(%rsp) + leaq 64(%rdi),%rdi + movdqa %xmm9,32(%rsp) + subq $192,%rdx + movdqa %xmm3,48(%rsp) + +L$oop_tail4x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz L$oop_tail4x + +L$done4x: + leaq (%r9),%rsp + +L$4x_epilogue: + ret + + +.global _chacha20_avx2 + +.p2align 5 +_chacha20_avx2: + +L$chacha20_avx2: + movq %rsp,%r9 + + subq $0x280+8,%rsp + andq $-32,%rsp + vzeroupper + + vbroadcasti128 L$sigma(%rip),%ymm11 + vbroadcasti128 (%rcx),%ymm3 + vbroadcasti128 16(%rcx),%ymm15 + vbroadcasti128 (%r8),%ymm7 + leaq 256(%rsp),%rcx + leaq 512(%rsp),%rax + leaq L$rot16(%rip),%r10 + leaq L$rot24(%rip),%r11 + + vpshufd $0x00,%ymm11,%ymm8 + vpshufd $0x55,%ymm11,%ymm9 + vmovdqa %ymm8,128-256(%rcx) + vpshufd $0xaa,%ymm11,%ymm10 + vmovdqa %ymm9,160-256(%rcx) + vpshufd $0xff,%ymm11,%ymm11 + vmovdqa %ymm10,192-256(%rcx) + vmovdqa %ymm11,224-256(%rcx) + + vpshufd $0x00,%ymm3,%ymm0 + vpshufd $0x55,%ymm3,%ymm1 + vmovdqa %ymm0,256-256(%rcx) + vpshufd $0xaa,%ymm3,%ymm2 + vmovdqa %ymm1,288-256(%rcx) + vpshufd $0xff,%ymm3,%ymm3 + vmovdqa %ymm2,320-256(%rcx) + vmovdqa %ymm3,352-256(%rcx) + + vpshufd 
$0x00,%ymm15,%ymm12 + vpshufd $0x55,%ymm15,%ymm13 + vmovdqa %ymm12,384-512(%rax) + vpshufd $0xaa,%ymm15,%ymm14 + vmovdqa %ymm13,416-512(%rax) + vpshufd $0xff,%ymm15,%ymm15 + vmovdqa %ymm14,448-512(%rax) + vmovdqa %ymm15,480-512(%rax) + + vpshufd $0x00,%ymm7,%ymm4 + vpshufd $0x55,%ymm7,%ymm5 + vpaddd L$incy(%rip),%ymm4,%ymm4 + vpshufd $0xaa,%ymm7,%ymm6 + vmovdqa %ymm5,544-512(%rax) + vpshufd $0xff,%ymm7,%ymm7 + vmovdqa %ymm6,576-512(%rax) + vmovdqa %ymm7,608-512(%rax) + + jmp L$oop_enter8x + +.p2align 5 +L$oop_outer8x: + vmovdqa 128-256(%rcx),%ymm8 + vmovdqa 160-256(%rcx),%ymm9 + vmovdqa 192-256(%rcx),%ymm10 + vmovdqa 224-256(%rcx),%ymm11 + vmovdqa 256-256(%rcx),%ymm0 + vmovdqa 288-256(%rcx),%ymm1 + vmovdqa 320-256(%rcx),%ymm2 + vmovdqa 352-256(%rcx),%ymm3 + vmovdqa 384-512(%rax),%ymm12 + vmovdqa 416-512(%rax),%ymm13 + vmovdqa 448-512(%rax),%ymm14 + vmovdqa 480-512(%rax),%ymm15 + vmovdqa 512-512(%rax),%ymm4 + vmovdqa 544-512(%rax),%ymm5 + vmovdqa 576-512(%rax),%ymm6 + vmovdqa 608-512(%rax),%ymm7 + vpaddd L$eight(%rip),%ymm4,%ymm4 + +L$oop_enter8x: + vmovdqa %ymm14,64(%rsp) + vmovdqa %ymm15,96(%rsp) + vbroadcasti128 (%r10),%ymm15 + vmovdqa %ymm4,512-512(%rax) + movl $10,%eax + jmp L$oop8x + +.p2align 5 +L$oop8x: + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $12,%ymm0,%ymm14 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $12,%ymm1,%ymm15 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vpaddd %ymm0,%ymm8,%ymm8 + vpxor %ymm4,%ymm8,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm1,%ymm9,%ymm9 + vpxor %ymm5,%ymm9,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm4,%ymm12,%ymm12 + vpxor %ymm0,%ymm12,%ymm0 + vpslld $7,%ymm0,%ymm15 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm5,%ymm13,%ymm13 + vpxor %ymm1,%ymm13,%ymm1 + vpslld $7,%ymm1,%ymm14 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vmovdqa %ymm12,0(%rsp) + vmovdqa %ymm13,32(%rsp) + vmovdqa 64(%rsp),%ymm12 + vmovdqa 96(%rsp),%ymm13 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $12,%ymm2,%ymm14 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $12,%ymm3,%ymm15 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vpaddd %ymm2,%ymm10,%ymm10 + vpxor %ymm6,%ymm10,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm3,%ymm11,%ymm11 + vpxor %ymm7,%ymm11,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm6,%ymm12,%ymm12 + vpxor %ymm2,%ymm12,%ymm2 + vpslld $7,%ymm2,%ymm15 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm7,%ymm13,%ymm13 + vpxor %ymm3,%ymm13,%ymm3 + vpslld $7,%ymm3,%ymm14 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm15,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm15,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $12,%ymm1,%ymm14 + vpsrld $20,%ymm1,%ymm1 + vpor %ymm1,%ymm14,%ymm1 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor 
%ymm2,%ymm13,%ymm2 + vpslld $12,%ymm2,%ymm15 + vpsrld $20,%ymm2,%ymm2 + vpor %ymm2,%ymm15,%ymm2 + vpaddd %ymm1,%ymm8,%ymm8 + vpxor %ymm7,%ymm8,%ymm7 + vpshufb %ymm14,%ymm7,%ymm7 + vpaddd %ymm2,%ymm9,%ymm9 + vpxor %ymm4,%ymm9,%ymm4 + vpshufb %ymm14,%ymm4,%ymm4 + vpaddd %ymm7,%ymm12,%ymm12 + vpxor %ymm1,%ymm12,%ymm1 + vpslld $7,%ymm1,%ymm15 + vpsrld $25,%ymm1,%ymm1 + vpor %ymm1,%ymm15,%ymm1 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm4,%ymm13,%ymm13 + vpxor %ymm2,%ymm13,%ymm2 + vpslld $7,%ymm2,%ymm14 + vpsrld $25,%ymm2,%ymm2 + vpor %ymm2,%ymm14,%ymm2 + vmovdqa %ymm12,64(%rsp) + vmovdqa %ymm13,96(%rsp) + vmovdqa 0(%rsp),%ymm12 + vmovdqa 32(%rsp),%ymm13 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm15,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm15,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $12,%ymm3,%ymm14 + vpsrld $20,%ymm3,%ymm3 + vpor %ymm3,%ymm14,%ymm3 + vbroadcasti128 (%r11),%ymm14 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $12,%ymm0,%ymm15 + vpsrld $20,%ymm0,%ymm0 + vpor %ymm0,%ymm15,%ymm0 + vpaddd %ymm3,%ymm10,%ymm10 + vpxor %ymm5,%ymm10,%ymm5 + vpshufb %ymm14,%ymm5,%ymm5 + vpaddd %ymm0,%ymm11,%ymm11 + vpxor %ymm6,%ymm11,%ymm6 + vpshufb %ymm14,%ymm6,%ymm6 + vpaddd %ymm5,%ymm12,%ymm12 + vpxor %ymm3,%ymm12,%ymm3 + vpslld $7,%ymm3,%ymm15 + vpsrld $25,%ymm3,%ymm3 + vpor %ymm3,%ymm15,%ymm3 + vbroadcasti128 (%r10),%ymm15 + vpaddd %ymm6,%ymm13,%ymm13 + vpxor %ymm0,%ymm13,%ymm0 + vpslld $7,%ymm0,%ymm14 + vpsrld $25,%ymm0,%ymm0 + vpor %ymm0,%ymm14,%ymm0 + decl %eax + jnz L$oop8x + + leaq 512(%rsp),%rax + vpaddd 128-256(%rcx),%ymm8,%ymm8 + vpaddd 160-256(%rcx),%ymm9,%ymm9 + vpaddd 192-256(%rcx),%ymm10,%ymm10 + vpaddd 224-256(%rcx),%ymm11,%ymm11 + + vpunpckldq %ymm9,%ymm8,%ymm14 + vpunpckldq %ymm11,%ymm10,%ymm15 + vpunpckhdq %ymm9,%ymm8,%ymm8 + vpunpckhdq %ymm11,%ymm10,%ymm10 + vpunpcklqdq %ymm15,%ymm14,%ymm9 + vpunpckhqdq %ymm15,%ymm14,%ymm14 + vpunpcklqdq %ymm10,%ymm8,%ymm11 + vpunpckhqdq %ymm10,%ymm8,%ymm8 + vpaddd 256-256(%rcx),%ymm0,%ymm0 + vpaddd 288-256(%rcx),%ymm1,%ymm1 + vpaddd 320-256(%rcx),%ymm2,%ymm2 + vpaddd 352-256(%rcx),%ymm3,%ymm3 + + vpunpckldq %ymm1,%ymm0,%ymm10 + vpunpckldq %ymm3,%ymm2,%ymm15 + vpunpckhdq %ymm1,%ymm0,%ymm0 + vpunpckhdq %ymm3,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm10,%ymm1 + vpunpckhqdq %ymm15,%ymm10,%ymm10 + vpunpcklqdq %ymm2,%ymm0,%ymm3 + vpunpckhqdq %ymm2,%ymm0,%ymm0 + vperm2i128 $0x20,%ymm1,%ymm9,%ymm15 + vperm2i128 $0x31,%ymm1,%ymm9,%ymm1 + vperm2i128 $0x20,%ymm10,%ymm14,%ymm9 + vperm2i128 $0x31,%ymm10,%ymm14,%ymm10 + vperm2i128 $0x20,%ymm3,%ymm11,%ymm14 + vperm2i128 $0x31,%ymm3,%ymm11,%ymm3 + vperm2i128 $0x20,%ymm0,%ymm8,%ymm11 + vperm2i128 $0x31,%ymm0,%ymm8,%ymm0 + vmovdqa %ymm15,0(%rsp) + vmovdqa %ymm9,32(%rsp) + vmovdqa 64(%rsp),%ymm15 + vmovdqa 96(%rsp),%ymm9 + + vpaddd 384-512(%rax),%ymm12,%ymm12 + vpaddd 416-512(%rax),%ymm13,%ymm13 + vpaddd 448-512(%rax),%ymm15,%ymm15 + vpaddd 480-512(%rax),%ymm9,%ymm9 + + vpunpckldq %ymm13,%ymm12,%ymm2 + vpunpckldq %ymm9,%ymm15,%ymm8 + vpunpckhdq %ymm13,%ymm12,%ymm12 + vpunpckhdq %ymm9,%ymm15,%ymm15 + vpunpcklqdq %ymm8,%ymm2,%ymm13 + vpunpckhqdq %ymm8,%ymm2,%ymm2 + vpunpcklqdq %ymm15,%ymm12,%ymm9 + vpunpckhqdq %ymm15,%ymm12,%ymm12 + vpaddd 512-512(%rax),%ymm4,%ymm4 + vpaddd 544-512(%rax),%ymm5,%ymm5 + vpaddd 576-512(%rax),%ymm6,%ymm6 + vpaddd 608-512(%rax),%ymm7,%ymm7 + + vpunpckldq %ymm5,%ymm4,%ymm15 + vpunpckldq %ymm7,%ymm6,%ymm8 + vpunpckhdq %ymm5,%ymm4,%ymm4 + vpunpckhdq %ymm7,%ymm6,%ymm6 + 
vpunpcklqdq %ymm8,%ymm15,%ymm5 + vpunpckhqdq %ymm8,%ymm15,%ymm15 + vpunpcklqdq %ymm6,%ymm4,%ymm7 + vpunpckhqdq %ymm6,%ymm4,%ymm4 + vperm2i128 $0x20,%ymm5,%ymm13,%ymm8 + vperm2i128 $0x31,%ymm5,%ymm13,%ymm5 + vperm2i128 $0x20,%ymm15,%ymm2,%ymm13 + vperm2i128 $0x31,%ymm15,%ymm2,%ymm15 + vperm2i128 $0x20,%ymm7,%ymm9,%ymm2 + vperm2i128 $0x31,%ymm7,%ymm9,%ymm7 + vperm2i128 $0x20,%ymm4,%ymm12,%ymm9 + vperm2i128 $0x31,%ymm4,%ymm12,%ymm4 + vmovdqa 0(%rsp),%ymm6 + vmovdqa 32(%rsp),%ymm12 + + cmpq $512,%rdx + jb L$tail8x + + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + leaq 128(%rsi),%rsi + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm12,%ymm12 + vpxor 32(%rsi),%ymm13,%ymm13 + vpxor 64(%rsi),%ymm10,%ymm10 + vpxor 96(%rsi),%ymm15,%ymm15 + leaq 128(%rsi),%rsi + vmovdqu %ymm12,0(%rdi) + vmovdqu %ymm13,32(%rdi) + vmovdqu %ymm10,64(%rdi) + vmovdqu %ymm15,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm14,%ymm14 + vpxor 32(%rsi),%ymm2,%ymm2 + vpxor 64(%rsi),%ymm3,%ymm3 + vpxor 96(%rsi),%ymm7,%ymm7 + leaq 128(%rsi),%rsi + vmovdqu %ymm14,0(%rdi) + vmovdqu %ymm2,32(%rdi) + vmovdqu %ymm3,64(%rdi) + vmovdqu %ymm7,96(%rdi) + leaq 128(%rdi),%rdi + + vpxor 0(%rsi),%ymm11,%ymm11 + vpxor 32(%rsi),%ymm9,%ymm9 + vpxor 64(%rsi),%ymm0,%ymm0 + vpxor 96(%rsi),%ymm4,%ymm4 + leaq 128(%rsi),%rsi + vmovdqu %ymm11,0(%rdi) + vmovdqu %ymm9,32(%rdi) + vmovdqu %ymm0,64(%rdi) + vmovdqu %ymm4,96(%rdi) + leaq 128(%rdi),%rdi + + subq $512,%rdx + jnz L$oop_outer8x + + jmp L$done8x + +L$tail8x: + cmpq $448,%rdx + jae L$448_or_more8x + cmpq $384,%rdx + jae L$384_or_more8x + cmpq $320,%rdx + jae L$320_or_more8x + cmpq $256,%rdx + jae L$256_or_more8x + cmpq $192,%rdx + jae L$192_or_more8x + cmpq $128,%rdx + jae L$128_or_more8x + cmpq $64,%rdx + jae L$64_or_more8x + + xorq %r10,%r10 + vmovdqa %ymm6,0(%rsp) + vmovdqa %ymm8,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$64_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + je L$done8x + + leaq 64(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm1,0(%rsp) + leaq 64(%rdi),%rdi + subq $64,%rdx + vmovdqa %ymm5,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$128_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + je L$done8x + + leaq 128(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm12,0(%rsp) + leaq 128(%rdi),%rdi + subq $128,%rdx + vmovdqa %ymm13,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$192_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + je L$done8x + + leaq 192(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm10,0(%rsp) + leaq 192(%rdi),%rdi + subq $192,%rdx + vmovdqa %ymm15,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$256_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vmovdqu %ymm6,0(%rdi) + vmovdqu 
%ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + je L$done8x + + leaq 256(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm14,0(%rsp) + leaq 256(%rdi),%rdi + subq $256,%rdx + vmovdqa %ymm2,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$320_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + je L$done8x + + leaq 320(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm3,0(%rsp) + leaq 320(%rdi),%rdi + subq $320,%rdx + vmovdqa %ymm7,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$384_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + je L$done8x + + leaq 384(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm11,0(%rsp) + leaq 384(%rdi),%rdi + subq $384,%rdx + vmovdqa %ymm9,32(%rsp) + jmp L$oop_tail8x + +.p2align 5 +L$448_or_more8x: + vpxor 0(%rsi),%ymm6,%ymm6 + vpxor 32(%rsi),%ymm8,%ymm8 + vpxor 64(%rsi),%ymm1,%ymm1 + vpxor 96(%rsi),%ymm5,%ymm5 + vpxor 128(%rsi),%ymm12,%ymm12 + vpxor 160(%rsi),%ymm13,%ymm13 + vpxor 192(%rsi),%ymm10,%ymm10 + vpxor 224(%rsi),%ymm15,%ymm15 + vpxor 256(%rsi),%ymm14,%ymm14 + vpxor 288(%rsi),%ymm2,%ymm2 + vpxor 320(%rsi),%ymm3,%ymm3 + vpxor 352(%rsi),%ymm7,%ymm7 + vpxor 384(%rsi),%ymm11,%ymm11 + vpxor 416(%rsi),%ymm9,%ymm9 + vmovdqu %ymm6,0(%rdi) + vmovdqu %ymm8,32(%rdi) + vmovdqu %ymm1,64(%rdi) + vmovdqu %ymm5,96(%rdi) + vmovdqu %ymm12,128(%rdi) + vmovdqu %ymm13,160(%rdi) + vmovdqu %ymm10,192(%rdi) + vmovdqu %ymm15,224(%rdi) + vmovdqu %ymm14,256(%rdi) + vmovdqu %ymm2,288(%rdi) + vmovdqu %ymm3,320(%rdi) + vmovdqu %ymm7,352(%rdi) + vmovdqu %ymm11,384(%rdi) + vmovdqu %ymm9,416(%rdi) + je L$done8x + + leaq 448(%rsi),%rsi + xorq %r10,%r10 + vmovdqa %ymm0,0(%rsp) + leaq 448(%rdi),%rdi + subq $448,%rdx + vmovdqa %ymm4,32(%rsp) + +L$oop_tail8x: + movzbl (%rsi,%r10,1),%eax + movzbl (%rsp,%r10,1),%ecx + leaq 1(%r10),%r10 + xorl %ecx,%eax + movb %al,-1(%rdi,%r10,1) + decq %rdx + jnz L$oop_tail8x + +L$done8x: + vzeroall + leaq (%r9),%rsp + +L$8x_epilogue: + ret + + diff --git a/crypto/chacha20poly1305.cpp b/crypto/chacha20poly1305.cpp new file mode 100644 index 0000000..a5c222d --- /dev/null +++ b/crypto/chacha20poly1305.cpp @@ -0,0 +1,596 @@ +/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0) + * + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + * Copyright 2016 The OpenSSL Project Authors. 
All Rights Reserved. + */ + +#include "stdafx.h" +#include "crypto/chacha20poly1305.h" +#include "tunsafe_types.h" +#include "tunsafe_endian.h" +#include "build_config.h" +#include "tunsafe_cpu.h" +#include "crypto_ops.h" +#include <string.h> +#include <assert.h> + +enum { + CHACHA20_IV_SIZE = 16, + CHACHA20_KEY_SIZE = 32, + CHACHA20_BLOCK_SIZE = 64, + POLY1305_BLOCK_SIZE = 16, + POLY1305_KEY_SIZE = 32, + POLY1305_MAC_SIZE = 16 +}; + + +#if defined(OS_MACOSX) || !WITH_AVX512_OPTIMIZATIONS +#define CHACHA20_WITH_AVX512 0 +#else +#define CHACHA20_WITH_AVX512 1 +#endif + +extern "C" { +void _cdecl hchacha20_ssse3(uint8 *derived_key, const uint8 *nonce, const uint8 *key); +void _cdecl chacha20_ssse3(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx2(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx512(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl chacha20_avx512vl(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]); +void _cdecl poly1305_init_x86_64(void *ctx, const uint8 key[16]); +void _cdecl poly1305_blocks_x86_64(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_emit_x86_64(void *ctx, uint8 mac[16], const uint32 nonce[4]); +void _cdecl poly1305_emit_avx(void *ctx, uint8 mac[16], const uint32 nonce[4]); +void _cdecl poly1305_blocks_avx(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_blocks_avx2(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +void _cdecl poly1305_blocks_avx512(void *ctx, const uint8 *inp, size_t len, uint32 padbit); +} + +struct chacha20_ctx { + uint32 state[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; +}; + +void crypto_xor(uint8 *dst, const uint8 *src, size_t n) { + for (; n >= 4; n -= 4, dst += 4, src += 4) + *(uint32*)dst ^= *(uint32*)src; + for (; n; n--) + *dst++ ^= *src++; +} + +int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n) { + int rv = 0; + for (; n >= 4; n -= 4, a += 4, b += 4) + rv |= *(uint32*)a ^ *(uint32*)b; + for (; n; n--) + rv |= *a++ ^ *b++; + return rv; +} + +#define QUARTER_ROUND(x, a, b, c, d) ( \ + x[a] += x[b], \ + x[d] = rol32((x[d] ^ x[a]), 16), \ + x[c] += x[d], \ + x[b] = rol32((x[b] ^ x[c]), 12), \ + x[a] += x[b], \ + x[d] = rol32((x[d] ^ x[a]), 8), \ + x[c] += x[d], \ + x[b] = rol32((x[b] ^ x[c]), 7) \ +) + +#define C(i, j) (i * 4 + j) + +#define DOUBLE_ROUND(x) ( \ + /* Column Round */ \ + QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \ + QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \ + QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \ + QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \ + /* Diagonal Round */ \ + QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \ + QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \ + QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \ + QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \ +) + +#define TWENTY_ROUNDS(x) ( \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x), \ + DOUBLE_ROUND(x) \ +) + +SAFEBUFFERS static void chacha20_block_generic(struct chacha20_ctx *ctx, uint32 *stream) +{ + uint32 x[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; + int i; + + for (i = 0; i < ARRAY_SIZE(x); ++i) + x[i] = ctx->state[i]; + + TWENTY_ROUNDS(x); + + for (i = 0; i < ARRAY_SIZE(x);
++i) + stream[i] = ToLE32(x[i] + ctx->state[i]); + + ++ctx->state[12]; +} + +SAFEBUFFERS static void hchacha20_generic(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ + uint32 *out = (uint32 *)derived_key; + uint32 x[] = { + 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, + ReadLE32(key + 0), ReadLE32(key + 4), ReadLE32(key + 8), ReadLE32(key + 12), + ReadLE32(key + 16), ReadLE32(key + 20), ReadLE32(key + 24), ReadLE32(key + 28), + ReadLE32(nonce + 0), ReadLE32(nonce + 4), ReadLE32(nonce + 8), ReadLE32(nonce + 12) + }; + + TWENTY_ROUNDS(x); + + out[0] = ToLE32(x[0]); + out[1] = ToLE32(x[1]); + out[2] = ToLE32(x[2]); + out[3] = ToLE32(x[3]); + out[4] = ToLE32(x[12]); + out[5] = ToLE32(x[13]); + out[6] = ToLE32(x[14]); + out[7] = ToLE32(x[15]); +} + +static inline void hchacha20(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ +#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) + if (X86_PCAP_SSSE3) { + hchacha20_ssse3(derived_key, nonce, key); + return; + } +#endif // defined(ARCH_CPU_X86_64) + hchacha20_generic(derived_key, nonce, key); +} + +#define chacha20_initial_state(key, nonce) {{ \ + 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, \ + ReadLE32((key) + 0), ReadLE32((key) + 4), ReadLE32((key) + 8), ReadLE32((key) + 12), \ + ReadLE32((key) + 16), ReadLE32((key) + 20), ReadLE32((key) + 24), ReadLE32((key) + 28), \ + 0, 0, ReadLE32((nonce) + 0), ReadLE32((nonce) + 4) \ +}} + +SAFEBUFFERS static void chacha20_crypt(struct chacha20_ctx *ctx, uint8 *dst, const uint8 *src, uint32 bytes) +{ + uint32 buf[CHACHA20_BLOCK_SIZE / sizeof(uint32)]; + + if (bytes == 0) + return; + +#if defined(ARCH_CPU_X86_64) +#if CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX512F) { + chacha20_avx512(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } + if (X86_PCAP_AVX512VL) { + chacha20_avx512vl(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } +#endif // CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX2) { + chacha20_avx2(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } + if (X86_PCAP_SSSE3) { + assert(bytes); + chacha20_ssse3(dst, src, bytes, &ctx->state[4], &ctx->state[12]); + ctx->state[12] += (bytes + 63) / 64; + return; + } +#endif // defined(ARCH_CPU_X86_64) + + if (dst != src) + memcpy(dst, src, bytes); + + while (bytes >= CHACHA20_BLOCK_SIZE) { + chacha20_block_generic(ctx, buf); + crypto_xor(dst, (uint8 *)buf, CHACHA20_BLOCK_SIZE); + bytes -= CHACHA20_BLOCK_SIZE; + dst += CHACHA20_BLOCK_SIZE; + } + if (bytes) { + chacha20_block_generic(ctx, buf); + crypto_xor(dst, (uint8 *)buf, bytes); + } +} + +struct poly1305_ctx { + uint8 opaque[24 * sizeof(uint64)]; + uint32 nonce[4]; + uint8 data[POLY1305_BLOCK_SIZE]; + size_t num; +}; + +#if !(defined(CONFIG_X86_64) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || (defined(CONFIG_MIPS) && defined(CONFIG_64BIT))) +struct poly1305_internal { + uint32 h[5]; + uint32 r[4]; +}; + +static void poly1305_init_generic(void *ctx, const uint8 key[16]) { + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + + /* h = 0 */ + st->h[0] = 0; + st->h[1] = 0; + st->h[2] = 0; + st->h[3] = 0; + st->h[4] = 0; + + /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ + st->r[0] = ReadLE32(&key[ 0]) & 0x0fffffff; + st->r[1] = ReadLE32(&key[ 4]) & 0x0ffffffc; + st->r[2] = ReadLE32(&key[ 8]) & 0x0ffffffc; + 
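
/* Note: the masks above implement the standard Poly1305 clamp (RFC 8439),
 * i.e. r &= 0x0ffffffc0ffffffc0ffffffc0fffffff, equivalent to the byte-wise form
 *   key[3] &= 15; key[7] &= 15; key[11] &= 15; key[15] &= 15;
 *   key[4] &= 252; key[8] &= 252; key[12] &= 252;
 * Each 32-bit word of r keeps its top four bits clear (words 1..3 also clear
 * their low two bits), so r < 2^124 and the 64-bit partial products
 * accumulated in poly1305_blocks_generic() below cannot overflow. */
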
st->r[3] = ReadLE32(&key[12]) & 0x0ffffffc; +} + +static void poly1305_blocks_generic(void *ctx, const uint8 *inp, size_t len, uint32 padbit) +{ +#define CONSTANT_TIME_CARRY(a,b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1)) + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + uint32 r0, r1, r2, r3; + uint32 s1, s2, s3; + uint32 h0, h1, h2, h3, h4, c; + uint64 d0, d1, d2, d3; + + r0 = st->r[0]; + r1 = st->r[1]; + r2 = st->r[2]; + r3 = st->r[3]; + + s1 = r1 + (r1 >> 2); + s2 = r2 + (r2 >> 2); + s3 = r3 + (r3 >> 2); + + h0 = st->h[0]; + h1 = st->h[1]; + h2 = st->h[2]; + h3 = st->h[3]; + h4 = st->h[4]; + + while (len >= POLY1305_BLOCK_SIZE) { + /* h += m[i] */ + h0 = (uint32)(d0 = (uint64)h0 + ReadLE32(inp + 0)); + h1 = (uint32)(d1 = (uint64)h1 + (d0 >> 32) + ReadLE32(inp + 4)); + h2 = (uint32)(d2 = (uint64)h2 + (d1 >> 32) + ReadLE32(inp + 8)); + h3 = (uint32)(d3 = (uint64)h3 + (d2 >> 32) + ReadLE32(inp + 12)); + h4 += (uint32)(d3 >> 32) + padbit; + + /* h *= r "%" p, where "%" stands for "partial remainder" */ + d0 = ((uint64)h0 * r0) + + ((uint64)h1 * s3) + + ((uint64)h2 * s2) + + ((uint64)h3 * s1); + d1 = ((uint64)h0 * r1) + + ((uint64)h1 * r0) + + ((uint64)h2 * s3) + + ((uint64)h3 * s2) + + (h4 * s1); + d2 = ((uint64)h0 * r2) + + ((uint64)h1 * r1) + + ((uint64)h2 * r0) + + ((uint64)h3 * s3) + + (h4 * s2); + d3 = ((uint64)h0 * r3) + + ((uint64)h1 * r2) + + ((uint64)h2 * r1) + + ((uint64)h3 * r0) + + (h4 * s3); + h4 = (h4 * r0); + + /* last reduction step: */ + /* a) h4:h0 = h4<<128 + d3<<96 + d2<<64 + d1<<32 + d0 */ + h0 = (uint32)d0; + h1 = (uint32)(d1 += d0 >> 32); + h2 = (uint32)(d2 += d1 >> 32); + h3 = (uint32)(d3 += d2 >> 32); + h4 += (uint32)(d3 >> 32); + /* b) (h4:h0 += (h4:h0>>130) * 5) %= 2^130 */ + c = (h4 >> 2) + (h4 & ~3U); + h4 &= 3; + h0 += c; + h1 += (c = CONSTANT_TIME_CARRY(h0,c)); + h2 += (c = CONSTANT_TIME_CARRY(h1,c)); + h3 += (c = CONSTANT_TIME_CARRY(h2,c)); + h4 += CONSTANT_TIME_CARRY(h3,c); + /* + * Occasional overflows to 3rd bit of h4 are taken care of + * "naturally". If after this point we end up at the top of + * this loop, then the overflow bit will be accounted for + * in next iteration. If we end up in poly1305_emit, then + * comparison to modulus below will still count as "carry + * into 131st bit", so that properly reduced value will be + * picked in conditional move. 
+ */ + + inp += POLY1305_BLOCK_SIZE; + len -= POLY1305_BLOCK_SIZE; + } + + st->h[0] = h0; + st->h[1] = h1; + st->h[2] = h2; + st->h[3] = h3; + st->h[4] = h4; +#undef CONSTANT_TIME_CARRY +} + +static void poly1305_emit_generic(void *ctx, uint8 mac[16], const uint32 nonce[4]) +{ + struct poly1305_internal *st = (struct poly1305_internal *)ctx; + uint32 *omac = (uint32 *)mac; + uint32 h0, h1, h2, h3, h4; + uint32 g0, g1, g2, g3, g4; + uint64 t; + uint32 mask; + + h0 = st->h[0]; + h1 = st->h[1]; + h2 = st->h[2]; + h3 = st->h[3]; + h4 = st->h[4]; + + /* compare to modulus by computing h + -p */ + g0 = (uint32)(t = (uint64)h0 + 5); + g1 = (uint32)(t = (uint64)h1 + (t >> 32)); + g2 = (uint32)(t = (uint64)h2 + (t >> 32)); + g3 = (uint32)(t = (uint64)h3 + (t >> 32)); + g4 = h4 + (uint32)(t >> 32); + + /* if there was carry into 131st bit, h3:h0 = g3:g0 */ + mask = 0 - (g4 >> 2); + g0 &= mask; + g1 &= mask; + g2 &= mask; + g3 &= mask; + mask = ~mask; + h0 = (h0 & mask) | g0; + h1 = (h1 & mask) | g1; + h2 = (h2 & mask) | g2; + h3 = (h3 & mask) | g3; + + /* mac = (h + nonce) % (2^128) */ + h0 = (uint32)(t = (uint64)h0 + nonce[0]); + h1 = (uint32)(t = (uint64)h1 + (t >> 32) + nonce[1]); + h2 = (uint32)(t = (uint64)h2 + (t >> 32) + nonce[2]); + h3 = (uint32)(t = (uint64)h3 + (t >> 32) + nonce[3]); + + omac[0] = ToLE32(h0); + omac[1] = ToLE32(h1); + omac[2] = ToLE32(h2); + omac[3] = ToLE32(h3); +} +#endif + +SAFEBUFFERS static void poly1305_init(struct poly1305_ctx *ctx, const uint8 key[POLY1305_KEY_SIZE]) +{ + ctx->nonce[0] = ReadLE32(&key[16]); + ctx->nonce[1] = ReadLE32(&key[20]); + ctx->nonce[2] = ReadLE32(&key[24]); + ctx->nonce[3] = ReadLE32(&key[28]); + +#if defined(ARCH_CPU_X86_64) + poly1305_init_x86_64(ctx->opaque, key); +#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64) + poly1305_init_arm(ctx->opaque, key); +#elif defined(CONFIG_MIPS) && defined(CONFIG_64BIT) + poly1305_init_mips(ctx->opaque, key); +#else + poly1305_init_generic(ctx->opaque, key); +#endif + ctx->num = 0; +} + +static inline void poly1305_blocks(void *ctx, const uint8 *inp, size_t len, uint32 padbit) +{ +#if defined(ARCH_CPU_X86_64) +#if CHACHA20_WITH_AVX512 + if(X86_PCAP_AVX512F) + poly1305_blocks_avx512(ctx, inp, len, padbit); + else +#endif // CHACHA20_WITH_AVX512 + if (X86_PCAP_AVX2) + poly1305_blocks_avx2(ctx, inp, len, padbit); + else if (X86_PCAP_AVX) + poly1305_blocks_avx(ctx, inp, len, padbit); + else + poly1305_blocks_x86_64(ctx, inp, len, padbit); +#else // defined(ARCH_CPU_X86_64) + poly1305_blocks_generic(ctx, inp, len, padbit); +#endif // defined(ARCH_CPU_X86_64) +} + +static inline void poly1305_emit(void *ctx, uint8 mac[16], const uint32 nonce[4]) +{ +#if defined(ARCH_CPU_X86_64) + if (X86_PCAP_AVX) + poly1305_emit_avx(ctx, mac, nonce); + else + poly1305_emit_x86_64(ctx, mac, nonce); +#else // defined(ARCH_CPU_X86_64) + poly1305_emit_generic(ctx, mac, nonce); +#endif // defined(ARCH_CPU_X86_64) +} + +SAFEBUFFERS static void poly1305_update(struct poly1305_ctx *ctx, const uint8 *inp, size_t len) +{ + const size_t num = ctx->num; + size_t rem; + + if (num) { + rem = POLY1305_BLOCK_SIZE - num; + if (len >= rem) { + memcpy(ctx->data + num, inp, rem); + poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1); + inp += rem; + len -= rem; + } else { + /* Still not enough data to process a block. 
*/ + memcpy(ctx->data + num, inp, len); + ctx->num = num + len; + return; + } + } + + rem = len % POLY1305_BLOCK_SIZE; + len -= rem; + + if (len >= POLY1305_BLOCK_SIZE) { + poly1305_blocks(ctx->opaque, inp, len, 1); + inp += len; + } + + if (rem) + memcpy(ctx->data, inp, rem); + + ctx->num = rem; +} + +SAFEBUFFERS static void poly1305_finish(struct poly1305_ctx *ctx, uint8 mac[16]) +{ + size_t num = ctx->num; + + if (num) { + ctx->data[num++] = 1; /* pad bit */ + while (num < POLY1305_BLOCK_SIZE) + ctx->data[num++] = 0; + poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 0); + } + + poly1305_emit(ctx->opaque, mac, ctx->nonce); + + /* zero out the state */ + memzero_crypto(ctx, sizeof(*ctx)); +} + +static const uint8 pad0[16] = { 0 }; + +SAFEBUFFERS static FORCEINLINE void poly1305_getmac(const uint8 *ad, size_t ad_len, const uint8 *src, size_t src_len, const uint8 key[POLY1305_KEY_SIZE], uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + uint64 len[2]; + struct poly1305_ctx poly1305_state; + + poly1305_init(&poly1305_state, key); + poly1305_update(&poly1305_state, ad, ad_len); + poly1305_update(&poly1305_state, pad0, (0 - ad_len) & 0xf); + poly1305_update(&poly1305_state, src, src_len); + poly1305_update(&poly1305_state, pad0, (0 - src_len) & 0xf); + len[0] = ToLE64(ad_len); + len[1] = ToLE64(src_len); + poly1305_update(&poly1305_state, (uint8 *)&len, sizeof(len)); + poly1305_finish(&poly1305_state, mac); +} + +struct ChaChaState { + struct chacha20_ctx chacha20_state; + uint8 block0[CHACHA20_BLOCK_SIZE]; +}; + +static inline void InitializeChaChaState(ChaChaState *st, const uint8 key[CHACHA20POLY1305_KEYLEN], uint64 nonce) { + uint64 le_nonce = ToLE64(nonce); + WriteLE64((uint8*)st, 0x3320646e61707865); + WriteLE64((uint8*)st + 8, 0x6b20657479622d32); + Write64((uint8*)st + 16, Read64(key + 0)); + Write64((uint8*)st + 24, Read64(key + 8)); + Write64((uint8*)st + 32, Read64(key + 16)); + Write64((uint8*)st + 40, Read64(key + 24)); + Write64((uint8*)st + 48, 0); + Write64((uint8*)st + 56, Read64((uint8*)&le_nonce)); + + Write64((uint8*)st + 64 + 0 * 8, 0); + Write64((uint8*)st + 64 + 1 * 8, 0); + Write64((uint8*)st + 64 + 2 * 8, 0); + Write64((uint8*)st + 64 + 3 * 8, 0); + Write64((uint8*)st + 64 + 4 * 8, 0); + Write64((uint8*)st + 64 + 5 * 8, 0); + Write64((uint8*)st + 64 + 6 * 8, 0); + Write64((uint8*)st + 64 + 7 * 8, 0); +} + +SAFEBUFFERS void poly1305_get_mac(const uint8 *src, size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len); + poly1305_getmac(ad, ad_len, dst, src_len, st.block0, dst + src_len); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 
key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) { + ChaChaState st; + + InitializeChaChaState(&st, key, nonce); + chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0)); + poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac); + chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len); + memzero_crypto(&st, sizeof(st)); +} + +SAFEBUFFERS bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) { + uint8 mac[POLY1305_MAC_SIZE]; + + if (src_len < CHACHA20POLY1305_AUTHTAGLEN) + return false; + chacha20poly1305_decrypt_get_mac(dst, src, src_len - CHACHA20POLY1305_AUTHTAGLEN, ad, ad_len, nonce, key, mac); + return memcmp_crypto(mac, src + src_len - CHACHA20POLY1305_AUTHTAGLEN, CHACHA20POLY1305_AUTHTAGLEN) == 0; +} + +void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]) +{ + __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN]; + + hchacha20(derived_key, nonce, key); + chacha20poly1305_encrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key); + memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN); +} + +bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]) { + bool ret; + __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN]; + + hchacha20(derived_key, nonce, key); + ret = chacha20poly1305_decrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key); + memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN); + + return ret; +} + diff --git a/crypto/chacha20poly1305.h b/crypto/chacha20poly1305.h new file mode 100644 index 0000000..90b701d --- /dev/null +++ b/crypto/chacha20poly1305.h @@ -0,0 +1,39 @@ +#pragma once +#include "tunsafe_types.h" + + +enum { + XCHACHA20POLY1305_NONCELEN = 24, + CHACHA20POLY1305_KEYLEN = 32, + CHACHA20POLY1305_AUTHTAGLEN = 16 +}; + + +void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]); + +bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]); + +void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]); + + +void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]); + +bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint8 nonce[XCHACHA20POLY1305_NONCELEN], + const uint8 key[CHACHA20POLY1305_KEYLEN]); + +void poly1305_get_mac(const uint8 *src, size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN], + uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]); \ No newline at end of file diff --git 
a/crypto/curve25519-donna.cpp b/crypto/curve25519-donna.cpp new file mode 100644 index 0000000..a8f5cbe --- /dev/null +++ b/crypto/curve25519-donna.cpp @@ -0,0 +1,737 @@ +/* Copyright 2008, Google Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following disclaimer + * in the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Google Inc. nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * curve25519-donna: Curve25519 elliptic curve, public key function + * + * http://code.google.com/p/curve25519-donna/ + * + * Adam Langley + * + * Derived from public domain C code by Daniel J. Bernstein + * + * More information about curve25519 can be found here + * http://cr.yp.to/ecdh.html + * + * djb's sample implementation of curve25519 is written in a special assembly + * language called qhasm and uses the floating point registers. + * + * This is, almost, a clean room reimplementation from the curve25519 paper. It + * uses many of the tricks described therein. Only the crecip function is taken + * from the sample implementation. + */ + +#include <string.h> +#include <stdint.h> + +#ifdef _MSC_VER +#define inline __inline +#endif + +typedef uint8_t u8; +typedef int32_t s32; +typedef int64_t limb; + +/* Field element representation: + * + * Field elements are written as an array of signed, 64-bit limbs, least + * significant first. The value of the field element is: + * x[0] + 2^26·x[1] + 2^51·x[2] + 2^77·x[3] + ... + * + * i.e. the limbs are 26, 25, 26, 25, ... bits wide. + */ + +/* Sum two numbers: output += in */ +static void fsum(limb *output, const limb *in) { + unsigned i; + for (i = 0; i < 10; i += 2) { + output[0+i] = (output[0+i] + in[0+i]); + output[1+i] = (output[1+i] + in[1+i]); + } +} + +/* Find the difference of two numbers: output = in - output + * (note the order of the arguments!)
+ */ +static void fdifference(limb *output, const limb *in) { + unsigned i; + for (i = 0; i < 10; ++i) { + output[i] = (in[i] - output[i]); + } +} + +/* Multiply a number by a scalar: output = in * scalar */ +static void fscalar_product(limb *output, const limb *in, const limb scalar) { + unsigned i; + for (i = 0; i < 10; ++i) { + output[i] = in[i] * scalar; + } +} + +/* Multiply two numbers: output = in2 * in + * + * output must be distinct to both inputs. The inputs are reduced coefficient + * form, the output is not. + */ +static void fproduct(limb *output, const limb *in2, const limb *in) { + output[0] = ((limb) ((s32) in2[0])) * ((s32) in[0]); + output[1] = ((limb) ((s32) in2[0])) * ((s32) in[1]) + + ((limb) ((s32) in2[1])) * ((s32) in[0]); + output[2] = 2 * ((limb) ((s32) in2[1])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[2]) + + ((limb) ((s32) in2[2])) * ((s32) in[0]); + output[3] = ((limb) ((s32) in2[1])) * ((s32) in[2]) + + ((limb) ((s32) in2[2])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[0]); + output[4] = ((limb) ((s32) in2[2])) * ((s32) in[2]) + + 2 * (((limb) ((s32) in2[1])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[1])) + + ((limb) ((s32) in2[0])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[0]); + output[5] = ((limb) ((s32) in2[2])) * ((s32) in[3]) + + ((limb) ((s32) in2[3])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[0]); + output[6] = 2 * (((limb) ((s32) in2[3])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[1])) + + ((limb) ((s32) in2[2])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[2]) + + ((limb) ((s32) in2[0])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[0]); + output[7] = ((limb) ((s32) in2[3])) * ((s32) in[4]) + + ((limb) ((s32) in2[4])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[0]); + output[8] = ((limb) ((s32) in2[4])) * ((s32) in[4]) + + 2 * (((limb) ((s32) in2[3])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[1])) + + ((limb) ((s32) in2[2])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[2]) + + ((limb) ((s32) in2[0])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[0]); + output[9] = ((limb) ((s32) in2[4])) * ((s32) in[5]) + + ((limb) ((s32) in2[5])) * ((s32) in[4]) + + ((limb) ((s32) in2[3])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[2]) + + ((limb) ((s32) in2[1])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[1]) + + ((limb) ((s32) in2[0])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[0]); + output[10] = 2 * (((limb) ((s32) in2[5])) * ((s32) in[5]) + + ((limb) ((s32) in2[3])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[3]) + + ((limb) ((s32) in2[1])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[1])) + + ((limb) ((s32) in2[4])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[4]) + + ((limb) ((s32) in2[2])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * 
((s32) in[2]); + output[11] = ((limb) ((s32) in2[5])) * ((s32) in[6]) + + ((limb) ((s32) in2[6])) * ((s32) in[5]) + + ((limb) ((s32) in2[4])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[4]) + + ((limb) ((s32) in2[3])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[3]) + + ((limb) ((s32) in2[2])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[2]); + output[12] = ((limb) ((s32) in2[6])) * ((s32) in[6]) + + 2 * (((limb) ((s32) in2[5])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[5]) + + ((limb) ((s32) in2[3])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[3])) + + ((limb) ((s32) in2[4])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[4]); + output[13] = ((limb) ((s32) in2[6])) * ((s32) in[7]) + + ((limb) ((s32) in2[7])) * ((s32) in[6]) + + ((limb) ((s32) in2[5])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[5]) + + ((limb) ((s32) in2[4])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[4]); + output[14] = 2 * (((limb) ((s32) in2[7])) * ((s32) in[7]) + + ((limb) ((s32) in2[5])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[5])) + + ((limb) ((s32) in2[6])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[6]); + output[15] = ((limb) ((s32) in2[7])) * ((s32) in[8]) + + ((limb) ((s32) in2[8])) * ((s32) in[7]) + + ((limb) ((s32) in2[6])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[6]); + output[16] = ((limb) ((s32) in2[8])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in2[7])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[7])); + output[17] = ((limb) ((s32) in2[8])) * ((s32) in[9]) + + ((limb) ((s32) in2[9])) * ((s32) in[8]); + output[18] = 2 * ((limb) ((s32) in2[9])) * ((s32) in[9]); +} + +/* Reduce a long form to a short form by taking the input mod 2^255 - 19. */ +static void freduce_degree(limb *output) { + /* Each of these shifts and adds ends up multiplying the value by 19. */ + output[8] += output[18] << 4; + output[8] += output[18] << 1; + output[8] += output[18]; + output[7] += output[17] << 4; + output[7] += output[17] << 1; + output[7] += output[17]; + output[6] += output[16] << 4; + output[6] += output[16] << 1; + output[6] += output[16]; + output[5] += output[15] << 4; + output[5] += output[15] << 1; + output[5] += output[15]; + output[4] += output[14] << 4; + output[4] += output[14] << 1; + output[4] += output[14]; + output[3] += output[13] << 4; + output[3] += output[13] << 1; + output[3] += output[13]; + output[2] += output[12] << 4; + output[2] += output[12] << 1; + output[2] += output[12]; + output[1] += output[11] << 4; + output[1] += output[11] << 1; + output[1] += output[11]; + output[0] += output[10] << 4; + output[0] += output[10] << 1; + output[0] += output[10]; +} + +#if (-1 & 3) != 3 +#error "This code only works on a two's complement system" +#endif + +/* return v / 2^26, using only shifts and adds. */ +static inline limb +div_by_2_26(const limb v) +{ + /* High word of v; no shift needed*/ + const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32); + /* Set to all 1s if v was negative; else set to 0s. */ + const int32_t sign = ((int32_t) highword) >> 31; + /* Set to 0x3ffffff if v was negative; else set to 0. */ + const int32_t roundoff = ((uint32_t) sign) >> 6; + /* Should return v / (1<<26) */ + return (v + roundoff) >> 26; +} + +/* return v / (2^25), using only shifts and adds. 
*/ +static inline limb +div_by_2_25(const limb v) +{ + /* High word of v; no shift needed*/ + const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32); + /* Set to all 1s if v was negative; else set to 0s. */ + const int32_t sign = ((int32_t) highword) >> 31; + /* Set to 0x1ffffff if v was negative; else set to 0. */ + const int32_t roundoff = ((uint32_t) sign) >> 7; + /* Should return v / (1<<25) */ + return (v + roundoff) >> 25; +} + +static inline s32 +div_s32_by_2_25(const s32 v) +{ + const s32 roundoff = ((uint32_t)(v >> 31)) >> 7; + return (v + roundoff) >> 25; +} + +/* Reduce all coefficients of the short form input so that |x| < 2^26. + * + * On entry: |output[i]| < 2^62 + */ +static void freduce_coefficients(limb *output) { + unsigned i; + + output[10] = 0; + + for (i = 0; i < 10; i += 2) { + limb over = div_by_2_26(output[i]); + output[i] -= over << 26; + output[i+1] += over; + + over = div_by_2_25(output[i+1]); + output[i+1] -= over << 25; + output[i+2] += over; + } + /* Now |output[10]| < 2 ^ 38 and all other coefficients are reduced. */ + output[0] += output[10] << 4; + output[0] += output[10] << 1; + output[0] += output[10]; + + output[10] = 0; + + /* Now output[1..9] are reduced, and |output[0]| < 2^26 + 19 * 2^38 + * So |over| will be no more than 77825 */ + { + limb over = div_by_2_26(output[0]); + output[0] -= over << 26; + output[1] += over; + } + + /* Now output[0,2..9] are reduced, and |output[1]| < 2^25 + 77825 + * So |over| will be no more than 1. */ + { + /* output[1] fits in 32 bits, so we can use div_s32_by_2_25 here. */ + s32 over32 = div_s32_by_2_25((s32) output[1]); + output[1] -= over32 << 25; + output[2] += over32; + } + + /* Finally, output[0,1,3..9] are reduced, and output[2] is "nearly reduced": + * we have |output[2]| <= 2^26. This is good enough for all of our math, + * but it will require an extra freduce_coefficients before fcontract. */ +} + +/* A helpful wrapper around fproduct: output = in * in2. + * + * output must be distinct to both inputs. The output is reduced degree and + * reduced coefficient. 
+ */ +static void +fmul(limb *output, const limb *in, const limb *in2) { + limb t[19]; + fproduct(t, in, in2); + freduce_degree(t); + freduce_coefficients(t); + memcpy(output, t, sizeof(limb) * 10); +} + +static void fsquare_inner(limb *output, const limb *in) { + output[0] = ((limb) ((s32) in[0])) * ((s32) in[0]); + output[1] = 2 * ((limb) ((s32) in[0])) * ((s32) in[1]); + output[2] = 2 * (((limb) ((s32) in[1])) * ((s32) in[1]) + + ((limb) ((s32) in[0])) * ((s32) in[2])); + output[3] = 2 * (((limb) ((s32) in[1])) * ((s32) in[2]) + + ((limb) ((s32) in[0])) * ((s32) in[3])); + output[4] = ((limb) ((s32) in[2])) * ((s32) in[2]) + + 4 * ((limb) ((s32) in[1])) * ((s32) in[3]) + + 2 * ((limb) ((s32) in[0])) * ((s32) in[4]); + output[5] = 2 * (((limb) ((s32) in[2])) * ((s32) in[3]) + + ((limb) ((s32) in[1])) * ((s32) in[4]) + + ((limb) ((s32) in[0])) * ((s32) in[5])); + output[6] = 2 * (((limb) ((s32) in[3])) * ((s32) in[3]) + + ((limb) ((s32) in[2])) * ((s32) in[4]) + + ((limb) ((s32) in[0])) * ((s32) in[6]) + + 2 * ((limb) ((s32) in[1])) * ((s32) in[5])); + output[7] = 2 * (((limb) ((s32) in[3])) * ((s32) in[4]) + + ((limb) ((s32) in[2])) * ((s32) in[5]) + + ((limb) ((s32) in[1])) * ((s32) in[6]) + + ((limb) ((s32) in[0])) * ((s32) in[7])); + output[8] = ((limb) ((s32) in[4])) * ((s32) in[4]) + + 2 * (((limb) ((s32) in[2])) * ((s32) in[6]) + + ((limb) ((s32) in[0])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[1])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[5]))); + output[9] = 2 * (((limb) ((s32) in[4])) * ((s32) in[5]) + + ((limb) ((s32) in[3])) * ((s32) in[6]) + + ((limb) ((s32) in[2])) * ((s32) in[7]) + + ((limb) ((s32) in[1])) * ((s32) in[8]) + + ((limb) ((s32) in[0])) * ((s32) in[9])); + output[10] = 2 * (((limb) ((s32) in[5])) * ((s32) in[5]) + + ((limb) ((s32) in[4])) * ((s32) in[6]) + + ((limb) ((s32) in[2])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[3])) * ((s32) in[7]) + + ((limb) ((s32) in[1])) * ((s32) in[9]))); + output[11] = 2 * (((limb) ((s32) in[5])) * ((s32) in[6]) + + ((limb) ((s32) in[4])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[8]) + + ((limb) ((s32) in[2])) * ((s32) in[9])); + output[12] = ((limb) ((s32) in[6])) * ((s32) in[6]) + + 2 * (((limb) ((s32) in[4])) * ((s32) in[8]) + + 2 * (((limb) ((s32) in[5])) * ((s32) in[7]) + + ((limb) ((s32) in[3])) * ((s32) in[9]))); + output[13] = 2 * (((limb) ((s32) in[6])) * ((s32) in[7]) + + ((limb) ((s32) in[5])) * ((s32) in[8]) + + ((limb) ((s32) in[4])) * ((s32) in[9])); + output[14] = 2 * (((limb) ((s32) in[7])) * ((s32) in[7]) + + ((limb) ((s32) in[6])) * ((s32) in[8]) + + 2 * ((limb) ((s32) in[5])) * ((s32) in[9])); + output[15] = 2 * (((limb) ((s32) in[7])) * ((s32) in[8]) + + ((limb) ((s32) in[6])) * ((s32) in[9])); + output[16] = ((limb) ((s32) in[8])) * ((s32) in[8]) + + 4 * ((limb) ((s32) in[7])) * ((s32) in[9]); + output[17] = 2 * ((limb) ((s32) in[8])) * ((s32) in[9]); + output[18] = 2 * ((limb) ((s32) in[9])) * ((s32) in[9]); +} + +static void +fsquare(limb *output, const limb *in) { + limb t[19]; + fsquare_inner(t, in); + freduce_degree(t); + freduce_coefficients(t); + memcpy(output, t, sizeof(limb) * 10); +} + +/* Take a little-endian, 32-byte number and expand it into polynomial form */ +static void +fexpand(limb *output, const u8 *input) { +#define F(n,start,shift,mask) \ + output[n] = ((((limb) input[start + 0]) | \ + ((limb) input[start + 1]) << 8 | \ + ((limb) input[start + 2]) << 16 | \ + ((limb) input[start + 3]) << 24) >> shift) & mask; + F(0, 0, 0, 0x3ffffff); + F(1, 3, 2, 0x1ffffff); 
+ F(2, 6, 3, 0x3ffffff); + F(3, 9, 5, 0x1ffffff); + F(4, 12, 6, 0x3ffffff); + F(5, 16, 0, 0x1ffffff); + F(6, 19, 1, 0x3ffffff); + F(7, 22, 3, 0x1ffffff); + F(8, 25, 4, 0x3ffffff); + F(9, 28, 6, 0x3ffffff); +#undef F +} + +#if (-32 >> 1) != -16 +#error "This code only works when >> does sign-extension on negative numbers" +#endif + +/* Take a fully reduced polynomial form number and contract it into a + * little-endian, 32-byte array + */ +static void +fcontract(u8 *output, limb *input) { + int i; + int j; + + for (j = 0; j < 2; ++j) { + for (i = 0; i < 9; ++i) { + if ((i & 1) == 1) { + /* This calculation is a time-invariant way to make input[i] positive + by borrowing from the next-larger limb. + */ + const s32 mask = (s32)(input[i]) >> 31; + const s32 carry = -(((s32)(input[i]) & mask) >> 25); + input[i] = (s32)(input[i]) + (carry << 25); + input[i+1] = (s32)(input[i+1]) - carry; + } else { + const s32 mask = (s32)(input[i]) >> 31; + const s32 carry = -(((s32)(input[i]) & mask) >> 26); + input[i] = (s32)(input[i]) + (carry << 26); + input[i+1] = (s32)(input[i+1]) - carry; + } + } + { + const s32 mask = (s32)(input[9]) >> 31; + const s32 carry = -(((s32)(input[9]) & mask) >> 25); + input[9] = (s32)(input[9]) + (carry << 25); + input[0] = (s32)(input[0]) - (carry * 19); + } + } + + /* The first borrow-propagation pass above ended with every limb + except (possibly) input[0] non-negative. + + Since each input limb except input[0] is decreased by at most 1 + by a borrow-propagation pass, the second borrow-propagation pass + could only have wrapped around to decrease input[0] again if the + first pass left input[0] negative *and* input[1] through input[9] + were all zero. In that case, input[1] is now 2^25 - 1, and this + last borrow-propagation step will leave input[1] non-negative. + */ + { + const s32 mask = (s32)(input[0]) >> 31; + const s32 carry = -(((s32)(input[0]) & mask) >> 26); + input[0] = (s32)(input[0]) + (carry << 26); + input[1] = (s32)(input[1]) - carry; + } + + /* Both passes through the above loop, plus the last 0-to-1 step, are + necessary: if input[9] is -1 and input[0] through input[8] are 0, + negative values will remain in the array until the end. 
+ */ + + input[1] <<= 2; + input[2] <<= 3; + input[3] <<= 5; + input[4] <<= 6; + input[6] <<= 1; + input[7] <<= 3; + input[8] <<= 4; + input[9] <<= 6; +#define F(i, s) \ + output[s+0] |= input[i] & 0xff; \ + output[s+1] = (input[i] >> 8) & 0xff; \ + output[s+2] = (input[i] >> 16) & 0xff; \ + output[s+3] = (input[i] >> 24) & 0xff; + output[0] = 0; + output[16] = 0; + F(0,0); + F(1,3); + F(2,6); + F(3,9); + F(4,12); + F(5,16); + F(6,19); + F(7,22); + F(8,25); + F(9,28); +#undef F +} + +/* Input: Q, Q', Q-Q' + * Output: 2Q, Q+Q' + * + * x2 z2: long form + * x3 z3: long form + * x z: short form, destroyed + * xprime zprime: short form, destroyed + * qmqp: short form, preserved + */ +static void fmonty(limb *x2, limb *z2, /* output 2Q */ + limb *x3, limb *z3, /* output Q + Q' */ + limb *x, limb *z, /* input Q */ + limb *xprime, limb *zprime, /* input Q' */ + const limb *qmqp /* input Q - Q' */) { + limb origx[10], origxprime[10], zzz[19], xx[19], zz[19], xxprime[19], + zzprime[19], zzzprime[19], xxxprime[19]; + + memcpy(origx, x, 10 * sizeof(limb)); + fsum(x, z); + fdifference(z, origx); // does x - z + + memcpy(origxprime, xprime, sizeof(limb) * 10); + fsum(xprime, zprime); + fdifference(zprime, origxprime); + fproduct(xxprime, xprime, z); + fproduct(zzprime, x, zprime); + freduce_degree(xxprime); + freduce_coefficients(xxprime); + freduce_degree(zzprime); + freduce_coefficients(zzprime); + memcpy(origxprime, xxprime, sizeof(limb) * 10); + fsum(xxprime, zzprime); + fdifference(zzprime, origxprime); + fsquare(xxxprime, xxprime); + fsquare(zzzprime, zzprime); + fproduct(zzprime, zzzprime, qmqp); + freduce_degree(zzprime); + freduce_coefficients(zzprime); + memcpy(x3, xxxprime, sizeof(limb) * 10); + memcpy(z3, zzprime, sizeof(limb) * 10); + + fsquare(xx, x); + fsquare(zz, z); + fproduct(x2, xx, zz); + freduce_degree(x2); + freduce_coefficients(x2); + fdifference(zz, xx); // does zz = xx - zz + memset(zzz + 10, 0, sizeof(limb) * 9); + fscalar_product(zzz, zz, 121665); + /* No need to call freduce_degree here: + fscalar_product doesn't increase the degree of its input. */ + freduce_coefficients(zzz); + fsum(zzz, xx); + fproduct(z2, zz, zzz); + freduce_degree(z2); + freduce_coefficients(z2); +} + +/* Conditionally swap two reduced-form limb arrays if 'iswap' is 1, but leave + * them unchanged if 'iswap' is 0. Runs in data-invariant time to avoid + * side-channel attacks. + * + * NOTE that this function requires that 'iswap' be 1 or 0; other values give + * wrong results. Also, the two limb arrays must be in reduced-coefficient, + * reduced-degree form: the values in a[10..19] or b[10..19] aren't swapped, + * and all values in a[0..9],b[0..9] must have magnitude less than + * INT32_MAX.
+ */ +static void +swap_conditional(limb a[19], limb b[19], limb iswap) { + unsigned i; + const s32 swap = (s32) -iswap; + + for (i = 0; i < 10; ++i) { + const s32 x = swap & ( ((s32)a[i]) ^ ((s32)b[i]) ); + a[i] = ((s32)a[i]) ^ x; + b[i] = ((s32)b[i]) ^ x; + } +} + +/* Calculates nQ where Q is the x-coordinate of a point on the curve + * + * resultx/resultz: the x coordinate of the resulting curve point (short form) + * n: a little endian, 32-byte number + * q: a point of the curve (short form) + */ +static void +cmult(limb *resultx, limb *resultz, const u8 *n, const limb *q) { + limb a[19] = {0}, b[19] = {1}, c[19] = {1}, d[19] = {0}; + limb *nqpqx = a, *nqpqz = b, *nqx = c, *nqz = d, *t; + limb e[19] = {0}, f[19] = {1}, g[19] = {0}, h[19] = {1}; + limb *nqpqx2 = e, *nqpqz2 = f, *nqx2 = g, *nqz2 = h; + + unsigned i, j; + + memcpy(nqpqx, q, sizeof(limb) * 10); + + for (i = 0; i < 32; ++i) { + u8 byte = n[31 - i]; + for (j = 0; j < 8; ++j) { + const limb bit = byte >> 7; + + swap_conditional(nqx, nqpqx, bit); + swap_conditional(nqz, nqpqz, bit); + fmonty(nqx2, nqz2, + nqpqx2, nqpqz2, + nqx, nqz, + nqpqx, nqpqz, + q); + swap_conditional(nqx2, nqpqx2, bit); + swap_conditional(nqz2, nqpqz2, bit); + + t = nqx; + nqx = nqx2; + nqx2 = t; + t = nqz; + nqz = nqz2; + nqz2 = t; + t = nqpqx; + nqpqx = nqpqx2; + nqpqx2 = t; + t = nqpqz; + nqpqz = nqpqz2; + nqpqz2 = t; + + byte <<= 1; + } + } + + memcpy(resultx, nqx, sizeof(limb) * 10); + memcpy(resultz, nqz, sizeof(limb) * 10); +} + +// ----------------------------------------------------------------------------- +// Shamelessly copied from djb's code +// ----------------------------------------------------------------------------- +static void +crecip(limb *out, const limb *z) { + limb z2[10]; + limb z9[10]; + limb z11[10]; + limb z2_5_0[10]; + limb z2_10_0[10]; + limb z2_20_0[10]; + limb z2_50_0[10]; + limb z2_100_0[10]; + limb t0[10]; + limb t1[10]; + int i; + + /* 2 */ fsquare(z2,z); + /* 4 */ fsquare(t1,z2); + /* 8 */ fsquare(t0,t1); + /* 9 */ fmul(z9,t0,z); + /* 11 */ fmul(z11,z9,z2); + /* 22 */ fsquare(t0,z11); + /* 2^5 - 2^0 = 31 */ fmul(z2_5_0,t0,z9); + + /* 2^6 - 2^1 */ fsquare(t0,z2_5_0); + /* 2^7 - 2^2 */ fsquare(t1,t0); + /* 2^8 - 2^3 */ fsquare(t0,t1); + /* 2^9 - 2^4 */ fsquare(t1,t0); + /* 2^10 - 2^5 */ fsquare(t0,t1); + /* 2^10 - 2^0 */ fmul(z2_10_0,t0,z2_5_0); + + /* 2^11 - 2^1 */ fsquare(t0,z2_10_0); + /* 2^12 - 2^2 */ fsquare(t1,t0); + /* 2^20 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^20 - 2^0 */ fmul(z2_20_0,t1,z2_10_0); + + /* 2^21 - 2^1 */ fsquare(t0,z2_20_0); + /* 2^22 - 2^2 */ fsquare(t1,t0); + /* 2^40 - 2^20 */ for (i = 2;i < 20;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^40 - 2^0 */ fmul(t0,t1,z2_20_0); + + /* 2^41 - 2^1 */ fsquare(t1,t0); + /* 2^42 - 2^2 */ fsquare(t0,t1); + /* 2^50 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t1,t0); fsquare(t0,t1); } + /* 2^50 - 2^0 */ fmul(z2_50_0,t0,z2_10_0); + + /* 2^51 - 2^1 */ fsquare(t0,z2_50_0); + /* 2^52 - 2^2 */ fsquare(t1,t0); + /* 2^100 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^100 - 2^0 */ fmul(z2_100_0,t1,z2_50_0); + + /* 2^101 - 2^1 */ fsquare(t1,z2_100_0); + /* 2^102 - 2^2 */ fsquare(t0,t1); + /* 2^200 - 2^100 */ for (i = 2;i < 100;i += 2) { fsquare(t1,t0); fsquare(t0,t1); } + /* 2^200 - 2^0 */ fmul(t1,t0,z2_100_0); + + /* 2^201 - 2^1 */ fsquare(t0,t1); + /* 2^202 - 2^2 */ fsquare(t1,t0); + /* 2^250 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); } + /* 2^250 - 2^0 */ 
fmul(t0,t1,z2_50_0); + + /* 2^251 - 2^1 */ fsquare(t1,t0); + /* 2^252 - 2^2 */ fsquare(t0,t1); + /* 2^253 - 2^3 */ fsquare(t1,t0); + /* 2^254 - 2^4 */ fsquare(t0,t1); + /* 2^255 - 2^5 */ fsquare(t1,t0); + /* 2^255 - 21 */ fmul(out,t1,z11); +} + +void curve25519_normalize(u8 *e) { + e[0] &= 248; + e[31] &= 127; + e[31] |= 64; +} + +void curve25519_donna_ref(uint8_t *mypublic, const uint8_t *secret, const uint8_t *basepoint) { + limb bp[10], x[10], z[11], zmone[10]; + uint8_t e[32]; + int i; + + for (i = 0; i < 32; ++i) e[i] = secret[i]; + e[0] &= 248; + e[31] &= 127; + e[31] |= 64; + + fexpand(bp, basepoint); + cmult(x, z, e, bp); + crecip(zmone, z); + fmul(z, x, zmone); + freduce_coefficients(z); + fcontract(mypublic, z); +} + diff --git a/crypto/curve25519-donna.h b/crypto/curve25519-donna.h new file mode 100644 index 0000000..6985273 --- /dev/null +++ b/crypto/curve25519-donna.h @@ -0,0 +1,17 @@ +#ifndef TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ +#define TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ + +#include "tunsafe_types.h" + +void curve25519_donna_ref(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint); +extern "C" void curve25519_donna_x64(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint); + +#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC) +#define curve25519_donna curve25519_donna_x64 +#else +#define curve25519_donna curve25519_donna_ref +#endif + +void curve25519_normalize(uint8 *e); + +#endif // TUNSAFE_CRYPTO_CURVE25519_DONNA_H_ \ No newline at end of file diff --git a/crypto/curve25519_x64_nasm.asm b/crypto/curve25519_x64_nasm.asm new file mode 100644 index 0000000..bdac7e6 --- /dev/null +++ b/crypto/curve25519_x64_nasm.asm @@ -0,0 +1,6825 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD + +section .text code align=64 + + +global curve25519_donna_x64 + +# donna function. 
+# linux arguments: RDI, RSI, RDX +# windows arguments: RCX, RDX, R8 +curve25519_donna_x64: +$L$FB13: + push r15 + push r14 + xor r15d,r15d + push r13 + push r12 + push rbp + push rbx + push rsi + push rdi + + mov rdi, rcx + mov rsi, rdx + mov rdx, r8 + + xor r8d,r8d + xor r11d,r11d + xor ebp,ebp + xor r9d,r9d + xor r13d,r13d + sub rsp,784 + + mov rcx,QWORD[6+rdx] + mov r10,QWORD[rdx] + movdqu xmm0,XMMWORD[rsi] + lea r14,[488+rsp] + mov QWORD[352+rsp],rdi + movaps XMMWORD[360+rsp],xmm0 + shr rcx,3 + and BYTE[360+rsp],-8 + mov rbx,rcx + mov rcx,QWORD[12+rdx] + movdqu xmm0,XMMWORD[16+rsi] + shr rcx,6 + movaps XMMWORD[376+rsp],xmm0 + movzx eax,BYTE[391+rsp] + and eax,127 + or eax,64 + mov BYTE[391+rsp],al + mov rax,2251799813685247 + and rcx,rax + and rbx,rax + and r10,rax + mov rdi,rcx + mov QWORD[184+rsp],rcx + mov rcx,QWORD[19+rdx] + mov rdx,QWORD[24+rdx] + mov QWORD[24+rsp],r10 + mov QWORD[120+rsp],rbx + shr rcx,1 + shr rdx,12 + and rcx,rax + mov rsi,rdx + mov r12,rcx + mov QWORD[264+rsp],rcx + and rsi,rax + lea rdx,[rsi*8+rsi] + mov QWORD[320+rsp],rsi + mov QWORD[((-120))+rsp],rsi + lea rdx,[rdx*2+rsi] + mov rsi,r14 + mov r14,r15 + mov QWORD[192+rsp],rdx + lea rdx,[rbx*8+rbx] + lea rdx,[rdx*2+rbx] + mov QWORD[328+rsp],rdx + lea rdx,[rdi*8+rdi] + lea rdx,[rdx*2+rdi] + mov QWORD[336+rsp],rdx + lea rdx,[rcx*8+rcx] + lea rdx,[rdx*2+rcx] + lea rcx,[728+rsp] + mov QWORD[200+rsp],rdx + lea rdx,[391+rsp] + mov QWORD[344+rsp],rdx + mov QWORD[((-24))+rsp],rdi + lea rdx,[536+rsp] + mov QWORD[88+rsp],rcx + lea rcx,[680+rsp] + mov QWORD[40+rsp],rbx + mov QWORD[((-88))+rsp],r10 + mov ebx,1 + xor r10d,r10d + mov QWORD[232+rsp],rcx + lea rcx,[632+rsp] + mov QWORD[((-104))+rsp],rbx + mov r15,QWORD[40+rsp] + xor edi,edi + mov QWORD[40+rsp],r12 + mov QWORD[80+rsp],rcx + lea rcx,[584+rsp] + mov QWORD[((-32))+rsp],0 + mov QWORD[((-56))+rsp],0 + mov QWORD[72+rsp],1 + mov rbx,r8 + mov QWORD[104+rsp],rcx + lea rcx,[440+rsp] + mov QWORD[8+rsp],rdx + mov QWORD[56+rsp],r10 + mov r12,r11 + mov QWORD[((-8))+rsp],rcx + lea rcx,[392+rsp] + mov QWORD[((-72))+rsp],rcx + mov rcx,rax + mov rax,QWORD[344+rsp] + + +$L$3: + movzx eax,BYTE[rax] + mov rdx,QWORD[((-8))+rsp] + mov QWORD[240+rsp],rsi + mov DWORD[316+rsp],8 + mov rsi,r15 + mov r15,QWORD[72+rsp] + mov BYTE[315+rsp],al + mov rax,QWORD[80+rsp] + mov QWORD[80+rsp],rdx + mov rdx,r9 + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[8+rsp] + mov QWORD[256+rsp],rax + mov rax,QWORD[((-72))+rsp] + mov QWORD[((-72))+rsp],rbp + mov QWORD[248+rsp],rax + mov rax,r8 + jmp NEAR $L$2 + + +$L$10: + mov r9,r8 + mov r8,QWORD[80+rsp] + mov QWORD[80+rsp],r9 + mov r9,QWORD[88+rsp] + mov QWORD[((-8))+rsp],r8 + mov r8,QWORD[256+rsp] + mov QWORD[256+rsp],r9 + mov r9,QWORD[104+rsp] + mov QWORD[88+rsp],r8 + mov r8,QWORD[248+rsp] + mov QWORD[248+rsp],r9 + mov r9,QWORD[232+rsp] + mov QWORD[104+rsp],r8 + mov r8,QWORD[240+rsp] + mov QWORD[240+rsp],r9 + mov QWORD[232+rsp],r8 +$L$2: + movzx r8d,BYTE[315+rsp] + mov QWORD[208+rsp],rcx + mov rcx,QWORD[((-88))+rsp] + shr r8b,7 + mov r9,rcx + movzx r8d,r8b + xor r9,r15 + neg r8 + and r9,r8 + mov rbp,r8 + xor r15,r9 + xor r9,rcx + mov rcx,rbp + mov QWORD[((-88))+rsp],r15 + mov r15,QWORD[((-56))+rsp] + mov QWORD[160+rsp],r9 + mov QWORD[128+rsp],rcx + mov r9,r15 + xor r9,rsi + mov r8,r9 + and r8,rbp + xor r15,r8 + mov r9,r15 + mov r15,r8 + xor r15,rsi + mov QWORD[72+rsp],r15 + mov r15,QWORD[((-24))+rsp] + mov rsi,r15 + xor rsi,r10 + and rsi,rbp + mov rbp,QWORD[((-72))+rsp] + xor r10,rsi + xor rsi,r15 + mov r15,QWORD[40+rsp] + mov QWORD[8+rsp],rsi + mov rsi,r15 
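
; Note: this x64 path represents field elements in radix 2^51 (five 64-bit limbs,
; masked with 2251799813685247 = 2^51-1 above), rather than the ten 25.5-bit limbs
; of the portable C path. The movzx/shr/neg/and/xor sequence around label $L$2 is
; the branch-free conditional swap keyed on the current scalar bit -- the 64-bit
; analogue of swap_conditional() in curve25519-donna.cpp.
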
+ xor rsi,rbp + and rsi,rcx + xor rbp,rsi + xor rsi,r15 + mov r15,QWORD[((-32))+rsp] + mov QWORD[40+rsp],rsi + mov rsi,QWORD[((-120))+rsp] + xor rsi,r15 + mov r8,rsi + and r8,rcx + xor r15,r8 + mov rsi,r15 + mov r15,QWORD[((-120))+rsp] + xor r15,r8 + mov r8,QWORD[((-104))+rsp] + mov QWORD[152+rsp],r15 + mov r15,QWORD[((-104))+rsp] + xor r8,r12 + and r8,rcx + xor r15,r8 + xor r12,r8 + mov r8,r11 + mov QWORD[((-32))+rsp],r15 + mov r15,QWORD[56+rsp] + xor r8,rdx + and r8,rcx + xor r11,r8 + xor r8,rdx + mov rdx,r15 + mov QWORD[((-72))+rsp],r8 + mov r8,r14 + xor rdx,rdi + and rdx,rcx + xor rdi,rdx + xor r8,r13 + xor rdx,r15 + and r8,rcx + mov r15,QWORD[((-88))+rsp] + xor r14,r8 + xor r13,r8 + mov r8,rbx + xor r8,rax + and r8,rcx + mov rcx,18014398509481832 + xor rbx,r8 + xor r8,rax + lea rax,[r15*1+r12] + mov r15,QWORD[240+rsp] + mov QWORD[((-56))+rsp],rax + mov QWORD[r15],rax + lea rax,[r9*1+r11] + mov QWORD[8+r15],rax + mov QWORD[((-120))+rsp],rax + lea rax,[r10*1+rdi] + mov QWORD[16+r15],rax + mov QWORD[136+rsp],rax + lea rax,[rbp*1+r14] + mov QWORD[24+r15],rax + mov QWORD[144+rsp],rax + lea rax,[rsi*1+rbx] + mov QWORD[((-104))+rsp],rax + mov QWORD[32+r15],rax + mov r15,QWORD[((-88))+rsp] + mov rax,QWORD[256+rsp] + add r15,rcx + sub r15,r12 + mov r12,18014398509481976 + mov QWORD[rax],r15 + mov QWORD[((-24))+rsp],r15 + mov r15,r12 + add r9,r12 + add r10,r15 + add rbp,r15 + sub r10,rdi + mov r12,r9 + mov rdi,r15 + mov r15,rbp + sub r12,r11 + mov r11,rax + sub r15,r14 + mov QWORD[8+rax],r12 + mov QWORD[16+r11],r10 + mov QWORD[24+r11],r15 + mov r11,rdi + mov r14,r15 + add r11,rsi + mov r15,QWORD[((-32))+rsp] + mov QWORD[((-88))+rsp],r10 + sub r11,rbx + mov r10,QWORD[((-72))+rsp] + mov rbx,QWORD[8+rsp] + add r10,QWORD[72+rsp] + mov QWORD[32+rax],r11 + mov rsi,QWORD[40+rsp] + mov rax,QWORD[160+rsp] + mov rdi,r15 + mov r9,QWORD[152+rsp] + mov rbp,QWORD[248+rsp] + add rbx,rdx + add rdi,rax + add rsi,r13 + add rcx,rax + add r9,r8 + mov QWORD[rbp],rdi + mov QWORD[8+rbp],r10 + mov QWORD[16+rbp],rbx + mov QWORD[24+rbp],rsi + mov QWORD[32+rbp],r9 + mov rbp,rcx + mov rcx,18014398509481976 + add rcx,QWORD[72+rsp] + sub rbp,r15 + mov rax,QWORD[80+rsp] + mov QWORD[((-32))+rsp],rbp + mov QWORD[rax],rbp + mov r15,rcx + mov rcx,18014398509481976 + add rcx,QWORD[8+rsp] + sub r15,QWORD[((-72))+rsp] + mov rbp,rcx + mov rcx,18014398509481976 + sub rbp,rdx + mov QWORD[8+rax],r15 + lea rax,[r11*8+r11] + mov rdx,rbp + mov rbp,QWORD[80+rsp] + mov QWORD[72+rsp],rdx + mov QWORD[16+rbp],rdx + add rcx,QWORD[40+rsp] + mov QWORD[((-72))+rsp],r14 + mov rdx,rcx + mov rcx,18014398509481976 + sub rdx,r13 + add rcx,QWORD[152+rsp] + mov QWORD[152+rsp],r11 + mov r13,rdx + mov rdx,rbp + mov QWORD[24+rbp],r13 + mov QWORD[56+rsp],r13 + mov rbp,rcx + sub rbp,r8 + lea r8,[rax*2+r11] + mov QWORD[32+rdx],rbp + mov QWORD[160+rsp],rbp + mov rbp,QWORD[((-88))+rsp] + mov r13,r8 + mov r8,r12 + mov QWORD[8+rsp],r13 + lea rax,[rbp*8+rbp] + lea rdx,[rax*2+rbp] + lea rax,[r14*8+r14] + lea r14,[rax*2+r14] + lea rax,[r12*8+r12] + mov rcx,rdx + mov QWORD[224+rsp],rcx + lea r11,[rax*2+r12] + mov QWORD[168+rsp],r14 + mov rax,r11 + mul r9 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul r10 + mov r13,QWORD[((-24))+rsp] + add r11,rax + mov rax,r13 + adc r12,rdx + mul rdi + add r11,rax + mov rax,r14 + adc r12,rdx + mul rbx + add r11,rax + mov rax,rcx + mov rcx,QWORD[208+rsp] + adc r12,rdx + mul rsi + add r11,rax + mov rax,r13 + adc r12,rdx + mov rdx,r11 + and rdx,rcx + mov QWORD[208+rsp],rdx + mul r10 + mov r13,rax + mov rax,r8 + mov r14,rdx + mul rdi + 
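
; Note: the lea pairs above (lea t,[x*8+x]; lea t,[t*2+x]) compute 19*x without a
; multiply, since (8x+x)*2 + x = 19x. Because 2^255 = 19 (mod 2^255-19), product
; limbs that carry past the fifth limb are folded back into the low limbs
; pre-multiplied by 19; each 128-bit rdx:rax accumulator is then split with
; shrd ...,51 (carry out) and masked with 2^51-1 (limb value).
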
add r13,rax + mov rax,QWORD[8+rsp] + adc r14,rdx + mul rbx + add r13,rax + mov rax,QWORD[224+rsp] + adc r14,rdx + mul r9 + add r13,rax + mov rax,QWORD[168+rsp] + adc r14,rdx + mul rsi + add rax,r13 + mov r13,r12 + mov r12,r11 + adc rdx,r14 + shrd r12,r13,51 + shr r13,51 + mov r14,r13 + mov r13,r12 + add r13,rax + mov rax,QWORD[((-24))+rsp] + adc r14,rdx + mov r12,r13 + and r12,rcx + mul rbx + mov QWORD[216+rsp],r12 + mov r11,rax + mov rax,rbp + mov r12,rdx + mul rdi + mov rbp,r8 + mov QWORD[40+rsp],rbp + add r11,rax + mov rax,r8 + adc r12,rdx + mul r10 + add r11,rax + mov rax,QWORD[8+rsp] + adc r12,rdx + mul rsi + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul r9 + add rax,r11 + adc rdx,r12 + mov r12,r13 + mov r13,r14 + shrd r12,r14,51 + shr r13,51 + mov r11,r12 + mov r12,r13 + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mov r8,r11 + and r8,rcx + mul rsi + mov r13,rax + mov rax,QWORD[((-72))+rsp] + mov r14,rdx + mul rdi + add r13,rax + mov rax,QWORD[((-88))+rsp] + adc r14,rdx + mul r10 + add r13,rax + mov rax,rbp + adc r14,rdx + mul rbx + add r13,rax + mov rax,QWORD[8+rsp] + adc r14,rdx + mul r9 + add rax,r13 + mov r13,r12 + mov r12,r11 + adc rdx,r14 + shrd r12,r13,51 + shr r13,51 + mov r14,r13 + mov r13,r12 + add r13,rax + mov rax,r9 + adc r14,rdx + mov r12,r13 + mov r9,r13 + mul QWORD[((-24))+rsp] + and r12,rcx + mov QWORD[272+rsp],r12 + mov r11,rax + mov r12,rdx + mov rax,rdi + mul QWORD[152+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,rsi + add rdi,r11 + adc rbp,r12 + mul QWORD[40+rsp] + add rdi,rax + mov rax,r10 + mov r10,r14 + adc rbp,rdx + mul QWORD[((-72))+rsp] + add rdi,rax + mov rax,rbx + adc rbp,rdx + mul QWORD[((-88))+rsp] + mov r13,QWORD[56+rsp] + mov rbx,QWORD[136+rsp] + add rdi,rax + adc rbp,rdx + shr r10,51 + shrd r9,r14,51 + mov rdx,r10 + mov r10,QWORD[((-104))+rsp] + mov r14,QWORD[((-32))+rsp] + mov rax,r9 + add rax,rdi + mov rdi,rax + adc rdx,rbp + mov rbp,QWORD[144+rsp] + and rdi,rcx + mov QWORD[280+rsp],rdi + mov rdi,rax + shrd rdi,rdx,51 + mov rdx,QWORD[72+rsp] + lea rax,[rdi*8+rdi] + lea rax,[rax*2+rdi] + add rax,QWORD[208+rsp] + mov r9,rax + shr rax,51 + add rax,QWORD[216+rsp] + and r9,rcx + mov QWORD[288+rsp],r9 + mov r9,QWORD[((-120))+rsp] + mov rdi,rax + shr rax,51 + lea rsi,[r8*1+rax] + mov r8,QWORD[160+rsp] + and rdi,rcx + mov QWORD[208+rsp],rdi + mov QWORD[216+rsp],rsi + lea rax,[r8*8+r8] + lea rsi,[rax*2+r8] + lea rax,[rdx*8+rdx] + lea r8,[rax*2+rdx] + lea rax,[r13*8+r13] + lea rdi,[rax*2+r13] + lea rax,[r15*8+r15] + lea r11,[rax*2+r15] + mov rax,r11 + mul r10 + mov r11,rax + mov rax,r9 + mov r12,rdx + mul rsi + add r11,rax + mov rax,r14 + adc r12,rdx + mul QWORD[((-56))+rsp] + add r11,rax + mov rax,rbx + adc r12,rdx + mul rdi + add r11,rax + mov rax,rbp + adc r12,rdx + mul r8 + add r11,rax + mov rax,r11 + adc r12,rdx + and rax,rcx + mov QWORD[296+rsp],rax + mov rax,r14 + mul r9 + mov r9,r11 + mov r13,rax + mov rax,QWORD[((-56))+rsp] + mov r14,rdx + mul r15 + add r13,rax + mov rax,rbx + adc r14,rdx + mul rsi + add r13,rax + mov rax,r8 + mov r8,rbp + adc r14,rdx + mul r10 + mov r10,r12 + add r13,rax + mov rax,rbp + adc r14,rdx + mul rdi + add rax,r13 + adc rdx,r14 + shr r10,51 + mov r14,rbx + shrd r9,r12,51 + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mov r13,r9 + and r13,rcx + mul rbx + mov r11,rax + mov r12,rdx + mov rax,QWORD[((-56))+rsp] + mul QWORD[72+rsp] + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + mul r15 + add r11,rax + mov rax,rbp + adc r12,rdx + mul rsi + add r11,rax + mov rax,rdi + adc r12,rdx + mul 
QWORD[((-104))+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,r9 + add rdi,r11 + mov rdx,r10 + adc rbp,r12 + shr rdx,51 + shrd rax,r10,51 + mov r12,rdx + mov r11,rax + mov rax,QWORD[((-32))+rsp] + add r11,rdi + mov rdx,r11 + adc r12,rbp + mov rbp,r8 + and rdx,rcx + mov rdi,r12 + mov rbx,rdx + mul r8 + mov r9,rax + mov r10,rdx + mov rax,QWORD[((-56))+rsp] + mul QWORD[56+rsp] + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul QWORD[72+rsp] + add r9,rax + mov rax,r14 + adc r10,rdx + mul r15 + add r9,rax + mov rax,rsi + mov rsi,r11 + adc r10,rdx + mul QWORD[((-104))+rsp] + add rax,r9 + adc rdx,r10 + shr rdi,51 + shrd rsi,r12,51 + add rsi,rax + mov rax,QWORD[((-32))+rsp] + adc rdi,rdx + mov rdx,rsi + and rdx,rcx + mov r8,rdx + mul QWORD[((-104))+rsp] + mov r9,rax + mov r10,rdx + mov rax,QWORD[160+rsp] + mul QWORD[((-56))+rsp] + mov r12,QWORD[272+rsp] + add r9,rax + mov rax,r15 + mov r15,QWORD[288+rsp] + adc r10,rdx + mul rbp + add r9,rax + mov rax,QWORD[56+rsp] + adc r10,rdx + mul QWORD[((-120))+rsp] + add r9,rax + mov rax,QWORD[72+rsp] + adc r10,rdx + mul r14 + mov r14,QWORD[280+rsp] + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + mov r9,rsi + mov r10,rdi + lea rdi,[r8*1+r12] + add r9,rax + adc r10,rdx + mov rdx,r9 + shrd r9,r10,51 + and rdx,rcx + lea rax,[r9*8+r9] + lea rbp,[rdx*1+r14] + lea rax,[rax*2+r9] + add rax,QWORD[296+rsp] + mov r10,rax + shr rax,51 + add r13,rax + and r10,rcx + mov rax,r13 + shr r13,51 + lea r11,[r10*1+r15] + and rax,rcx + add r13,rbx + mov r9,rax + mov rax,QWORD[208+rsp] + lea rbx,[r9*1+rax] + mov rax,QWORD[216+rsp] + lea rsi,[r13*1+rax] + mov rax,18014398509481832 + add rax,r15 + mov r15,18014398509481976 + sub rax,r10 + add r15,r12 + mov QWORD[72+rsp],rax + mov rax,18014398509481976 + add rax,QWORD[208+rsp] + sub r15,r8 + sub rax,r9 + mov QWORD[56+rsp],rax + mov rax,18014398509481976 + add rax,QWORD[216+rsp] + sub rax,r13 + mov r13,rax + mov rax,18014398509481976 + add rax,r14 + lea r14,[r11*1+r11] + sub rax,rdx + lea rdx,[rbx*1+rbx] + mov QWORD[((-32))+rsp],rax + lea rax,[rbp*8+rbp] + mov QWORD[160+rsp],rdx + lea rdx,[rax*2+rbp] + mov rax,r11 + lea r8,[rdx*1+rdx] + mov QWORD[272+rsp],rdx + mul r11 + mov r11,rax + mov rax,r8 + mov r12,rdx + mul rbx + add r11,rax + lea rax,[rsi*8+rsi] + adc r12,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add r11,rax + lea rax,[rdi*8+rdi] + mov r9,r11 + adc r12,rdx + and r9,rcx + mov QWORD[208+rsp],r9 + lea r9,[rax*2+rdi] + mov rax,r9 + mul rdi + mov r9,rax + mov rax,rbx + mov r10,rdx + mul r14 + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,rbx + adc r10,rdx + mov r12,r9 + mul rbx + and r12,rcx + mov rbx,QWORD[160+rsp] + mov QWORD[216+rsp],r12 + mov r11,rax + mov rax,r8 + mov r12,rdx + mul rdi + add r11,rax + mov rax,r14 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,rdi + adc r10,rdx + mov r8,r9 + mul r14 + and r8,rcx + mov r11,rax + mov rax,QWORD[272+rsp] + mov r12,rdx + mul rbp + add r11,rax + mov rax,rbx + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,rbx + adc r12,rdx + mov QWORD[160+rsp],r11 + mul rdi + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r14 + add rsi,r9 + adc rdi,r10 + mov r10,QWORD[72+rsp] + mul rbp + mov rbp,QWORD[56+rsp] + add rsi,rax + adc rdi,rdx + mov rax,rsi + shrd r11,r12,51 + mov rdx,rdi + 
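+ ; Layout note: field elements mod p = 2^255-19 are held as five 51-bit
+ ; limbs. rcx holds the limb mask 2^51-1 = 2251799813685247; each 64x64
+ ; mul leaves a 128-bit product in rdx:rax, shrd/shr by 51 carries into
+ ; the next limb, and "and ...,rcx" keeps the low 51 bits.
+ ; The magic constants above are a limb-wise encoding of 8*p:
+ ; 18014398509481832 = 8*(2^51-19) and 18014398509481976 = 8*(2^51-1).
+ ; Differences are computed as x + 8p - y so no limb ever underflows
+ ; (branch-free, constant time).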
shr r12,51 + add rax,r11 + adc rdx,r12 + mov QWORD[272+rsp],rax + shrd rax,rdx,51 + mov rsi,rax + lea rax,[rax*8+rax] + lea rdx,[rax*2+rsi] + add rdx,QWORD[208+rsp] + mov rax,QWORD[216+rsp] + lea rsi,[r10*1+r10] + mov QWORD[208+rsp],rdx + shr rdx,51 + add rax,rdx + mov QWORD[216+rsp],rax + shr rax,51 + lea rbx,[r8*1+rax] + mov r8,QWORD[((-32))+rsp] + mov QWORD[280+rsp],rbx + lea rbx,[rbp*1+rbp] + lea rax,[r8*8+r8] + lea r8,[rax*2+r8] + mov rax,r10 + mul r10 + lea rdi,[r8*1+r8] + mov r11,rax + mov rax,rbp + mov r12,rdx + mul rdi + add r11,rax + lea rax,[r13*8+r13] + adc r12,rdx + lea rax,[rax*2+r13] + add rax,rax + mul r15 + add r11,rax + lea rax,[r15*8+r15] + adc r12,rdx + mov r10,r11 + lea r9,[rax*2+r15] + and r10,rcx + mov QWORD[72+rsp],r10 + mov rax,r9 + mul r15 + mov r9,rax + mov rax,rbp + mov r10,rdx + mul rsi + add r9,rax + mov rax,rdi + adc r10,rdx + mul r13 + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,rbp + adc r10,rdx + mov r12,r9 + mul rbp + and r12,rcx + mov r14,r12 + mov r11,rax + mov rax,rdi + mov r12,rdx + mul r15 + add r11,rax + mov rax,rsi + adc r12,rdx + mul r13 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,r15 + adc r12,rdx + mov r9,r11 + mul rsi + and r9,rcx + mov QWORD[56+rsp],r9 + mov r9,rax + mov rax,r8 + mov r8,QWORD[((-32))+rsp] + mov r10,rdx + mul r8 + add r9,rax + mov rax,r13 + adc r10,rdx + mul rbx + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,r15 + adc r10,rdx + mov r12,r9 + mul rbx + and r12,rcx + mov rbp,r12 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul r13 + add r11,rax + mov rax,r8 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + mov r12,QWORD[336+rsp] + shrd r9,r10,51 + shr r10,51 + add rax,r9 + mov r9,QWORD[24+rsp] + adc rdx,r10 + mov rbx,rax + mov r10,QWORD[192+rsp] + shrd rax,rdx,51 + and rbx,rcx + lea rdx,[rax*8+rax] + lea rsi,[rdx*2+rax] + add rsi,QWORD[72+rsp] + mov rax,QWORD[328+rsp] + mul rbx + mov r13,rsi + shr rsi,51 + add rsi,r14 + and r13,rcx + mov r8,r13 + mov r13,rsi + shr rsi,51 + and r13,rcx + mov r14,rdx + add rsi,QWORD[56+rsp] + mov rdi,r13 + mov r13,rax + mov rax,r12 + mul rbp + add r13,rax + mov rax,r9 + adc r14,rdx + mul r8 + add r13,rax + mov rax,r10 + adc r14,rdx + mul rdi + add r13,rax + mov rax,QWORD[200+rsp] + adc r14,rdx + mul rsi + add r13,rax + mov rax,r12 + adc r14,rdx + mov r15,r13 + mul rbx + and r15,rcx + mov r11,rax + mov rax,QWORD[200+rsp] + mov r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[120+rsp] + adc r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul rdi + add r11,rax + mov rax,r10 + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r13,r14,51 + shr r14,51 + mov r11,r13 + mov r12,r14 + add r11,rax + mov rax,r10 + adc r12,rdx + mov r13,r11 + mul rbp + and r13,rcx + mov r14,r13 + mov r9,rax + mov rax,QWORD[200+rsp] + mov r10,rdx + mul rbx + add r9,rax + mov rax,QWORD[184+rsp] + adc r10,rdx + mul r8 + add r9,rax + mov rax,QWORD[120+rsp] + adc r10,rdx + mul rdi + add r9,rax + mov rax,QWORD[24+rsp] + adc r10,rdx + mul rsi + add rax,r9 + mov r9,r11 + adc rdx,r10 + mov r10,r12 + shrd r9,r12,51 + shr r10,51 + add r9,rax + mov rax,QWORD[24+rsp] + adc r10,rdx + mov r13,r9 + and r13,rcx + mul rbp + mov r11,rax + mov rax,QWORD[192+rsp] + mov r12,rdx + mul rbx + add r11,rax + mov rax,QWORD[264+rsp] + adc r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[184+rsp] + adc r12,rdx + mul rdi + add r11,rax + mov 
rax,QWORD[120+rsp] + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,rbx + adc r12,rdx + mov QWORD[72+rsp],r11 + mul QWORD[24+rsp] + mov r9,rax + mov r10,rdx + mov rax,rbp + mul QWORD[120+rsp] + add r9,rax + mov rax,r8 + mov r8,r11 + adc r10,rdx + mul QWORD[320+rsp] + add r9,rax + mov rax,rdi + adc r10,rdx + mul QWORD[264+rsp] + mov rdi,rax + mov rbp,rdx + mov rax,rsi + add rdi,r9 + mov r9,r12 + adc rbp,r10 + mov r10,QWORD[((-56))+rsp] + mul QWORD[184+rsp] + add rdi,rax + adc rbp,rdx + shr r9,51 + shrd r8,r12,51 + mov rdx,r9 + mov r12,QWORD[((-120))+rsp] + mov rax,r8 + add rax,rdi + mov rdi,QWORD[((-104))+rsp] + adc rdx,rbp + mov r8,rax + mov QWORD[288+rsp],rax + shrd r8,rdx,51 + lea rbp,[r12*1+r12] + lea rax,[r8*8+r8] + lea r8,[rax*2+r8] + lea rax,[rdi*8+rdi] + add r8,r15 + lea r11,[rax*2+rdi] + mov rbx,r8 + mov QWORD[56+rsp],r8 + shr rbx,51 + lea r8,[r14*1+rbx] + lea rbx,[r10*1+r10] + mov QWORD[296+rsp],r8 + shr r8,51 + lea r15,[r13*1+r8] + lea r8,[r11*1+r11] + mov QWORD[304+rsp],r15 + mov r15,QWORD[136+rsp] + mov r13,QWORD[144+rsp] + lea rax,[r15*8+r15] + lea rsi,[rax*2+r15] + add rsi,rsi + mov rax,rsi + mul r13 + mov rsi,rax + mov rax,r10 + mov rdi,rdx + mul r10 + add rsi,rax + mov rax,r12 + adc rdi,rdx + mul r8 + add rsi,rax + lea rax,[r13*8+r13] + adc rdi,rdx + mov r14,rsi + lea r9,[rax*2+r13] + and r14,rcx + mov rax,r9 + mul r13 + mov r9,rax + mov rax,r12 + mov r10,rdx + mul rbx + add r9,rax + mov rax,r15 + adc r10,rdx + mul r8 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,r15 + adc rdi,rdx + mov r10,rsi + mov QWORD[((-56))+rsp],rsi + mul rbx + mov QWORD[((-48))+rsp],rdi + mov rdi,QWORD[((-120))+rsp] + and r10,rcx + mov rsi,QWORD[((-56))+rsp] + mov r12,r10 + mov r9,rax + mov rax,rdi + mov r10,rdx + mul rdi + mov rdi,QWORD[((-48))+rsp] + add r9,rax + mov rax,r8 + mov r8,r13 + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,r8 + adc rdi,rdx + mov r10,rsi + mov QWORD[((-120))+rsp],rsi + mul rbx + and r10,rcx + mov rsi,QWORD[((-120))+rsp] + mov r13,r10 + mov QWORD[((-112))+rsp],rdi + mov rdi,QWORD[((-112))+rsp] + mov r9,rax + mov rax,r15 + mov r10,rdx + mul rbp + add r9,rax + mov rax,r11 + mov r11,QWORD[((-104))+rsp] + adc r10,rdx + mul r11 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + mov rax,rsi + adc rdi,rdx + and rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,r11 + mov r11,QWORD[((-88))+rsp] + mul rbx + mov rbx,QWORD[((-24))+rsp] + mov r9,rax + mov rax,r8 + mov r10,rdx + mul rbp + mov rbp,QWORD[40+rsp] + add r9,rax + mov rax,r15 + adc r10,rdx + mul r15 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rax + adc rdi,rdx + mov r15,rsi + shrd rsi,rdi,51 + and r15,rcx + lea rax,[rsi*8+rsi] + lea rax,[rax*2+rsi] + lea rsi,[rbp*1+rbp] + add r14,rax + mov rdi,r14 + shr r14,51 + add r12,r14 + mov r14,QWORD[224+rsp] + and rdi,rcx + mov rdx,r12 + shr r12,51 + mov QWORD[((-104))+rsp],rdi + lea r9,[r13*1+r12] + mov r13,QWORD[8+rsp] + mov r12,QWORD[((-72))+rsp] + and rdx,rcx + lea rdi,[rbx*1+rbx] + mov QWORD[((-56))+rsp],rdx + mov QWORD[((-32))+rsp],r9 + lea r8,[r13*1+r13] + lea r13,[r14*1+r14] + mov rax,r13 + mul r12 + mov r13,rax + mov rax,rbx + mov r14,rdx + mul rbx + mov rbx,rbp + add r13,rax + mov rax,rbp + adc r14,rdx + mul r8 + add r13,rax + mov rax,rbx + adc r14,rdx + mov rbp,r13 + mul rdi + and rbp,rcx + mov r9,rax + mov rax,QWORD[168+rsp] + mov r10,rdx + mul r12 + add 
r9,rax + mov rax,r11 + adc r10,rdx + mul r8 + add rax,r9 + mov r9,QWORD[40+rsp] + adc rdx,r10 + shrd r13,r14,51 + shr r14,51 + add r13,rax + mov rax,r11 + adc r14,rdx + mov r10,r13 + mul rdi + and r10,rcx + mov rbx,r10 + mov r11,rax + mov rax,r9 + mov r12,rdx + mul r9 + mov r9,QWORD[((-72))+rsp] + add r11,rax + mov rax,r8 + adc r12,rdx + mul r9 + add rax,r11 + adc rdx,r12 + mov r12,r13 + mov r13,r14 + shrd r12,r14,51 + shr r13,51 + mov r11,r12 + mov r12,r13 + mov r13,QWORD[((-88))+rsp] + add r11,rax + mov rax,r9 + adc r12,rdx + mov r14,r11 + mul rdi + and r14,rcx + mov r8,r14 + mov r9,rax + mov rax,r13 + mov r10,rdx + mul rsi + add r9,rax + mov rax,QWORD[8+rsp] + adc r10,rdx + mul QWORD[152+rsp] + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + mov r9,r11 + mov r10,r12 + add r9,rax + mov rax,QWORD[152+rsp] + adc r10,rdx + mov r14,r9 + and r14,rcx + mul rdi + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r13 + add rsi,r11 + adc rdi,r12 + mul r13 + add rsi,rax + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add rsi,r9 + adc rdi,r10 + mov r12,rsi + shrd rsi,rdi,51 + and r12,rcx + lea rax,[rsi*8+rsi] + mov QWORD[((-88))+rsp],r12 + lea r13,[rax*2+rsi] + lea rax,[r12*8+r12] + add r13,rbp + lea rdi,[rax*2+r12] + mov rbp,r13 + shr r13,51 + add r13,rbx + and rbp,rcx + mov rbx,r13 + shr r13,51 + add r13,r8 + and rbx,rcx + lea rax,[r13*8+r13] + lea rsi,[rax*2+r13] + lea rax,[r14*8+r14] + lea r8,[rax*2+r14] + mov rax,QWORD[((-104))+rsp] + mov r11,QWORD[((-56))+rsp] + mul rbp + mov r9,rax + mov rax,r11 + mov r10,rdx + mul rdi + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul r8 + add r9,rax + lea rax,[rbx*8+rbx] + adc r10,rdx + lea rax,[rax*2+rbx] + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mov r12,r9 + mul rbp + and r12,rcx + mov QWORD[((-72))+rsp],r12 + mov r11,rax + mov rax,QWORD[((-120))+rsp] + mov r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[((-104))+rsp] + adc r12,rdx + mul rbx + add r11,rax + mov rax,QWORD[((-32))+rsp] + adc r12,rdx + mul rdi + add r11,rax + mov rax,rsi + mov rsi,QWORD[((-104))+rsp] + adc r12,rdx + mul r15 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + mov r11,r9 + mov r12,r10 + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + mov r10,r11 + and r10,rcx + mul rdi + mov QWORD[((-24))+rsp],r10 + mov r9,rax + mov rax,r8 + mov r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul rbp + add r9,rax + mov rax,QWORD[((-56))+rsp] + adc r10,rdx + mul rbx + add r9,rax + mov rax,rsi + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + add r11,rax + mov rax,rsi + adc r12,rdx + mov QWORD[8+rsp],r11 + and r11,rcx + mul r14 + mov r8,r11 + mov QWORD[16+rsp],r12 + mov r11,QWORD[8+rsp] + mov r12,QWORD[16+rsp] + mov r9,rax + mov rax,rdi + mov r10,rdx + mul r15 + mov rsi,rax + mov rax,QWORD[((-120))+rsp] + mov rdi,rdx + add rsi,r9 + adc rdi,r10 + mul rbp + add rsi,rax + mov rax,QWORD[((-32))+rsp] + adc rdi,rdx + mul rbx + add rsi,rax + mov rax,QWORD[((-56))+rsp] + adc rdi,rdx + mul r13 + add rsi,rax + mov rax,rbp + adc rdi,rdx + shrd r11,r12,51 + shr r12,51 + add rsi,r11 + mov r11,QWORD[((-104))+rsp] + adc rdi,r12 + mov r12,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rsi + mul r15 + mov r9,rax + mov r10,rdx + mov rax,r11 + mul QWORD[((-88))+rsp] + add r9,rax + mov rax,QWORD[((-56))+rsp] + adc r10,rdx + mul r14 + add r9,rax + mov rax,r12 + adc r10,rdx + mul rbx + add r9,rax + 
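+ ; Reduction idiom: 2^255 = 19 (mod p), so a carry c out of limb 4 is
+ ; folded back into limb 0 as 19*c. The lea pair computes it mul-free:
+ ; lea rax,[c*8+c] gives 9c, then lea [rax*2+c] gives 18c + c = 19c.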
mov rax,QWORD[((-32))+rsp] + adc r10,rdx + mul r13 + add rax,r9 + adc rdx,r10 + shrd rsi,rdi,51 + shr rdi,51 + add rax,rsi + adc rdx,rdi + mov rsi,rax + mov QWORD[136+rsp],rax + shrd rsi,rdx,51 + mov rdi,QWORD[((-72))+rsp] + lea rax,[rsi*8+rsi] + lea rax,[rax*2+rsi] + mov rsi,18014398509481832 + add rdi,rax + lea rax,[r11*1+rsi] + add rsi,144 + mov r9,rdi + add rsi,QWORD[((-56))+rsp] + mov QWORD[144+rsp],rdi + shr r9,51 + add r9,QWORD[((-24))+rsp] + sub rsi,rbx + mov r10,r9 + mov rbx,rsi + mov QWORD[152+rsp],r9 + shr r10,51 + mov rsi,18014398509481976 + add r10,r8 + mov r8,rax + mov QWORD[8+rsp],r10 + add rsi,QWORD[((-32))+rsp] + sub r8,rbp + mov rbp,r8 + mov r8,rsi + mov rsi,18014398509481976 + lea r11,[r12*1+rsi] + lea rax,[rsi*1+r15] + sub rax,QWORD[((-88))+rsp] + sub r8,r13 + mov QWORD[((-88))+rsp],rbp + mov r12,QWORD[((-120))+rsp] + mov r13,r8 + mov r8,r11 + sub r8,r14 + mov QWORD[((-72))+rsp],r13 + mov r14,r8 + mov r8,rax + mov eax,121665 + mul rbp + mov QWORD[((-24))+rsp],r14 + mov r11,rax + mov rsi,rax + mov eax,121665 + mov rdi,rdx + shrd rsi,rdx,51 + mul rbx + shr rdi,51 + add rsi,rax + mov eax,121665 + adc rdi,rdx + mov QWORD[168+rsp],rsi + mul r13 + mov QWORD[176+rsp],rdi + shrd rsi,rdi,51 + shr rdi,51 + mov r10,rdi + mov rdi,rax + mov rbp,rdx + mov eax,121665 + add rdi,rsi + adc rbp,r10 + mov r9,rdi + mul r14 + mov r10,rbp + shrd r9,rbp,51 + shr r10,51 + add r9,rax + mov eax,121665 + adc r10,rdx + mov r13,r9 + mul r8 + mov r14,r10 + shrd r13,r10,51 + shr r14,51 + add r13,rax + adc r14,rdx + mov rax,r13 + and r11,rcx + shrd rax,r14,51 + add r11,QWORD[((-104))+rsp] + and rdi,rcx + add rdi,QWORD[((-32))+rsp] + lea rdx,[rax*8+rax] + lea rax,[rdx*2+rax] + lea rsi,[rax*1+r11] + mov r11,r9 + mov r9,QWORD[((-88))+rsp] + and r11,rcx + lea rbp,[r12*1+r11] + mov r11,r13 + mov QWORD[((-104))+rsp],rsi + and r11,rcx + mov rsi,QWORD[168+rsp] + add r15,r11 + lea rax,[r15*8+r15] + and rsi,rcx + add rsi,QWORD[((-56))+rsp] + lea r14,[rax*2+r15] + lea rax,[rdi*8+rdi] + lea r11,[rax*2+rdi] + lea rax,[rbp*8+rbp] + lea r13,[rax*2+rbp] + lea rax,[rsi*8+rsi] + mov r10,r11 + mov QWORD[((-56))+rsp],r10 + lea r11,[rax*2+rsi] + mov rax,r11 + mul r8 + mov r11,rax + mov rax,QWORD[((-24))+rsp] + mov r12,rdx + mul r10 + add r11,rax + mov rax,QWORD[((-72))+rsp] + adc r12,rdx + mul r13 + add r11,rax + mov rax,r9 + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,rbx + adc r12,rdx + mul r14 + add r11,rax + mov rax,r11 + adc r12,rdx + and rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,r9 + mul rsi + mov r9,rax + mov rax,QWORD[((-56))+rsp] + mov r10,rdx + mul r8 + add r9,rax + mov rax,QWORD[((-24))+rsp] + adc r10,rdx + mul r13 + add r9,rax + mov rax,QWORD[((-104))+rsp] + adc r10,rdx + mul rbx + add r9,rax + mov rax,QWORD[((-72))+rsp] + adc r10,rdx + mul r14 + add rax,r9 + adc rdx,r10 + shrd r11,r12,51 + shr r12,51 + mov r9,r11 + mov r10,r12 + add r9,rax + mov rax,QWORD[((-88))+rsp] + adc r10,rdx + mov r12,r9 + and r12,rcx + mul rdi + mov QWORD[((-32))+rsp],r12 + mov r11,rax + mov rax,rbx + mov r12,rdx + mul rsi + add r11,rax + mov rax,r13 + adc r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[((-72))+rsp] + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mul r14 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,rbx + adc r10,rdx + mov r11,r9 + mul rdi + and r11,rcx + mov QWORD[((-48))+rsp],r10 + mov r13,r11 + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rsi + add r11,rax + mov rax,QWORD[((-88))+rsp] + adc 
r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-24))+rsp] + adc r12,rdx + mul QWORD[((-104))+rsp] + add r11,rax + mov rax,r14 + adc r12,rdx + mul r8 + add rax,r11 + adc rdx,r12 + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,QWORD[((-24))+rsp] + adc r10,rdx + mov r14,r9 + mul rsi + mov r11,rax + mov rax,QWORD[((-72))+rsp] + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,rbx + add rsi,r11 + adc rdi,r12 + mul rbp + mov rbp,QWORD[104+rsp] + add rsi,rax + mov rax,QWORD[((-88))+rsp] + adc rdi,rdx + mul r15 + mov r15,QWORD[144+rsp] + add rsi,rax + mov rax,QWORD[((-104))+rsp] + adc rdi,rdx + mul r8 + mov r8,QWORD[152+rsp] + add rsi,rax + adc rdi,rdx + mov rdx,QWORD[216+rsp] + shrd r9,r10,51 + shr r10,51 + add rsi,r9 + adc rdi,r10 + mov rbx,rsi + mov r10,QWORD[128+rsp] + shrd rsi,rdi,51 + lea rax,[rsi*8+rsi] + lea r12,[rax*2+rsi] + add r12,QWORD[((-120))+rsp] + mov rsi,QWORD[208+rsp] + mov rax,r10 + and rax,rcx + mov r9,rsi + and rsi,rcx + mov r11,r12 + xor r9,r15 + and r15,rcx + shr r11,51 + add r11,QWORD[((-32))+rsp] + and r9,rax + xor rsi,r9 + xor r15,r9 + mov QWORD[rbp],rsi + mov QWORD[((-88))+rsp],rsi + mov rsi,rdx + xor rsi,r8 + and r8,rcx + and rdx,rcx + mov rdi,r11 + and rsi,rax + mov r9,r8 + shr rdi,51 + xor r9,rsi + xor rsi,rdx + add rdi,r13 + mov r13,QWORD[232+rsp] + mov rdx,QWORD[280+rsp] + mov QWORD[((-56))+rsp],r9 + mov QWORD[8+rbp],rsi + mov QWORD[8+r13],r9 + mov r9,QWORD[8+rsp] + mov QWORD[r13],r15 + xor r9,rdx + mov r8,r9 + and r8,r10 + mov r10,QWORD[8+rsp] + mov r9,r8 + xor r9,rdx + xor r10,r8 + mov rdx,r9 + mov r8,QWORD[160+rsp] + mov QWORD[16+rbp],rdx + mov QWORD[((-24))+rsp],r9 + mov r9,rbp + mov rbp,QWORD[40+rsp] + mov QWORD[16+r13],r10 + mov rdx,r8 + and r8,rcx + xor rdx,rbp + and rbp,rcx + and rdx,rax + xor rbp,rdx + xor r8,rdx + mov rdx,r9 + mov QWORD[24+r13],rbp + mov QWORD[24+rdx],r8 + mov QWORD[((-72))+rsp],rbp + mov QWORD[40+rsp],r8 + mov r9,QWORD[272+rsp] + mov rbp,QWORD[136+rsp] + mov rdx,r9 + xor rdx,rbp + mov r8,rdx + mov rdx,rbp + mov rbp,QWORD[((-8))+rsp] + and r8,rax + and rdx,rcx + xor rdx,r8 + mov QWORD[32+r13],rdx + mov QWORD[((-32))+rsp],rdx + mov r13,r9 + mov rdx,QWORD[104+rsp] + mov r9,QWORD[56+rsp] + and r13,rcx + xor r13,r8 + mov r8,2251799813685247 + mov QWORD[((-120))+rsp],r13 + mov QWORD[32+rdx],r13 + mov rdx,r9 + and r8,r9 + xor rdx,r12 + mov r9,r8 + mov r13,2251799813685247 + and rdx,rax + and r12,r13 + mov r8,2251799813685247 + xor r9,rdx + xor r12,rdx + mov r13,QWORD[88+rsp] + mov QWORD[rbp],r9 + mov QWORD[((-104))+rsp],r9 + mov r9,QWORD[296+rsp] + mov QWORD[r13],r12 + mov rdx,r9 + xor rdx,r11 + and r11,r8 + and rdx,rax + xor r11,rdx + mov QWORD[8+r13],r11 + mov r13,r8 + and r13,r9 + mov r9,QWORD[88+rsp] + xor rdx,r13 + mov QWORD[8+rbp],rdx + mov rbp,QWORD[304+rsp] + mov r13,rbp + xor r13,rdi + and r13,QWORD[128+rsp] + mov r8,r13 + xor rdi,r13 + xor r8,rbp + mov rbp,QWORD[72+rsp] + mov QWORD[16+r9],rdi + mov r13,r8 + mov QWORD[56+rsp],r8 + mov r8,QWORD[((-8))+rsp] + mov QWORD[16+r8],r13 + mov r13,QWORD[72+rsp] + mov r8,2251799813685247 + and rbp,r8 + xor r13,r14 + and r14,r8 + mov r8,QWORD[((-8))+rsp] + and r13,rax + xor r14,r13 + xor r13,rbp + mov rbp,QWORD[288+rsp] + mov QWORD[24+r8],r13 + mov QWORD[24+r9],r14 + mov r8,rbp + xor r8,rbx + and r8,rax + mov rax,2251799813685247 + and rbx,rax + and rax,rbp + xor rbx,r8 + xor rax,r8 + mov r8,QWORD[((-8))+rsp] + mov QWORD[32+r9],rbx + mov QWORD[32+r8],rax + sal BYTE[315+rsp],1 + sub DWORD[316+rsp],1 + jne NEAR $L$10 + mov r9,rdx + mov rdx,QWORD[88+rsp] + mov QWORD[72+rsp],r15 + 
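+ ; End of a Montgomery-ladder step. 121665 = (486662-2)/4 is a24, the
+ ; ladder constant of Curve25519; the 32x64 muls above scale a field
+ ; element by it limb by limb. The xor/and/xor runs are constant-time
+ ; conditional swaps, t = (a ^ b) & mask; a ^= t; b ^= t, with mask
+ ; apparently 0 or all ones (within the 51-bit limb width) derived from
+ ; the current scalar bit, so control flow never depends on the secret
+ ; scalar. "sal BYTE[315+rsp],1" shifts the working scalar byte to expose
+ ; the next bit; $L$10 appears to iterate once per bit, and the outer
+ ; loop below ($L$3) once per scalar byte.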
mov r15,rsi + mov rsi,QWORD[104+rsp] + mov rbp,QWORD[((-72))+rsp] + sub QWORD[344+rsp],1 + mov r8,rax + mov QWORD[8+rsp],rdx + mov rdx,QWORD[248+rsp] + mov QWORD[((-72))+rsp],rsi + mov rsi,QWORD[232+rsp] + mov rax,QWORD[344+rsp] + mov QWORD[104+rsp],rdx + mov rdx,QWORD[240+rsp] + mov QWORD[232+rsp],rdx + mov rdx,QWORD[256+rsp] + mov QWORD[88+rsp],rdx + lea rdx,[359+rsp] + cmp rdx,rax + jne NEAR $L$3 + lea rax,[rbx*8+rbx] + mov QWORD[184+rsp],rbp + mov r8,rbx + mov r15,r14 + mov r13,r11 + mov r11,r12 + lea rbp,[rax*2+rbx] + lea rax,[rdi*8+rdi] + mov QWORD[168+rsp],r10 + mov r9,rdi + lea r14,[r12*1+r12] + lea r12,[r13*1+r13] + lea rax,[rax*2+rdi] + lea r10,[rbp*1+rbp] + lea rbx,[rax*1+rax] + mov QWORD[56+rsp],rax + mov rax,rbx + mul r15 + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r13 + mov rsi,rax + mov rdi,rdx + lea rax,[r15*8+r15] + add rsi,rcx + adc rdi,rbx + mov QWORD[((-120))+rsp],rsi + mov rbx,2251799813685247 + mov rcx,rdi + lea rdi,[rax*2+r15] + mov rax,r13 + mul r14 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rcx + mov rsi,QWORD[((-120))+rsp] + mov QWORD[8+rsp],rdi + mov rcx,rax + mov rax,rdi + mov QWORD[((-104))+rsp],rbx + mov rbx,rdx + mov rdi,QWORD[((-112))+rsp] + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rcx,2251799813685247 + adc rdi,rbx + and rcx,rsi + mul r9 + mov QWORD[((-88))+rsp],rcx + mov rcx,rax + mov rax,r13 + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r10 + mov r10,2251799813685247 + adc rbx,rdx + mul r15 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + mov rsi,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,r15 + and rsi,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-112))+rsp],rdi + mov rdi,QWORD[((-112))+rsp] + mov QWORD[((-72))+rsp],rsi + mov rsi,QWORD[((-120))+rsp] + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r14 + mov r14,2251799813685247 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + and r10,rsi + mul r8 + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r9 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov rbx,2251799813685247 + shrd rcx,rdi,51 + and r14,rsi + mov rsi,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-104))+rsp] + and rbx,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + mov QWORD[((-24))+rsp],rbx + lea r12,[rbx*1+rbx] + and rsi,rax + shr rax,51 + add rax,QWORD[((-72))+rsp] + lea rcx,[rsi*1+rsi] + mov QWORD[((-88))+rsp],rcx + mov QWORD[((-120))+rsp],rax + lea rax,[r14*8+r14] + lea rcx,[rax*2+r14] + mov rax,rsi + lea rdi,[rcx*1+rcx] + mov QWORD[80+rsp],rcx + mul rdi + mov rcx,rax + mov rax,QWORD[((-24))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + adc rbx,rdx + mov rdx,QWORD[((-120))+rsp] + lea rax,[rdx*8+rdx] + lea rdx,[rax*2+rdx] + lea rax,[rdx*1+rdx] + mov QWORD[88+rsp],rdx + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + mov rdx,rcx + lea rax,[rax*2+r10] + mov rcx,rbx + mov QWORD[((-104))+rsp],rdx + mov QWORD[((-96))+rsp],rcx + mov rbx,rax + mov rax,2251799813685247 + and rax,QWORD[((-104))+rsp] + mov QWORD[128+rsp],rbx + mov QWORD[24+rsp],rax + mov rax,rbx + mul r10 + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul rsi + add rcx,rax + mov 
rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rax,rdx,51 + shr rdx,51 + add rcx,rax + mov rax,rdi + mov rdi,QWORD[((-120))+rsp] + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rdx + mul r10 + mov QWORD[((-8))+rsp],rbx + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rax,rdx,51 + shr rdx,51 + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rdx + mul r14 + mov QWORD[((-72))+rsp],rbx + mov rcx,rax + mov rax,r12 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rdi + mov rdi,2251799813685247 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rdi,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rbx + mov rbx,QWORD[((-120))+rsp] + mov rax,rbx + mul rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rcx,rdx + mov rdx,rax + add rdx,QWORD[40+rsp] + adc rcx,QWORD[48+rsp] + mov rax,r12 + mov rbx,rcx + mov rcx,rdx + mul r14 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov QWORD[((-88))+rsp],rdx + mov rdx,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[24+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + mov r12,QWORD[((-72))+rsp] + and rdx,rax + shr rax,51 + add r12,rax + lea rax,[rcx*1+rcx] + mov QWORD[((-8))+rsp],rdx + mov QWORD[((-104))+rsp],rax + lea rax,[rdx*1+rdx] + mov rdx,QWORD[((-88))+rsp] + mov QWORD[24+rsp],rax + lea rax,[rdx*8+rdx] + lea rax,[rax*2+rdx] + mov QWORD[136+rsp],rax + add rax,rax + mov QWORD[((-72))+rsp],rax + mov rax,rcx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,QWORD[((-72))+rsp] + mul QWORD[((-8))+rsp] + add rcx,rax + lea rax,[r12*8+r12] + adc rbx,rdx + lea rax,[rax*2+r12] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[40+rsp],rax + lea rax,[rdi*8+rdi] + mov QWORD[48+rsp],rbx + and rcx,QWORD[40+rsp] + lea rbx,[rax*2+rdi] + mov rax,rbx + mul rdi + mov QWORD[104+rsp],rcx + mov rbx,rdx + mov rcx,rax + mov rax,QWORD[((-8))+rsp] + mul QWORD[((-104))+rsp] + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[40+rsp] + adc rdx,rbx + mov rbx,QWORD[48+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rax,rcx + mov rcx,2251799813685247 + mov QWORD[48+rsp],rbx + mov rbx,QWORD[((-8))+rsp] + mov QWORD[40+rsp],rax + and rcx,QWORD[40+rsp] + mov rax,rbx + mul rbx + mov QWORD[120+rsp],rcx + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[((-72))+rsp] + mov QWORD[rsp],rdx + mul rdi + add rax,QWORD[((-8))+rsp] + adc rdx,QWORD[rsp] + mov rcx,rax + mov rax,QWORD[((-104))+rsp] + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[40+rsp] + adc rdx,rbx + mov rbx,QWORD[48+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov QWORD[((-72))+rsp],rcx 
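+ ; The ladder loop has closed above ($L$3); from here the code appears
+ ; to compute Z^(p-2) = Z^(2^255-21) mod p, a Fermat inversion of the
+ ; ladder's Z via a fixed square-and-multiply chain, so the result can
+ ; be normalized to X/Z. Self-products such as "mov rax,rbx / mul rbx"
+ ; mark the squarings.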
+ mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rdx + mul rdi + mov QWORD[40+rsp],rbx + mov QWORD[((-8))+rsp],rax + mov QWORD[rsp],rdx + mov rax,QWORD[136+rsp] + mul QWORD[((-88))+rsp] + add rax,QWORD[((-8))+rsp] + adc rdx,QWORD[rsp] + mov rcx,rax + mov rax,QWORD[24+rsp] + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,QWORD[((-72))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-64))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + mov QWORD[((-72))+rsp],rcx + and rax,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rbx + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[24+rsp] + mul rdi + mov QWORD[24+rsp],rax + mov rax,r12 + mov QWORD[32+rsp],rdx + mul r12 + mov r12,2251799813685247 + mov rcx,rdx + mov rdx,rax + add rdx,QWORD[24+rsp] + adc rcx,QWORD[32+rsp] + mov rax,QWORD[((-104))+rsp] + mov rbx,rcx + mov rcx,rdx + mul QWORD[((-88))+rsp] + add rax,rcx + mov rcx,QWORD[((-72))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-64))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov rdi,rdx + mov rdx,2251799813685247 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[104+rsp] + and r12,rax + shr rax,51 + add rax,QWORD[120+rsp] + and rdx,rax + shr rax,51 + mov QWORD[((-104))+rsp],rdx + add rax,QWORD[40+rsp] + mov QWORD[((-72))+rsp],rdi + mov QWORD[((-88))+rsp],rax + lea rax,[r13*8+r13] + lea rbx,[rax*2+r13] + mov rax,rbx + mul rdi + mov rdi,QWORD[((-8))+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,rdi + mul QWORD[56+rsp] + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul QWORD[((-88))+rsp] + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-8))+rsp],rax + mov rax,QWORD[56+rsp] + mul QWORD[((-72))+rsp] + and rcx,QWORD[((-8))+rsp] + mov QWORD[rsp],rbx + mov QWORD[24+rsp],rcx + mov rcx,rax + mov rax,QWORD[8+rsp] + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + and rax,QWORD[((-8))+rsp] + mov QWORD[rsp],rbx + mov QWORD[40+rsp],rax + mov rax,rdi + mul rbp + mov QWORD[56+rsp],rax + mov QWORD[64+rsp],rdx + mov rax,QWORD[8+rsp] + mul QWORD[((-72))+rsp] + add rax,QWORD[56+rsp] + adc rdx,QWORD[64+rsp] + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rax,rcx + mov rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,rdx + mul r13 + add rax,rcx + mov rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rdx,rbx + mov rbx,rdx + mul r11 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + mov QWORD[((-8))+rsp],rdi + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + mov QWORD[8+rsp],rcx + adc rbx,rdx + and rax,QWORD[8+rsp] + mov QWORD[16+rsp],rbx + mov QWORD[56+rsp],rax + mov rax,rdi + mul r11 + mov QWORD[104+rsp],rax + mov QWORD[112+rsp],rdx + mov rax,rbp + mul QWORD[((-72))+rsp] + mov rcx,QWORD[104+rsp] + mov rbx,QWORD[112+rsp] + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov rdi,rcx + mov rcx,QWORD[8+rsp] + mul r12 + mov rbp,rbx + mov rbx,QWORD[16+rsp] + add rdi,rax + mov rax,QWORD[((-104))+rsp] + adc rbp,rdx 
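+ ; Squaring skeleton: cross terms occur twice, so one factor is
+ ; pre-doubled (the lea reg,[x*1+x] copies) and high limbs are
+ ; pre-scaled by 19; a 5x5 square then takes 15 muls instead of the 25
+ ; a general product would need.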
+ mul r9 + add rdi,rax + mov rax,QWORD[((-88))+rsp] + adc rbp,rdx + mul r13 + add rax,rdi + mov rdi,2251799813685247 + adc rdx,rbp + mov rbp,2251799813685247 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mov QWORD[8+rsp],rcx + and rbp,QWORD[8+rsp] + mov QWORD[16+rsp],rbx + mul r11 + mov QWORD[((-72))+rsp],rax + mov rax,QWORD[((-8))+rsp] + mov QWORD[((-64))+rsp],rdx + mul r13 + add rax,QWORD[((-72))+rsp] + adc rdx,QWORD[((-64))+rsp] + mov rcx,rax + mov rax,r8 + mov rbx,rdx + mov r11,rcx + mov rcx,QWORD[8+rsp] + mul r12 + mov r12,rbx + mov rbx,QWORD[16+rsp] + add r11,rax + mov rax,QWORD[((-104))+rsp] + adc r12,rdx + mul r15 + add r11,rax + mov rax,QWORD[((-88))+rsp] + adc r12,rdx + mul r9 + add rax,r11 + mov r11,2251799813685247 + adc rdx,r12 + mov r12,2251799813685247 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r12,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[24+rsp] + and r11,rax + shr rax,51 + add rax,QWORD[40+rsp] + mov r8,QWORD[56+rsp] + mov r13,QWORD[80+rsp] + mov r15,QWORD[128+rsp] + and rdi,rax + shr rax,51 + add r8,rax + lea rax,[rsi*8+rsi] + lea r9,[rax*2+rsi] + mov rax,r9 + mov r9,QWORD[((-24))+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[88+rsp] + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,2251799813685247 + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + and rax,QWORD[((-104))+rsp] + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-88))+rsp],rax + mov rax,QWORD[88+rsp] + mul r12 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul rsi + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r8 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-104))+rsp],rax + mov rax,r13 + and rcx,QWORD[((-104))+rsp] + mul rbp + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-72))+rsp],rcx + mov QWORD[((-24))+rsp],rax + mov rax,r15 + mov QWORD[((-16))+rsp],rdx + mul r12 + mov r15,QWORD[((-120))+rsp] + mov rcx,rax + mov rax,r15 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul r11 + add rax,rcx + mov QWORD[((-120))+rsp],rax + adc rdx,rbx + mov rax,rdi + mov QWORD[((-112))+rsp],rdx + mul rsi + mov rcx,rax + mov rax,r9 + add rcx,QWORD[((-120))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-112))+rsp] + mul r8 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-96))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mov QWORD[((-104))+rsp],rdx + mul rbp + mov QWORD[((-24))+rsp],rax + mov rax,r13 + mov QWORD[((-16))+rsp],rdx + mul r12 + mov r13,2251799813685247 + mov rcx,rax + mov rax,r11 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul r10 + add rax,rcx + mov QWORD[((-24))+rsp],rax + adc rdx,rbx + mov rax,r15 + mov QWORD[((-16))+rsp],rdx + mul rdi + mov rcx,rax + mov rax,r8 + add rcx,QWORD[((-24))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-16))+rsp] + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r9 
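+ ; The spill slots reloaded here (QWORD[...+rsp]) likely hold earlier
+ ; powers of Z from the addition chain; each multiply merges a freshly
+ ; squared run with one of them, using the same mul/carry skeleton as
+ ; the ladder code above.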
+ mov r9,2251799813685247 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r13,QWORD[((-120))+rsp] + mul r12 + mov QWORD[((-112))+rsp],rbx + mov rcx,QWORD[((-120))+rsp] + mov QWORD[24+rsp],r13 + mov QWORD[((-24))+rsp],rax + mov rax,rsi + mov QWORD[((-16))+rsp],rdx + mul rbp + mov rbx,rax + mov rax,r14 + add rbx,QWORD[((-24))+rsp] + mov rsi,rdx + adc rsi,QWORD[((-16))+rsp] + mul r11 + add rbx,rax + mov rax,r10 + adc rsi,rdx + mov r13,rbx + mov rbx,QWORD[((-112))+rsp] + mul rdi + mov r14,rsi + add r13,rax + mov rax,r15 + mov r15,QWORD[24+rsp] + adc r14,rdx + mul r8 + add rax,r13 + adc rdx,r14 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,rcx + shrd rcx,rbx,51 + mov rbx,2251799813685247 + mov QWORD[224+rsp],rdx + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + and rbx,rax + shr rax,51 + add rax,QWORD[((-72))+rsp] + lea rsi,[rbx*1+rbx] + mov QWORD[128+rsp],rbx + and r9,rax + shr rax,51 + add rax,QWORD[((-104))+rsp] + mov QWORD[136+rsp],r9 + lea r14,[r9*1+r9] + mov r10,rax + lea rax,[rdx*8+rdx] + mov QWORD[80+rsp],r10 + lea rax,[rax*2+rdx] + lea r13,[rax*1+rax] + mov QWORD[192+rsp],rax + mov rax,rbx + mul rbx + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r13 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + add rax,rax + mul r15 + add rcx,rax + lea rax,[r15*8+r15] + adc rbx,rdx + mov r9,rcx + mov rcx,2251799813685247 + mov r10,rbx + lea rbx,[rax*2+r15] + and rcx,r9 + mov r15,rcx + mov QWORD[216+rsp],rbx + mov rax,QWORD[136+rsp] + mul rsi + mov rcx,rax + mov rbx,rdx + mov rax,QWORD[24+rsp] + mul QWORD[216+rsp] + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mul r13 + add rax,rcx + mov rcx,r9 + mov r9,QWORD[136+rsp] + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rcx,QWORD[((-120))+rsp] + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rdx + mul r9 + mov QWORD[((-104))+rsp],rbx + mov rbx,QWORD[((-112))+rsp] + mov r9,rax + mov rax,r13 + mov r13,QWORD[24+rsp] + mov r10,rdx + mul r13 + add r9,rax + mov rax,QWORD[80+rsp] + adc r10,rdx + mul rsi + add r9,rax + mov rax,r13 + adc r10,rdx + shrd rcx,rbx,51 + shr rbx,51 + add r9,rcx + mov rcx,2251799813685247 + adc r10,rbx + and rcx,r9 + mul rsi + mov QWORD[((-120))+rsp],rcx + mov rbx,rdx + mov rcx,rax + mov rax,QWORD[224+rsp] + mul QWORD[192+rsp] + add rcx,rax + mov rax,QWORD[80+rsp] + adc rbx,rdx + mul r14 + add rax,rcx + adc rdx,rbx + shrd r9,r10,51 + shr r10,51 + add r9,rax + mov rax,2251799813685247 + adc r10,rdx + and rax,r9 + mov r13,rax + mov rax,r14 + mov r14,QWORD[80+rsp] + mul QWORD[24+rsp] + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r14 + mov r14,2251799813685247 + add rcx,rax + mov rax,rsi + mov rsi,2251799813685247 + adc rbx,rdx + mul QWORD[224+rsp] + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and rsi,rax + mov rbx,QWORD[((-104))+rsp] + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add r15,rax + and r14,r15 + shr r15,51 + lea rax,[rbx*1+r15] + mov r15,2251799813685247 + and r15,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + mov r10,rax + lea rax,[r12*8+r12] + mov QWORD[((-120))+rsp],r10 + lea r9,[rax*2+r12] + lea rax,[r8*8+r8] + lea rax,[rax*2+r8] + mov QWORD[((-104))+rsp],r9 + mov QWORD[((-72))+rsp],rax + lea rax,[rbp*8+rbp] + lea rax,[rax*2+rbp] + mov QWORD[((-88))+rsp],rax + lea rax,[rdi*8+rdi] + 
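+ ; Before a general multiply, 19x copies of one operand's limbs are
+ ; precomputed with the same 9x/19x lea idiom, so the high cross terms
+ ; that wrap past 2^255 fold in during accumulation instead of in a
+ ; separate pass afterwards.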
lea rbx,[rax*2+rdi] + mov rax,rbx + mul rsi + mov rcx,rax + mov rax,QWORD[((-72))+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r9 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul QWORD[((-88))+rsp] + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mov r9,rcx + mov r10,rbx + mov rbx,2251799813685247 + mul rsi + and rbx,rcx + mov QWORD[((-24))+rsp],rbx + mov rcx,rax + mov rax,QWORD[((-88))+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul QWORD[((-104))+rsp] + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-72))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-72))+rsp] + mov QWORD[((-64))+rsp],rbx + mov rbx,QWORD[((-64))+rsp] + mul r13 + mov QWORD[((-8))+rsp],rcx + mov rcx,QWORD[((-72))+rsp] + mov r9,rax + mov rax,QWORD[((-88))+rsp] + mov r10,rdx + mul rsi + add r9,rax + mov rax,r8 + adc r10,rdx + mul r14 + add r9,rax + mov rax,rdi + adc r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul r11 + add rax,r9 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,2251799813685247 + adc rbx,rdx + and rax,rcx + mov QWORD[((-88))+rsp],rax + mov rax,r11 + mul r13 + mov r9,rax + mov rax,QWORD[((-104))+rsp] + mov r10,rdx + mul rsi + add r9,rax + mov rax,rbp + adc r10,rdx + mul r14 + add r9,rax + mov rax,r8 + adc r10,rdx + mul r15 + add r9,rax + mov rax,QWORD[((-120))+rsp] + adc r10,rdx + mul rdi + add rax,r9 + mov r9,2251799813685247 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r11 + adc rbx,rdx + and r9,rcx + mul rsi + mov r10,rax + mov rax,rdi + mov r11,rdx + mul r13 + mov r13,2251799813685247 + mov rsi,rax + mov rdi,rdx + mov rax,r12 + add rsi,r10 + adc rdi,r11 + mul r14 + add rsi,rax + mov rax,rbp + mov rbp,2251799813685247 + adc rdi,rdx + mul r15 + add rsi,rax + mov rax,QWORD[((-120))+rsp] + adc rdi,rdx + mul r8 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rdx,rsi + mov rbx,QWORD[((-88))+rsp] + shrd rdx,rdi,51 + mov rdi,2251799813685247 + and r13,rsi + lea rax,[rdx*8+rdx] + lea rax,[rax*2+rdx] + add rax,QWORD[((-24))+rsp] + and rbp,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + lea rsi,[rbp*1+rbp] + and rdi,rax + shr rax,51 + lea r8,[rbx*1+rax] + lea rax,[r13*8+r13] + lea r15,[rdi*1+rdi] + lea r12,[rax*2+r13] + mov rax,rdi + mov QWORD[((-8))+rsp],r15 + lea r14,[r12*1+r12] + mul r14 + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul rbp + add rcx,rax + lea rax,[r8*8+r8] + adc rbx,rdx + lea rax,[rax*2+r8] + mov QWORD[((-24))+rsp],rax + add rax,rax + mul r9 + add rcx,rax + lea rax,[r9*8+r9] + adc rbx,rdx + mov r10,rcx + mov rcx,2251799813685247 + mov r11,rbx + lea rbx,[rax*2+r9] + and rcx,r10 + mov QWORD[((-88))+rsp],rcx + mov rax,rbx + mov QWORD[((-72))+rsp],rbx + mul r9 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r14 + mov r14,2251799813685247 + adc rbx,rdx + mov rdx,2251799813685247 + shrd r10,r11,51 + shr r11,51 + add r10,rcx + adc r11,rbx + and rdx,r10 + mov QWORD[((-104))+rsp],rdx + mul r9 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul rsi + add rcx,rax + mov rax,r12 + adc rbx,rdx + shrd 
r10,r11,51 + shr r11,51 + add r10,rcx + adc r11,rbx + and r14,r10 + mul r13 + mov QWORD[((-120))+rsp],r14 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r8 + add rax,rcx + mov rcx,r10 + mov r10,2251799813685247 + adc rdx,rbx + mov rbx,r11 + shrd rcx,r11,51 + shr rbx,51 + mov r11,2251799813685247 + add rcx,rax + mov rax,r8 + adc rbx,rdx + and r10,rcx + mul r8 + mov r14,rax + mov rax,QWORD[((-8))+rsp] + mov r15,rdx + mul r9 + add r14,rax + mov rax,rsi + mov rsi,QWORD[((-120))+rsp] + adc r15,rdx + mul r13 + add rax,r14 + mov r14,2251799813685247 + adc rdx,r15 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-104))+rsp] + lea r15,[rcx*1+rcx] + mov QWORD[((-120))+rsp],rcx + and r11,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + lea rdx,[r11*1+r11] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + mov QWORD[((-120))+rsp],rcx + adc rbx,rdx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r10] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r10 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r11 + mov QWORD[((-96))+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r11,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r11,QWORD[((-120))+rsp] + mul rsi + mov QWORD[((-112))+rsp],rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov rcx,QWORD[((-120))+rsp] + mul r15 + mov r14,2251799813685247 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + mov rsi,QWORD[((-104))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rdx,2251799813685247 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add 
rax,QWORD[((-8))+rsp] + mov QWORD[((-120))+rsp],rcx + lea r15,[rcx*1+rcx] + and rdx,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + mov r10,rdx + lea rdx,[rdx*1+rdx] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r10 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r11 + add rcx,rax + lea rax,[r11*8+r11] + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r11] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r11 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r11 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r10 + mov QWORD[((-96))+rsp],rdx + mul r10 + mov r10,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r10,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rcx,QWORD[((-120))+rsp] + mov rdx,rbx + mov rbx,2251799813685247 + and rbx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rdx + mul rsi + mov r10,rbx + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov r14,2251799813685247 + mul r15 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + mov rsi,QWORD[((-104))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + mov QWORD[((-120))+rsp],rcx + lea r15,[rcx*1+rcx] + and r11,rax + shr rax,51 + add rsi,rax + lea rax,[r14*8+r14] + lea rdx,[r11*1+r11] + lea rbx,[rax*2+r14] + mov QWORD[((-88))+rsp],rdx + lea rdx,[rbx*1+rbx] + mov QWORD[40+rsp],rbx + mov rax,rdx + mov QWORD[((-104))+rsp],rdx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul rax + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r10 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + mov rdx,2251799813685247 + and rdx,QWORD[((-120))+rsp] + lea rcx,[rax*2+r10] + mov QWORD[((-112))+rsp],rbx + mov rax,rcx + mov QWORD[8+rsp],rdx + mul r10 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc 
rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov rcx,2251799813685247 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[((-104))+rsp] + and rcx,QWORD[((-120))+rsp] + mov QWORD[((-112))+rsp],rbx + mul r10 + mov QWORD[((-8))+rsp],rcx + mov QWORD[((-104))+rsp],rax + mov rax,r11 + mov QWORD[((-96))+rsp],rdx + mul r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,rsi + add rcx,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[((-96))+rsp] + mul r15 + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + adc rdx,rbx + mov QWORD[((-112))+rsp],rdx + and r11,QWORD[((-120))+rsp] + mul r14 + mov QWORD[((-104))+rsp],r11 + mov r11,2251799813685247 + mov rcx,rax + mov rax,r15 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,QWORD[((-120))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-112))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and r11,QWORD[((-120))+rsp] + mul rsi + mov QWORD[((-112))+rsp],rbx + mov rcx,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[48+rsp],rdx + mul r10 + mov rbx,rax + mov rax,r14 + add rbx,QWORD[40+rsp] + mov rsi,rdx + adc rsi,QWORD[48+rsp] + mov r14,2251799813685247 + mul r15 + mov r15,2251799813685247 + add rax,rbx + mov rbx,QWORD[((-112))+rsp] + adc rdx,rsi + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + and r14,rax + shrd rax,rdx,51 + mov rcx,rax + lea rax,[rax*8+rax] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov rcx,2251799813685247 + mov rsi,QWORD[((-104))+rsp] + and rcx,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + and r15,rax + shr rax,51 + lea r10,[rsi*1+rax] + lea rax,[rcx*1+rcx] + lea rsi,[r15*1+r15] + mov QWORD[((-120))+rsp],rax + lea rax,[r14*8+r14] + mov QWORD[((-104))+rsp],rsi + lea rdx,[rax*2+r14] + mov rax,rcx + lea rsi,[rdx*1+rdx] + mov QWORD[56+rsp],rdx + mul rcx + mov QWORD[((-88))+rsp],rsi + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r15 + mov rsi,2251799813685247 + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + add rax,rax + mul r11 + add rcx,rax + lea rax,[r11*8+r11] + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + mov QWORD[rsp],rbx + mov rbx,rcx + and rbx,rsi + mov QWORD[8+rsp],rbx + lea rbx,[rax*2+r11] + mov rax,rbx + mul r11 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul r10 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov QWORD[((-8))+rsp],rcx + and rcx,rsi + mul r15 + mov r15,QWORD[((-120))+rsp] + mov QWORD[40+rsp],rcx + mov QWORD[rsp],rbx + mov QWORD[88+rsp],rax + mov rax,QWORD[((-88))+rsp] + mov QWORD[96+rsp],rdx + mul r11 + mov rcx,rax + mov rax,r15 + add rcx,QWORD[88+rsp] + mov rbx,rdx + adc rbx,QWORD[96+rsp] + mul r10 + add rax,rcx + mov rcx,QWORD[((-8))+rsp] + adc rdx,rbx + mov rbx,QWORD[rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov QWORD[((-88))+rsp],rcx + and rcx,rsi + mul r11 + mov QWORD[((-80))+rsp],rbx + mov QWORD[((-8))+rsp],rcx + mov QWORD[88+rsp],rax + mov rax,QWORD[56+rsp] + mov QWORD[96+rsp],rdx + mul r14 + mov rcx,rax + add rcx,QWORD[88+rsp] + 
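+ ; These near-identical blocks are unrolled consecutive squarings; only
+ ; the register and spill-slot assignment differs from copy to copy.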
mov rax,QWORD[((-104))+rsp] + mov rbx,rdx + adc rbx,QWORD[96+rsp] + mul r10 + add rax,rcx + mov rcx,QWORD[((-88))+rsp] + adc rdx,rbx + mov rbx,QWORD[((-80))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,rcx + and rdx,rsi + mov r15,rdx + mul r11 + mov QWORD[((-104))+rsp],rax + mov rax,r10 + mov QWORD[((-96))+rsp],rdx + mul r10 + mov r10,rax + mov rax,QWORD[((-120))+rsp] + add r10,QWORD[((-104))+rsp] + mov r11,rdx + adc r11,QWORD[((-96))+rsp] + mul r14 + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r14,rcx + shrd rcx,rbx,51 + and r14,rsi + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[8+rsp] + mov r10,rax + shr rax,51 + add rax,QWORD[40+rsp] + and r10,rsi + mov QWORD[((-104))+rsp],r10 + mov r11,rax + shr rax,51 + add rax,QWORD[((-8))+rsp] + and r11,rsi + mov QWORD[((-88))+rsp],r11 + mov QWORD[((-120))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul r14 + mov rcx,rax + mov rax,QWORD[((-24))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul QWORD[((-72))+rsp] + add rcx,rax + mov rax,QWORD[((-24))+rsp] + adc rbx,rdx + mov r10,rcx + and rcx,rsi + mov QWORD[((-8))+rsp],rcx + mov r11,rbx + mul r14 + mov rcx,rax + mov rax,QWORD[((-72))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,QWORD[((-88))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-120))+rsp] + adc rbx,rdx + mul r12 + add rax,rcx + mov rcx,r10 + adc rdx,rbx + mov rbx,r11 + shrd rcx,r11,51 + shr rbx,51 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mov rdx,rcx + and rdx,rsi + mov QWORD[((-24))+rsp],rdx + mul r12 + mov r10,rax + mov rax,QWORD[((-72))+rsp] + mov r11,rdx + mul r14 + add r10,rax + mov rax,QWORD[((-104))+rsp] + adc r11,rdx + mul r8 + add r10,rax + mov rax,QWORD[((-88))+rsp] + adc r11,rdx + mul rdi + add r10,rax + mov rax,QWORD[((-120))+rsp] + adc r11,rdx + mul rbp + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rcx + adc rbx,rdx + and rax,rsi + mov QWORD[((-72))+rsp],rax + mov rax,r15 + mul rbp + mov r10,rax + mov rax,r12 + mov r11,rdx + mul r14 + add r10,rax + mov rax,QWORD[((-104))+rsp] + adc r11,rdx + mul r9 + add r10,rax + mov rax,QWORD[((-88))+rsp] + adc r11,rdx + mul r8 + add r10,rax + mov rax,QWORD[((-120))+rsp] + adc r11,rdx + mul rdi + add rax,r10 + adc rdx,r11 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mov QWORD[8+rsp],rcx + mul rbp + mov QWORD[16+rsp],rbx + mov rbx,rcx + and rbx,rsi + mov rcx,QWORD[8+rsp] + mov r10,rbx + mov QWORD[104+rsp],rbx + mov rbx,QWORD[16+rsp] + mov r11,rax + mov rax,r15 + mov r12,rdx + mul rdi + mov rdi,rax + mov rax,QWORD[((-104))+rsp] + mov rbp,rdx + add rdi,r11 + adc rbp,r12 + mov r12,r10 + mul r13 + mov r13,rsi + add rdi,rax + mov rax,QWORD[((-88))+rsp] + adc rbp,rdx + mul r9 + add rdi,rax + mov rax,QWORD[((-120))+rsp] + adc rbp,rdx + mul r8 + add rax,rdi + adc rdx,rbp + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov rdx,rcx + shrd rcx,rbx,51 + and rdx,rsi + lea rax,[rcx*8+rcx] + mov QWORD[208+rsp],rdx + mov rbp,rdx + lea rax,[rax*2+rcx] + add rax,QWORD[((-8))+rsp] + mov r9,rax + shr rax,51 + add rax,QWORD[((-24))+rsp] + and r9,rsi + mov QWORD[((-24))+rsp],10 + mov QWORD[8+rsp],r9 + mov rcx,r9 + mov r14,rax + shr rax,51 + add 
rax,QWORD[((-72))+rsp] + and r14,rsi + mov QWORD[40+rsp],r14 + mov r15,rax + mov QWORD[56+rsp],rax + + +$L$4: + lea rax,[rbp*8+rbp] + lea r8,[rcx*1+rcx] + lea r10,[r14*1+r14] + lea rax,[rax*2+rbp] + mov QWORD[((-120))+rsp],r10 + lea r11,[rax*1+rax] + mov QWORD[((-72))+rsp],rax + lea rax,[r15*8+r15] + lea rax,[rax*2+r15] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r12 + mov rsi,rax + mov rax,rcx + mov rdi,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r11 + add rcx,rsi + adc rbx,rdi + mul r14 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mov rsi,rcx + mul rbp + mov rdi,rbx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r15 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,rsi + and rbx,r13 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r8 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + mul r15 + and rdi,r13 + mov r10,rdi + mov rsi,rax + mov rax,r14 + mov rdi,rdx + mul r14 + add rsi,rax + mov rax,r11 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r9,rsi + mul r8 + and r9,r13 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r13 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r13 + mov rcx,rax + lea rax,[rax*8+rax] + lea r15,[rax*2+rcx] + add r15,QWORD[((-88))+rsp] + mov rcx,r15 + shr r15,51 + add r15,r10 + and rcx,r13 + mov r14,r15 + shr r15,51 + and r14,r13 + add r15,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$4 + mov rbx,QWORD[208+rsp] + mov r11,QWORD[40+rsp] + mov r9,rcx + mov rcx,QWORD[104+rsp] + mov r8,QWORD[8+rsp] + lea rax,[rbx*8+rbx] + lea rdi,[rax*2+rbx] + lea rax,[r11*8+r11] + mov rbx,QWORD[56+rsp] + lea rax,[rax*2+r11] + mov QWORD[88+rsp],rdi + mov QWORD[248+rsp],rax + lea rax,[rbx*8+rbx] + lea rsi,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r10,[rax*2+rcx] + mov rax,r8 + mov QWORD[240+rsp],rsi + mul r9 + mov QWORD[200+rsp],r10 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[248+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul r12 + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r10,rax + mov rax,r8 + mov r8,r11 + mul r14 + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r11,rsi + mul r15 + and r11,r13 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + 
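+ ; $L$4 above runs 10 times (counter at [-24+rsp]); $L$5 and $L$6 below
+ ; run 20 and 10 times. One pass of each body looks like a single field
+ ; squaring (~15 muls), and each loop is followed by a multiply with a
+ ; saved power, consistent with the long squaring runs of the standard
+ ; 2^255-21 inversion chain.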
and rax,r13 + mov r8,rax + mov rax,QWORD[8+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[104+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov QWORD[144+rsp],rax + mov rax,rbp + mov rbp,r13 + mul QWORD[8+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,r9 + mul QWORD[208+rsp] + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul QWORD[40+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[104+rsp] + add rcx,rax + mov rax,r15 + mov r15,2251799813685247 + adc rbx,rdx + mul QWORD[56+rsp] + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r13 + lea rax,[rcx*8+rcx] + mov QWORD[232+rsp],rdi + lea rax,[rax*2+rcx] + add r10,rax + mov r9,r10 + shr r10,51 + add r11,r10 + and r9,r13 + and rbp,r11 + shr r11,51 + mov QWORD[152+rsp],r9 + lea r14,[r8*1+r11] + mov QWORD[120+rsp],rbp + mov rcx,r9 + mov rsi,rbp + mov r11,rdi + mov QWORD[160+rsp],r14 + mov r13,QWORD[144+rsp] + mov QWORD[((-8))+rsp],20 + + +$L$5: + lea rax,[r11*8+r11] + lea r12,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea rax,[rax*2+r11] + mov QWORD[((-120))+rsp],rdi + lea r8,[rax*1+rax] + mov QWORD[((-24))+rsp],rax + lea rax,[r14*8+r14] + lea rax,[rax*2+r14] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r13 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r8 + add rcx,r9 + adc rbx,r10 + mul rsi + add rcx,rax + mov rax,r12 + adc rbx,rdx + mov r9,rcx + mul r11 + mov r10,rbx + mov rbp,r9 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r13*8+r13] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + and rbp,r15 + lea rcx,[rax*2+r13] + mov QWORD[((-96))+rsp],rbx + mov rax,rcx + mul r13 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + adc rbx,r10 + mov QWORD[((-88))+rsp],rcx + mul r14 + mov QWORD[((-80))+rsp],rbx + mov rbx,rcx + and rbx,r15 + mov rcx,QWORD[((-88))+rsp] + mov QWORD[((-72))+rsp],rbx + mov rbx,QWORD[((-80))+rsp] + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r13 + add rsi,rax + mov rax,r13 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r12 + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[((-24))+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r13,rsi + shrd rcx,rdi,51 + mov rbx,rdi + and r13,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + shrd rax,rdx,51 + and r11,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,rbp + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-8))+rsp],1 + jne NEAR $L$5 + mov rbx,QWORD[232+rsp] + mov r9,rcx + mov rcx,QWORD[144+rsp] + mov r8,r11 + mov rdi,QWORD[120+rsp] + mov rbp,rsi + mov QWORD[((-24))+rsp],10 + lea rax,[rbx*8+rbx] + lea r12,[rax*2+rbx] + mov 
rbx,QWORD[160+rsp] + lea rax,[rbx*8+rbx] + lea r11,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r10,[rax*2+rcx] + mov rax,QWORD[152+rsp] + mul r9 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r12 + add rcx,rax + lea rax,[rdi*8+rdi] + adc rbx,rdx + lea rax,[rax*2+rdi] + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r11 + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[152+rsp] + mul rbp + mov rcx,rax + mov rax,QWORD[120+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r11,rsi + mul r14 + and r11,r15 + mov rcx,rax + mov rax,QWORD[160+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r10,rsi + mul r13 + and r10,r15 + mov rcx,rax + mov rax,QWORD[144+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[160+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r12,rsi + mul r8 + and r12,r15 + mov rcx,rax + mov rax,QWORD[232+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[120+rsp] + adc rbx,rdx + mul r13 + mov r13,2251799813685247 + add rcx,rax + mov rax,QWORD[144+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[160+rsp] + adc rbx,rdx + mul r14 + mov r14,r15 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov rbp,rsi + shrd rcx,rdi,51 + and rbp,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-120))+rsp] + mov r9,rax + shr rax,51 + add r11,rax + and r9,r15 + and r14,r11 + shr r11,51 + mov rcx,r9 + lea r15,[r10*1+r11] + + +$L$6: + lea rax,[rbp*8+rbp] + lea r8,[rcx*1+rcx] + lea r10,[r14*1+r14] + lea rax,[rax*2+rbp] + mov QWORD[((-120))+rsp],r10 + lea r11,[rax*1+rax] + mov QWORD[((-72))+rsp],rax + lea rax,[r15*8+r15] + lea rax,[rax*2+r15] + lea rbx,[rax*1+rax] + mov rax,rbx + mul r12 + mov rsi,rax + mov rax,rcx + mov rdi,rdx + mul rcx + mov rcx,rax + mov rbx,rdx + mov rax,r11 + add rcx,rsi + adc rbx,rdi + mul r14 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mov rsi,rcx + mul rbp + mov rdi,rbx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r15 + adc rbx,rdx + mul r15 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,rsi + and rbx,r13 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,r14 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,r8 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + mul r15 + and rdi,r13 + mov r10,rdi + mov rsi,rax + mov rax,r14 + mov rdi,rdx + mul r14 + add rsi,rax + mov rax,r11 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov 
r9,rsi + mul r8 + and r9,r13 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[((-72))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r13 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r13 + mov rcx,rax + lea rax,[rax*8+rax] + lea r15,[rax*2+rcx] + add r15,QWORD[((-88))+rsp] + mov rcx,r15 + shr r15,51 + add r15,r10 + and rcx,r13 + mov r14,r15 + shr r15,51 + and r14,r13 + add r15,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$6 + mov r11,QWORD[8+rsp] + mov r10,QWORD[88+rsp] + mov r9,rcx + mov QWORD[((-24))+rsp],50 + mov rax,r11 + mul rcx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[248+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul r12 + mov rsi,rax + mov rdi,rdx + mov rax,r11 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r14 + and r8,r13 + mov rcx,rax + mov rax,QWORD[40+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r10,rax + mov rax,r11 + mul r15 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov r11,rax + mov rax,QWORD[8+rsp] + mul r12 + mov rcx,rax + mov rax,QWORD[104+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r15 + add rcx,rax + mov rax,QWORD[88+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r13 + mov QWORD[88+rsp],rax + mov rax,QWORD[8+rsp] + mul rbp + mov rbp,r13 + mov rcx,rax + mov rax,QWORD[208+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[40+rsp] + adc rbx,rdx + mul r12 + mov r12,QWORD[88+rsp] + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[56+rsp] + adc rbx,rdx + mul r15 + mov r15,2251799813685247 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r13 + lea rax,[rcx*8+rcx] + mov QWORD[160+rsp],rdi + lea rax,[rax*2+rcx] + add r8,rax + mov rcx,r8 + shr r8,51 + add r10,r8 + and rcx,r13 + and rbp,r10 + shr r10,51 + mov QWORD[40+rsp],rcx + lea r14,[r11*1+r10] + mov QWORD[8+rsp],rbp + mov r11,rdi + mov rsi,rbp + mov QWORD[((-8))+rsp],r14 + + +$L$7: + lea rax,[r11*8+r11] + lea rbp,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea r13,[rax*2+r11] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[r13*1+r13] + add r9,r9 + mov rax,r9 + mul r12 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,rbp + adc r10,rdx + mul r11 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r12 + add rcx,rax + mov 
rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rbp + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul rbp + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + shrd rax,rdx,51 + and r11,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$7 + mov rbx,QWORD[160+rsp] + mov rdi,QWORD[((-8))+rsp] + mov rbp,rsi + mov rsi,QWORD[88+rsp] + mov r10,QWORD[40+rsp] + lea rax,[rbx*8+rbx] + lea r13,[rax*2+rbx] + mov rbx,QWORD[8+rsp] + mov QWORD[152+rsp],r13 + lea rax,[rbx*8+rbx] + lea rax,[rax*2+rbx] + mov QWORD[240+rsp],rax + lea rax,[rdi*8+rdi] + lea r8,[rax*2+rdi] + lea rax,[rsi*8+rsi] + lea r9,[rax*2+rsi] + mov rax,r10 + mov QWORD[232+rsp],r8 + mul rcx + mov QWORD[200+rsp],r9 + mov rsi,rax + mov rax,r13 + mov rdi,rdx + mul rbp + add rsi,rax + mov rax,QWORD[240+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,r9 + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,r8 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,rsi + adc rdi,rdx + mov QWORD[((-120))+rsp],rsi + and rax,r15 + mov QWORD[((-112))+rsp],rdi + mov r8,rax + mov rax,r10 + mov r10,QWORD[((-112))+rsp] + mul rbp + mov rsi,rax + mov rax,rbx + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,r13 + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,QWORD[232+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,r9 + mov r9,QWORD[((-120))+rsp] + adc rdi,rdx + mul r12 + add rsi,rax + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov rax,r9 + adc r10,rdi + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[40+rsp] + mul r14 + mov rsi,rax + mov rax,QWORD[((-8))+rsp] + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,rbx + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,r13 + adc rdi,rdx + mul r12 + add rsi,rax + mov rax,QWORD[200+rsp] + adc rdi,rdx + mul r11 + add rsi,rax + mov rax,QWORD[40+rsp] + adc rdi,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov rdx,r9 + adc r10,rdi + and rdx,r15 + mov r13,rdx + mul r12 + mov rsi,rax + mov rax,QWORD[88+rsp] + mov rdi,rdx + mul rcx + add rsi,rax + mov rax,QWORD[((-8))+rsp] + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,rbx + adc rdi,rdx + mul r14 + add rsi,rax + mov rax,QWORD[152+rsp] + adc rdi,rdx + mul r11 + add rax,rsi + mov rsi,r9 + adc rdx,rdi + mov rdi,r10 + shrd rsi,r10,51 + shr rdi,51 + add rsi,rax + mov rax,rsi + adc rdi,rdx + and rax,r15 + mov QWORD[56+rsp],rax + mov rax,r11 + mul QWORD[40+rsp] + mov r9,rax + mov r10,rdx + 
mov rax,rcx + mul QWORD[160+rsp] + mov rcx,rax + mov rbx,rdx + mov rax,r12 + add rcx,r9 + adc rbx,r10 + mul QWORD[8+rsp] + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul QWORD[88+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[((-8))+rsp] + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rdi,rcx + shrd rcx,rbx,51 + and rdi,r15 + lea rax,[rcx*8+rcx] + mov QWORD[208+rsp],rdi + lea rax,[rax*2+rcx] + add r8,rax + mov rax,QWORD[((-120))+rsp] + mov rsi,r8 + shr r8,51 + and rsi,r15 + lea r11,[rax*1+r8] + mov QWORD[104+rsp],rsi + mov rcx,rsi + and r15,r11 + shr r11,51 + mov QWORD[120+rsp],r15 + lea r14,[r13*1+r11] + mov r12,QWORD[56+rsp] + mov rbp,r15 + mov QWORD[((-24))+rsp],100 + mov r15,2251799813685247 + mov rsi,rbp + mov QWORD[144+rsp],r14 + mov rbp,rdi + + +$L$8: + lea rax,[rbp*8+rbp] + lea r11,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea r13,[rax*2+rbp] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[r13*1+r13] + add r9,r9 + mov rax,r9 + mul r12 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mul rbp + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r12*8+r12] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r12] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r12 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,r12 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r9,rsi + mul r11 + and r9,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + and r12,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov rbp,rax + shrd rax,rdx,51 + and rbp,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r9 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$8 + mov rbx,QWORD[208+rsp] + mov r11,rbp + mov rbp,rsi + mov rsi,rcx + mov rcx,QWORD[56+rsp] + mov r9,QWORD[104+rsp] + mov r10,QWORD[120+rsp] + mov QWORD[((-24))+rsp],50 + lea rax,[rbx*8+rbx] + lea rdi,[rax*2+rbx] + mov rbx,QWORD[144+rsp] + lea rax,[rbx*8+rbx] + lea r13,[rax*2+rbx] + lea rax,[rcx*8+rcx] + lea r8,[rax*2+rcx] + mov rax,r9 + mul rsi + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul rdi + add rcx,rax + lea rax,[r10*8+r10] + adc rbx,rdx + lea rax,[rax*2+r10] + mul r11 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov QWORD[((-120))+rsp],rcx + and rax,r15 + mov QWORD[((-112))+rsp],rbx + mov QWORD[((-104))+rsp],rax + mov rax,r9 + mov r9,QWORD[((-120))+rsp] + mul rbp + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rsi + mov 
r10,QWORD[((-112))+rsp] + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r13 + mov r13,QWORD[120+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rcx + mov rdx,r9 + adc r10,rbx + and rdx,r15 + mov QWORD[((-120))+rsp],rdx + mul r14 + mov rcx,rax + mov rax,QWORD[144+rsp] + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul rdi + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add r9,rcx + adc r10,rbx + mov r8,r9 + mul r12 + and r8,r15 + mov rcx,rax + mov rax,QWORD[56+rsp] + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,QWORD[144+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mul r11 + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[104+rsp] + adc rbx,rdx + mov rdi,rcx + and rdi,r15 + mul r11 + mov r13,rdi + mov r9,rax + mov rax,QWORD[208+rsp] + mov r10,rdx + mul rsi + mov rsi,rax + mov rax,QWORD[120+rsp] + mov rdi,rdx + add rsi,r9 + adc rdi,r10 + mul r12 + add rsi,rax + mov rax,QWORD[56+rsp] + adc rdi,rdx + mul rbp + add rsi,rax + mov rax,QWORD[144+rsp] + adc rdi,rdx + mul r14 + add rsi,rax + adc rdi,rdx + mov rdx,QWORD[((-120))+rsp] + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rsi + adc rbx,rdi + mov r12,rcx + shrd rcx,rbx,51 + mov rbx,QWORD[((-104))+rsp] + and r12,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + lea rbp,[rbx*1+rax] + mov r9,rbp + shr rbp,51 + lea r11,[rbp*1+rdx] + and r9,r15 + mov rcx,r9 + and r15,r11 + shr r11,51 + mov rbp,r15 + lea r14,[r8*1+r11] + mov r15,2251799813685247 + mov rsi,rbp + + +$L$9: + lea rax,[r12*8+r12] + lea r11,[rcx*1+rcx] + lea rdi,[rsi*1+rsi] + lea rbp,[rax*2+r12] + lea rax,[r14*8+r14] + mov QWORD[((-120))+rsp],rdi + lea r9,[rax*2+r14] + lea r8,[rbp*1+rbp] + add r9,r9 + mov rax,r9 + mul r13 + mov r9,rax + mov rax,rcx + mov r10,rdx + mul rcx + add r9,rax + mov rax,r8 + adc r10,rdx + mul rsi + add r9,rax + mov rax,r11 + adc r10,rdx + mul r12 + mov rcx,rax + mov rax,rdi + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r14 + add rcx,rax + lea rax,[r13*8+r13] + mov QWORD[((-104))+rsp],rcx + adc rbx,rdx + lea rcx,[rax*2+r13] + mov QWORD[((-96))+rsp],rbx + mov rbx,r9 + and rbx,r15 + mov rax,rcx + mov QWORD[((-88))+rsp],rbx + mul r13 + mov rcx,rax + mov rax,rsi + mov rbx,rdx + mul r11 + add rcx,rax + mov rax,r8 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd r9,r10,51 + shr r10,51 + add rcx,r9 + mov rdx,rcx + adc rbx,r10 + and rdx,r15 + mov QWORD[((-72))+rsp],rdx + mul r14 + mov r9,rax + mov rax,rsi + mov r10,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r9 + adc rdi,r10 + mul r13 + add rsi,rax + mov rax,r13 + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul r11 + and r8,r15 + mov rcx,rax + mov rax,QWORD[((-120))+rsp] + mov rbx,rdx + mul r14 + add rcx,rax + mov rax,r12 + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[((-104))+rsp] + adc rbx,rdx + mov rdx,QWORD[((-96))+rsp] + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov r13,rcx + shrd rcx,rbx,51 + and r13,r15 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r12,rax + shrd rax,rdx,51 + and r12,r15 + mov rcx,rax + lea rax,[rax*8+rax] + lea 
r14,[rax*2+rcx] + add r14,QWORD[((-88))+rsp] + mov rcx,r14 + shr r14,51 + add r14,QWORD[((-72))+rsp] + and rcx,r15 + mov rsi,r14 + shr r14,51 + and rsi,r15 + add r14,r8 + sub QWORD[((-24))+rsp],1 + jne NEAR $L$9 + mov r11,QWORD[40+rsp] + mov r10,QWORD[152+rsp] + mov r9,rcx + mov rbp,rsi + mov rax,r11 + mul rcx + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rsi + add rcx,rax + mov rax,QWORD[240+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[232+rsp] + adc rbx,rdx + mul r13 + mov rsi,rax + mov rdi,rdx + mov rax,r11 + add rsi,rcx + adc rdi,rbx + mov r8,rsi + mul rbp + and r8,r15 + mov rcx,rax + mov rax,QWORD[8+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r10 + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[232+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r13 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov r10,rax + mov rax,r11 + mul r14 + mov rcx,rax + mov rax,QWORD[((-8))+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[200+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,r11 + mul r13 + mov rcx,rax + mov rax,QWORD[88+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,QWORD[((-8))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[8+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,QWORD[152+rsp] + adc rbx,rdx + mul r12 + add rcx,rax + mov rax,r11 + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rcx,rsi + adc rbx,rdi + mov rsi,rcx + mul r12 + and rsi,r15 + mov r11,rax + mov rax,QWORD[160+rsp] + mov r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[8+rsp] + adc r12,rdx + mul r13 + add r11,rax + mov rax,QWORD[88+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-8))+rsp] + adc r12,rdx + mul r14 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r9,rcx + shrd rcx,rbx,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add r8,rax + mov rax,QWORD[((-120))+rsp] + mov rcx,r8 + shr r8,51 + add r10,r8 + and rcx,r15 + mov r8,r10 + shr r10,51 + lea rdi,[rax*1+r10] + lea rax,[r9*8+r9] + and r8,r15 + lea r10,[rcx*1+rcx] + lea r14,[r8*1+r8] + lea r13,[rax*2+r9] + mov rax,rcx + mul rcx + lea rbp,[r13*1+r13] + mov rcx,rax + mov rax,rbp + mov rbx,rdx + mul r8 + add rcx,rax + lea rax,[rdi*8+rdi] + adc rbx,rdx + lea rax,[rax*2+rdi] + add rax,rax + mul rsi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-120))+rsp],rax + lea rax,[rsi*8+rsi] + lea rcx,[rax*2+rsi] + mov rax,rcx + mul rsi + mov rcx,rax + mov rax,r8 + mov rbx,rdx + mul r10 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mul rdi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul rsi + mov QWORD[((-88))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,r8 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r10 + adc r12,rdx + mul rdi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rbp,rcx + mul r10 + and rbp,r15 + mov 
r11,rax + mov rax,r13 + mov r12,rdx + mul r9 + add r11,rax + mov rax,r14 + adc r12,rdx + mul rdi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov r8,rcx + mul r14 + and r8,r15 + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,r9 + add rsi,r11 + adc rdi,r12 + mul r10 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r9,rsi + shrd rcx,rdi,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-120))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov rdi,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and rdi,r15 + lea r14,[rdi*1+rdi] + lea r13,[rax*2+r9] + mov rax,rdi + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r8 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[r8*8+r8] + lea rcx,[rax*2+r8] + mov rax,rcx + mul r8 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul r8 + mov QWORD[((-96))+rsp],rbx + mov QWORD[((-120))+rsp],rcx + mov rbx,QWORD[((-96))+rsp] + mov rcx,QWORD[((-104))+rsp] + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + mov rbp,rax + mov rax,r13 + mov r12,rdx + and rbp,r15 + mul r9 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rdi,rcx + mul rsi + and rdi,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r9,rcx + shrd rcx,rbx,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov r8,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and r8,r15 + lea r14,[r8*1+r8] + lea r13,[rax*2+r9] + mov rax,r8 + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul rdi + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul rdi + mov QWORD[((-120))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,r8 + mov r12,rdx + mul r8 + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add r11,rax + mov rax,r13 + adc 
r12,rdx + shrd rcx,rbx,51 + shr rbx,51 + add r11,rcx + adc r12,rbx + mov rbp,r11 + mul r9 + and rbp,r15 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov r8,rcx + mul rsi + and r8,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul rdi + mov rsi,rax + mov rdi,rdx + mov rax,r9 + add rsi,r11 + adc rdi,r12 + mul r10 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mov r9,rsi + shrd rcx,rdi,51 + and r9,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov r11,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and r11,r15 + lea r10,[r11*1+r11] + mov rdi,rax + shr rax,51 + lea rsi,[rbp*1+rax] + lea rax,[r9*8+r9] + and rdi,r15 + lea r14,[rdi*1+rdi] + lea r13,[rax*2+r9] + mov rax,rdi + lea rbp,[r13*1+r13] + mul rbp + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul r8 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-88))+rsp],rax + lea rax,[r8*8+r8] + lea rcx,[rax*2+r8] + mov rax,rcx + mul r8 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul rdi + add rcx,rax + mov rax,rsi + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rbp + adc rbx,rdx + mov QWORD[((-104))+rsp],rcx + and rcx,r15 + mul r8 + mov QWORD[((-120))+rsp],rcx + mov QWORD[((-96))+rsp],rbx + mov rcx,QWORD[((-104))+rsp] + mov rbx,QWORD[((-96))+rsp] + mov r11,rax + mov rax,rdi + mov r12,rdx + mul rdi + add r11,rax + mov rax,rsi + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + adc rdx,rbx + mov r11,rax + mov rbp,rax + mov rax,r13 + mov r12,rdx + and rbp,r15 + mul r9 + mov rcx,rax + mov rax,r10 + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,rsi + adc rbx,rdx + mov rdi,rcx + mul rsi + and rdi,r15 + mov r11,rax + mov rax,r14 + mov r12,rdx + mul r8 + add r11,rax + mov rax,r9 + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + adc rbx,rdx + mov r8,rcx + shrd rcx,rbx,51 + and r8,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-88))+rsp] + mov rcx,rax + shr rax,51 + add rax,QWORD[((-120))+rsp] + and rcx,r15 + mov r11,rax + shr rax,51 + and r11,r15 + lea rsi,[rbp*1+rax] + lea rax,[r8*8+r8] + lea r14,[r11*1+r11] + lea rbp,[rcx*1+rcx] + mov QWORD[((-120))+rsp],r14 + lea r14,[rax*2+r8] + mov rax,rcx + mul rcx + lea r13,[r14*1+r14] + mov rcx,rax + mov rax,r13 + mov rbx,rdx + mul r11 + add rcx,rax + lea rax,[rsi*8+rsi] + adc rbx,rdx + lea rax,[rax*2+rsi] + add rax,rax + mul rdi + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r9,rcx + and rax,r15 + mov r10,rbx + mov QWORD[((-104))+rsp],rax + lea rax,[rdi*8+rdi] + lea rcx,[rax*2+rdi] + mov rax,rcx + mul rdi + mov rcx,rax + mov rax,r11 + mov rbx,rdx + mul rbp + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul rsi + add rax,rcx + mov rcx,r9 + adc rdx,rbx + mov rbx,r10 + shrd rcx,r10,51 + shr rbx,51 + add rcx,rax + mov rax,r11 + adc rbx,rdx + mov r9,rcx + and rcx,r15 + mul r11 + mov QWORD[((-88))+rsp],rcx + mov rcx,r9 + mov r11,rax + mov rax,r13 + mov r12,rdx + mul rdi + add 
r11,rax + mov rax,rbp + adc r12,rdx + mul rsi + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,rdi + adc rbx,rdx + mov r13,rcx + mul rbp + and r13,r15 + mov r9,rax + mov rax,r14 + mov r10,rdx + mul r8 + mov r14,QWORD[((-120))+rsp] + add r9,rax + mov rax,r14 + adc r10,rdx + mul rsi + add rax,r9 + adc rdx,r10 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,r14 + mov r14,QWORD[136+rsp] + adc rbx,rdx + mov r9,rcx + mul rdi + and r9,r15 + mov r11,rax + mov rax,rsi + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,rbp + add rsi,r11 + adc rdi,r12 + mul r8 + add rsi,rax + adc rdi,rdx + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rsi + adc rbx,rdi + mov r8,rcx + shrd rcx,rbx,51 + and r8,r15 + lea rax,[rcx*8+rcx] + lea rax,[rax*2+rcx] + add rax,QWORD[((-104))+rsp] + mov r10,rax + shr rax,51 + add rax,QWORD[((-88))+rsp] + and r10,r15 + mov r11,rax + shr rax,51 + lea rbp,[r13*1+rax] + mov r13,QWORD[80+rsp] + and r11,r15 + lea rax,[r13*8+r13] + lea r12,[rax*2+r13] + lea rax,[r14*8+r14] + lea rbx,[rax*2+r14] + mov rax,rbx + mul r8 + mov rcx,rax + mov rax,r9 + mov rbx,rdx + mul r12 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[216+rsp] + adc rbx,rdx + mul rbp + mov rsi,rax + mov rdi,rdx + add rsi,rcx + mov rax,rsi + adc rdi,rbx + and rax,r15 + mov QWORD[((-120))+rsp],rax + mov rax,r12 + mul r8 + mov rcx,rax + mov rax,QWORD[216+rsp] + mov rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[192+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov r12,rsi + mul r9 + and r12,r15 + mov rcx,rax + mov rax,QWORD[216+rsp] + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + shrd rsi,rdi,51 + shr rdi,51 + add rsi,rcx + adc rdi,rbx + mov rcx,rsi + mul r9 + and rcx,r15 + mov QWORD[((-104))+rsp],rcx + mov rcx,rax + mov rax,QWORD[192+rsp] + mov rbx,rdx + mul r8 + add rcx,rax + mov rax,QWORD[24+rsp] + adc rbx,rdx + mul r10 + add rcx,rax + mov rax,r13 + adc rbx,rdx + mul r11 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul rbp + add rax,rcx + mov rcx,rsi + adc rdx,rbx + mov rbx,rdi + shrd rcx,rdi,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[128+rsp] + adc rbx,rdx + mov rsi,rcx + and rsi,r15 + mul r8 + mov r13,rax + mov rax,QWORD[136+rsp] + mov r14,rdx + mul r9 + add r13,rax + mov rax,QWORD[224+rsp] + adc r14,rdx + mul r10 + add r13,rax + mov rax,QWORD[24+rsp] + adc r14,rdx + mul r11 + add r13,rax + mov rax,QWORD[80+rsp] + adc r14,rdx + mul rbp + add rax,r13 + adc rdx,r14 + shrd rcx,rbx,51 + shr rbx,51 + add rax,rcx + mov rcx,QWORD[((-104))+rsp] + adc rdx,rbx + mov rdi,rax + shrd rax,rdx,51 + and rdi,r15 + lea rdx,[rax*8+rax] + lea rax,[rdx*2+rax] + add rax,QWORD[((-120))+rsp] + mov rbp,rax + shr rax,51 + add r12,rax + lea rax,[rdi*8+rdi] + and rbp,r15 + mov r9,r12 + shr r12,51 + lea r10,[rcx*1+r12] + lea r8,[rax*2+rdi] + and r9,r15 + lea rax,[r10*8+r10] + lea r14,[rax*2+r10] + lea rax,[rsi*8+rsi] + lea r13,[rax*2+rsi] + mov rax,QWORD[((-56))+rsp] + mul r8 + mov rcx,rax + mov rax,QWORD[168+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul 
rbp + add rcx,rax + lea rax,[r9*8+r9] + adc rbx,rdx + lea rax,[rax*2+r9] + mul QWORD[((-32))+rsp] + add rcx,rax + mov rax,QWORD[184+rsp] + adc rbx,rdx + mul r14 + add rcx,rax + mov rax,rcx + adc rbx,rdx + mov r11,rcx + and rax,r15 + mov r12,rbx + mov QWORD[((-120))+rsp],rax + mov rax,QWORD[168+rsp] + mul r8 + mov rcx,rax + mov rax,QWORD[184+rsp] + mov rbx,rdx + mul r13 + add rcx,rax + mov rax,QWORD[((-56))+rsp] + adc rbx,rdx + mul rbp + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul r9 + add rcx,rax + mov rax,r14 + adc rbx,rdx + mul QWORD[((-32))+rsp] + add rax,rcx + mov rcx,r11 + adc rdx,rbx + mov rbx,r12 + shrd rcx,r12,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[184+rsp] + adc rbx,rdx + mov r14,rcx + and r14,r15 + mul r8 + mov r11,rax + mov r12,rdx + mov rax,r13 + mul QWORD[((-32))+rsp] + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[((-56))+rsp] + adc r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[72+rsp] + mov QWORD[((-96))+rsp],0 + adc r12,rdx + mov QWORD[((-112))+rsp],0 + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mov r13,rcx + and r13,r15 + mul rsi + mov r11,rax + mov rax,r8 + mov r8,QWORD[((-32))+rsp] + mov r12,rdx + mul r8 + add r11,rax + mov rax,QWORD[184+rsp] + adc r12,rdx + mul rbp + add r11,rax + mov rax,QWORD[168+rsp] + adc r12,rdx + mul r9 + add r11,rax + mov rax,QWORD[((-56))+rsp] + adc r12,rdx + mul r10 + add rax,r11 + adc rdx,r12 + shrd rcx,rbx,51 + shr rbx,51 + add rcx,rax + mov rax,QWORD[72+rsp] + adc rbx,rdx + mul rdi + mov r11,rax + mov rax,QWORD[((-56))+rsp] + mov r12,rdx + mul rsi + mov rsi,rax + mov rdi,rdx + mov rax,r8 + add rsi,r11 + adc rdi,r12 + mul rbp + add rsi,rax + mov rax,QWORD[184+rsp] + adc rdi,rdx + mul r9 + add rsi,rax + mov rax,QWORD[168+rsp] + adc rdi,rdx + mul r10 + add rsi,rax + mov rax,rcx + adc rdi,rdx + mov rdx,rbx + shrd rax,rbx,51 + shr rdx,51 + add rax,rsi + adc rdx,rdi + mov rsi,rax + mov QWORD[((-104))+rsp],rax + shrd rsi,rdx,51 + mov rax,rcx + xor r10d,r10d + lea rdx,[rsi*8+rsi] + and rax,r15 + mov r9,rax + mov rax,QWORD[((-104))+rsp] + lea rbp,[rdx*2+rsi] + add rbp,QWORD[((-120))+rsp] + xor edx,edx + mov rbx,rdx + shr rbx,51 + mov rdi,rbp + shr rdi,51 + add rdi,r14 + mov r8,rdi + shr r8,51 + add r8,r13 + mov rcx,r8 + shrd rcx,rdx,51 + add r9,rcx + adc r10,rbx + and rax,r15 + xor r12d,r12d + mov r11,rax + mov rax,r9 + mov rdx,r10 + shrd rax,r10,51 + shr rdx,51 + add r11,rax + mov eax,19 + adc r12,rdx + mov rcx,r11 + and rbp,r15 + shrd rcx,r12,51 + mov rbx,r12 + mul rcx + shr rbx,51 + imul rsi,rbx,19 + mov rcx,rax + mov rbx,rdx + xor edx,edx + add rbx,rsi + add rcx,rbp + adc rbx,rdx + mov rsi,rcx + and rdi,r15 + shrd rsi,rbx,51 + mov rax,rdi + mov rdi,rbx + xor edx,edx + shr rdi,51 + add rsi,rax + mov rbx,rcx + adc rdi,rdx + and r8,r15 + xor edx,edx + mov rbp,rdi + mov rdi,rsi + mov QWORD[((-120))+rsp],rsi + shrd rdi,rbp,51 + shr rbp,51 + mov rsi,r9 + add rdi,r8 + adc rbp,rdx + mov r8,rdi + and rsi,r15 + shrd r8,rbp,51 + mov r9,rbp + xor edx,edx + shr r9,51 + add r8,rsi + mov rsi,r11 + adc r9,rdx + and rsi,r15 + xor edx,edx + mov r10,r9 + mov r9,r8 + mov QWORD[((-104))+rsp],r8 + shrd r9,r10,51 + shr r10,51 + add r9,rsi + mov esi,19 + mov r8,r9 + adc r10,rdx + and rbx,r15 + mov rax,r8 + mov rdx,r10 + mov r11,rbx + shrd rax,r10,51 + shr rdx,51 + xor r12d,r12d + mov rbx,rdi + imul r10,rdx,19 + mul rsi + add rdx,r10 + add r11,19 + adc r12,0 + add r11,rax + mov rax,QWORD[((-120))+rsp] + adc r12,rdx + xor r14d,r14d 
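+ ; final reduction mod 2^255-19: the chain below folds carries through the five 51-bit limbs (r15 holds the mask 2^51-1; overflow above bit 255 wraps around multiplied by 19, and 2^51-19 is added for the final freeze) before the limbs are packed into the 32-byte little-endian result.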
+ mov rdx,r12 + and rax,r15 + shr rdx,51 + mov r13,rax + mov rax,r11 + shrd rax,r12,51 + add r13,rax + mov rax,QWORD[((-104))+rsp] + adc r14,rdx + mov rsi,r13 + and rbx,r15 + shrd rsi,r14,51 + mov rdi,r14 + xor edx,edx + shr rdi,51 + add rsi,rbx + adc rdi,rdx + and rax,r15 + xor ebx,ebx + mov rcx,rax + mov rax,rsi + mov rdx,rdi + shrd rax,rdi,51 + shr rdx,51 + mov rdi,r8 + mov QWORD[((-120))+rsp],rsi + add rcx,rax + mov esi,19 + adc rbx,rdx + mov r8,rcx + and rdi,r15 + shrd r8,rbx,51 + mov r9,rbx + xor edx,edx + mov rbx,rcx + shr r9,51 + add r8,rdi + adc r9,rdx + mov rax,r8 + xor r12d,r12d + shrd rax,r9,51 + mov rdx,r9 + mov QWORD[((-80))+rsp],r9 + xor r10d,r10d + shr rdx,51 + mov QWORD[((-88))+rsp],r8 + imul rdi,rdx,19 + mul rsi + add rdx,rdi + mov rdi,r11 + mov r11,2251799813685229 + and rdi,r15 + mov r9,rdi + mov rdi,r13 + add r9,r11 + adc r10,r12 + add rax,r9 + mov r9,2251799813685247 + adc rdx,r10 + and rdi,r15 + xor r10d,r10d + mov r11,rdi + xor r12d,r12d + mov r13,rax + add r11,r9 + mov rdi,QWORD[((-120))+rsp] + mov r14,rdx + adc r12,r10 + shr r14,51 + shrd r13,rdx,51 + add r13,r11 + adc r14,r12 + and rdi,r15 + xor r12d,r12d + mov rsi,rdi + mov r11,r13 + mov rdi,r12 + add rsi,r9 + mov r12,r14 + adc rdi,r10 + shr r12,51 + shrd r11,r14,51 + add r11,rsi + adc r12,rdi + and rbx,r15 + xor r14d,r14d + mov rdi,r13 + mov rcx,rbx + mov r13,r11 + and rdi,r15 + mov rbx,r14 + add rcx,r9 + adc rbx,r10 + mov r14,r12 + mov QWORD[((-104))+rsp],rdi + shrd r13,r12,51 + mov rsi,QWORD[((-104))+rsp] + shr r14,51 + add r13,rcx + adc r14,rbx + mov rbx,r11 + mov rdi,r13 + and rbx,r15 + mov rdx,rsi + and rax,r15 + mov r13,rbx + mov rbx,rdi + sal rdx,51 + and rbx,r15 + or rax,rdx + mov r11,rdi + mov QWORD[((-120))+rsp],rbx + mov rbx,QWORD[352+rsp] + mov rdx,rax + shr rdx,8 + mov rdi,QWORD[((-96))+rsp] + mov rbp,r14 + mov rcx,r13 + xor r14d,r14d + mov BYTE[1+rbx],dl + mov rdx,rax + mov BYTE[rbx],al + shr rdx,16 + mov BYTE[2+rbx],dl + mov rdx,rax + shr rdx,24 + mov BYTE[3+rbx],dl + mov rdx,rax + shr rdx,32 + mov BYTE[4+rbx],dl + mov rdx,rax + mov r12,QWORD[((-112))+rsp] + shr rdx,40 + mov BYTE[5+rbx],dl + mov rdx,rax + shr rax,56 + mov BYTE[7+rbx],al + mov rax,r13 + shr rdx,48 + shrd rsi,rdi,13 + sal rax,38 + mov rdi,rbx + mov BYTE[6+rbx],dl + or rsi,rax + mov r13,r11 + mov r11,QWORD[((-120))+rsp] + mov rax,rsi + mov BYTE[8+rbx],sil + shr rax,8 + mov BYTE[9+rbx],al + mov rax,rsi + shr rax,16 + mov BYTE[10+rbx],al + mov rax,rsi + shr rax,24 + mov BYTE[11+rbx],al + mov rax,rsi + shr rax,32 + mov BYTE[12+rbx],al + mov rax,rsi + shr rax,40 + mov BYTE[13+rbx],al + mov rax,rsi + shr rsi,56 + shr rax,48 + mov BYTE[15+rbx],sil + mov BYTE[14+rbx],al + mov rax,QWORD[((-120))+rsp] + mov rbx,rdi + shrd rcx,r14,26 + sal rax,25 + or rcx,rax + mov rax,rcx + mov BYTE[16+rdi],cl + shr rax,8 + mov BYTE[17+rdi],al + mov rax,rcx + shr rax,16 + mov BYTE[18+rdi],al + mov rax,rcx + shr rax,24 + mov BYTE[19+rdi],al + mov rax,rcx + shr rax,32 + mov BYTE[20+rdi],al + mov rax,rcx + shr rax,40 + mov BYTE[21+rdi],al + mov rax,rcx + shr rax,48 + shr rcx,56 + mov BYTE[22+rdi],al + mov BYTE[23+rdi],cl + mov rdi,QWORD[((-88))+rsp] + and rdi,r15 + mov rax,rdi + add rax,r9 + shrd r13,rbp,51 + add rax,r13 + and rax,r15 + shrd r11,r12,39 + sal rax,12 + or rax,r11 + mov rdx,rax + mov BYTE[24+rbx],al + shr rdx,8 + mov BYTE[25+rbx],dl + mov rdx,rax + shr rdx,16 + mov BYTE[26+rbx],dl + mov rdx,rax + shr rdx,24 + mov BYTE[27+rbx],dl + mov rdx,rax + shr rdx,32 + mov BYTE[28+rbx],dl + mov rdx,rax + shr rdx,40 + mov BYTE[29+rbx],dl + mov rdx,rax + shr 
rax,56 + shr rdx,48 + mov BYTE[31+rbx],al + xor eax,eax + mov BYTE[30+rbx],dl + add rsp,784 + + + pop rdi + pop rsi + pop rbx + pop rbp + pop r12 + pop r13 + pop r14 + pop r15 + ret + +$L$FE13: + + diff --git a/crypto/make_all_asm_files.sh b/crypto/make_all_asm_files.sh new file mode 100644 index 0000000..fcbed3d --- /dev/null +++ b/crypto/make_all_asm_files.sh @@ -0,0 +1,28 @@ +#!/bin/sh + +set -e + +# macos +perl make_chacha20_x64.pl macosx > chacha20_x64_gas_macosx.s +perl make_poly1305_x64.pl macosx > poly1305_x64_gas_macosx.s + +cd aesgcm + +perl aesni-gcm-x86_64.pl macosx > aesni_gcm_x64_gas_macosx.s +perl aesni-x86_64.pl macosx > aesni_x64_gas_macosx.s +perl ghash-x86_64.pl macosx > ghash_x64_gas_macosx.s + +cd .. + + +# linux,freebsd +perl make_chacha20_x64.pl gas > chacha20_x64_gas.s +perl make_poly1305_x64.pl gas > poly1305_x64_gas.s + +cd aesgcm + +perl aesni-gcm-x86_64.pl gas > aesni_gcm_x64_gas.s +perl aesni-x86_64.pl gas > aesni_x64_gas.s +perl ghash-x86_64.pl gas > ghash_x64_gas.s + +cd .. diff --git a/crypto/make_chacha20_x64.pl b/crypto/make_chacha20_x64.pl new file mode 100644 index 0000000..f9379ca --- /dev/null +++ b/crypto/make_chacha20_x64.pl @@ -0,0 +1,3665 @@ +#! /usr/bin/env perl +# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# November 2014 +# +# chacha20 for x86_64. +# +# December 2016 +# +# Add AVX512F code path. +# +# December 2017 +# +# Add AVX512VL code path. +# +# Performance in cycles per byte out of large buffer. 
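+# (cycles per byte, lower is better; "IALU" denotes the scalar integer-only code path)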
+# +# IALU/gcc 4.8(i) 1xSSSE3/SSE2 4xSSSE3 NxAVX(v) +# +# P4 9.48/+99% -/22.7(ii) - +# Core2 7.83/+55% 7.90/8.08 4.35 +# Westmere 7.19/+50% 5.60/6.70 3.00 +# Sandy Bridge 8.31/+42% 5.45/6.76 2.72 +# Ivy Bridge 6.71/+46% 5.40/6.49 2.41 +# Haswell 5.92/+43% 5.20/6.45 2.42 1.23 +# Skylake[-X] 5.87/+39% 4.70/- 2.31 1.19[0.80(vi)] +# Silvermont 12.0/+33% 7.75/7.40 7.03(iii) +# Knights L 11.7/- - 9.60(iii) 0.80 +# Goldmont 10.6/+17% 5.10/- 3.28 +# Sledgehammer 7.28/+52% -/14.2(ii) - +# Bulldozer 9.66/+28% 9.85/11.1 3.06(iv) +# Ryzen 5.96/+50% 5.19/- 2.40 2.09 +# VIA Nano 10.5/+46% 6.72/8.60 6.05 +# +# (i) compared to older gcc 3.x one can observe >2x improvement on +# most platforms; +# (ii) as it can be seen, SSE2 performance is too low on legacy +# processors; NxSSE2 results are naturally better, but not +# impressively better than IALU ones, which is why you won't +# find SSE2 code below; +# (iii) this is not optimal result for Atom because of MSROM +# limitations, SSE2 can do better, but gain is considered too +# low to justify the [maintenance] effort; +# (iv) Bulldozer actually executes 4xXOP code path that delivers 2.20; +# (v) 8xAVX2, 8xAVX512VL or 16xAVX512F, whichever best applicable; +# (vi) even though Skylake-X can execute AVX512F code and deliver 0.57 +# cpb in single thread, the corresponding capability is suppressed; + +$flavour = shift; +$output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/); + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or +( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or +die "can't locate x86_64-xlate.pl"; + +$avx = 3; +$avx = 2 if ($flavour =~ /macosx/); + +open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""; +*STDOUT=*OUT; + +# input parameter block +($out,$inp,$len,$key,$counter)=("%rdi","%rsi","%rdx","%rcx","%r8"); + +$code.=<<___; +.text + +.align 64 +.Lzero: +.long 0,0,0,0 +.Lone: +.long 1,0,0,0 +.Linc: +.long 0,1,2,3 +.Lfour: +.long 4,4,4,4 +.Lincy: +.long 0,2,4,6,1,3,5,7 +.Leight: +.long 8,8,8,8,8,8,8,8 +.Lrot16: +.byte 0x2,0x3,0x0,0x1, 0x6,0x7,0x4,0x5, 0xa,0xb,0x8,0x9, 0xe,0xf,0xc,0xd +.Lrot24: +.byte 0x3,0x0,0x1,0x2, 0x7,0x4,0x5,0x6, 0xb,0x8,0x9,0xa, 0xf,0xc,0xd,0xe +.Lsigma: +.asciz "expand 32-byte k" +.align 64 +.Lzeroz: +.long 0,0,0,0, 1,0,0,0, 2,0,0,0, 3,0,0,0 +.Lfourz: +.long 4,0,0,0, 4,0,0,0, 4,0,0,0, 4,0,0,0 +.Lincz: +.long 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lsixteen: +.long 16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16 +.align 64 +.Ltwoy: +.long 2,0,0,0, 2,0,0,0 +___ + +sub AUTOLOAD() # thunk [simplified] 32-bit style perlasm +{ my $opcode = $AUTOLOAD; $opcode =~ s/.*:://; + my $arg = pop; + $arg = "\$$arg" if ($arg*1 eq $arg); + $code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n"; +} + +@x=("%eax","%ebx","%ecx","%edx",map("%r${_}d",(8..11)), + "%nox","%nox","%nox","%nox",map("%r${_}d",(12..15))); +@t=("%esi","%edi"); + +sub ROUND { # critical path is 24 cycles per round +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_)=map("\"$_\"",@t); +my @x=map("\"$_\"",@x); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 
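+ # (even rounds thus operate on the columns of the 4x4 state matrix, odd rounds on its diagonals)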
+ # + 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. + + # Normally instructions would be interleaved to favour in-order + # execution. Generally out-of-order cores manage it gracefully, + # but not this time for some reason. As in-order execution + # cores are dying breed, old Atom is the only one around, + # instructions are left uninterleaved. Besides, Atom is better + # off executing 1xSSSE3 code anyway... + + ( + "&add (@x[$a0],@x[$b0])", # Q1 + "&xor (@x[$d0],@x[$a0])", + "&rol (@x[$d0],16)", + "&add (@x[$a1],@x[$b1])", # Q2 + "&xor (@x[$d1],@x[$a1])", + "&rol (@x[$d1],16)", + + "&add ($xc,@x[$d0])", + "&xor (@x[$b0],$xc)", + "&rol (@x[$b0],12)", + "&add ($xc_,@x[$d1])", + "&xor (@x[$b1],$xc_)", + "&rol (@x[$b1],12)", + + "&add (@x[$a0],@x[$b0])", + "&xor (@x[$d0],@x[$a0])", + "&rol (@x[$d0],8)", + "&add (@x[$a1],@x[$b1])", + "&xor (@x[$d1],@x[$a1])", + "&rol (@x[$d1],8)", + + "&add ($xc,@x[$d0])", + "&xor (@x[$b0],$xc)", + "&rol (@x[$b0],7)", + "&add ($xc_,@x[$d1])", + "&xor (@x[$b1],$xc_)", + "&rol (@x[$b1],7)", + + "&mov (\"4*$c0(%rsp)\",$xc)", # reload pair of 'c's + "&mov (\"4*$c1(%rsp)\",$xc_)", + "&mov ($xc,\"4*$c2(%rsp)\")", + "&mov ($xc_,\"4*$c3(%rsp)\")", + + "&add (@x[$a2],@x[$b2])", # Q3 + "&xor (@x[$d2],@x[$a2])", + "&rol (@x[$d2],16)", + "&add (@x[$a3],@x[$b3])", # Q4 + "&xor (@x[$d3],@x[$a3])", + "&rol (@x[$d3],16)", + + "&add ($xc,@x[$d2])", + "&xor (@x[$b2],$xc)", + "&rol (@x[$b2],12)", + "&add ($xc_,@x[$d3])", + "&xor (@x[$b3],$xc_)", + "&rol (@x[$b3],12)", + + "&add (@x[$a2],@x[$b2])", + "&xor (@x[$d2],@x[$a2])", + "&rol (@x[$d2],8)", + "&add (@x[$a3],@x[$b3])", + "&xor (@x[$d3],@x[$a3])", + "&rol (@x[$d3],8)", + + "&add ($xc,@x[$d2])", + "&xor (@x[$b2],$xc)", + "&rol (@x[$b2],7)", + "&add ($xc_,@x[$d3])", + "&xor (@x[$b3],$xc_)", + "&rol (@x[$b3],7)" + ); +} +######################################################################## +# HCHACHA20_SSSE3 +$code.=<<___; + +.global hchacha20_ssse3 +.type hchacha20_ssse3,\@function,5 +.align 32 +hchacha20_ssse3: +.cfi_startproc +.Lhchacha20_ssse3: + movdqa .Lsigma(%rip),%xmm0 + movdqu (%rdx),%xmm1 + movdqu 16(%rdx),%xmm2 + movdqu (%rsi),%xmm3 + movdqa .Lrot16(%rip),%xmm6 + movdqa .Lrot24(%rip),%xmm7 + movq \$10,%r8 + .align 32 +.Loop_hssse3: + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$20,%xmm1 + pslld \$12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$25,%xmm1 + pslld \$7,%xmm4 + por %xmm4,%xmm1 + pshufd \$78,%xmm2,%xmm2 + pshufd \$57,%xmm1,%xmm1 + pshufd \$147,%xmm3,%xmm3 + nop + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm6,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$20,%xmm1 + pslld \$12,%xmm4 + por %xmm4,%xmm1 + paddd %xmm1,%xmm0 + pxor %xmm0,%xmm3 + pshufb %xmm7,%xmm3 + paddd %xmm3,%xmm2 + pxor %xmm2,%xmm1 + movdqa %xmm1,%xmm4 + psrld \$25,%xmm1 + pslld \$7,%xmm4 + por %xmm4,%xmm1 + pshufd \$78,%xmm2,%xmm2 + pshufd \$147,%xmm1,%xmm1 + pshufd \$57,%xmm3,%xmm3 + decq %r8 + jnz .Loop_hssse3 + movdqu %xmm0,0(%rdi) + movdqu %xmm3,16(%rdi) + ret +.cfi_endproc +.size 
hchacha20_ssse3,.-hchacha20_ssse3 +___ + + +######################################################################## +# SSSE3 code path that handles shorter lengths +{ +my ($a,$b,$c,$d,$t,$t1,$rot16,$rot24)=map("%xmm$_",(0..7)); + +sub SSSE3ROUND { # critical path is 20 "SIMD ticks" per round + &paddd ($a,$b); + &pxor ($d,$a); + &pshufb ($d,$rot16); + + &paddd ($c,$d); + &pxor ($b,$c); + &movdqa ($t,$b); + &psrld ($b,20); + &pslld ($t,12); + &por ($b,$t); + + &paddd ($a,$b); + &pxor ($d,$a); + &pshufb ($d,$rot24); + + &paddd ($c,$d); + &pxor ($b,$c); + &movdqa ($t,$b); + &psrld ($b,25); + &pslld ($t,7); + &por ($b,$t); +} + +my $xframe = $win64 ? 32+8 : 8; + +$code.=<<___; +.global chacha20_ssse3 +.type chacha20_ssse3,\@function,5 +.align 32 +chacha20_ssse3: +.cfi_startproc +.Lchacha20_ssse3: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 +___ +#$code.=<<___ if ($avx); +# test \$`1<<(43-32)`,%r10d +# jnz .Lchacha20_4xop # XOP is fastest even if we use 1/4 +#___ +$code.=<<___; + cmp \$128,$len # we might throw away some data, + ja .Lchacha20_4x # but overall it won't be slower + +.Ldo_sse3_after_all: + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lssse3_body: +___ +$code.=<<___; + movdqa .Lsigma(%rip),$a + movdqu ($key),$b + movdqu 16($key),$c + movdqu ($counter),$d + movdqa .Lrot16(%rip),$rot16 + movdqa .Lrot24(%rip),$rot24 + + movdqa $a,0x00(%rsp) + movdqa $b,0x10(%rsp) + movdqa $c,0x20(%rsp) + movdqa $d,0x30(%rsp) + mov \$10,$counter # reuse $counter + jmp .Loop_ssse3 + +.align 32 +.Loop_outer_ssse3: + movdqa .Lone(%rip),$d + movdqa 0x00(%rsp),$a + movdqa 0x10(%rsp),$b + movdqa 0x20(%rsp),$c + paddd 0x30(%rsp),$d + mov \$10,$counter + movdqa $d,0x30(%rsp) + jmp .Loop_ssse3 + +.align 32 +.Loop_ssse3: +___ + &SSSE3ROUND(); + &pshufd ($c,$c,0b01001110); + &pshufd ($b,$b,0b00111001); + &pshufd ($d,$d,0b10010011); + &nop (); + + &SSSE3ROUND(); + &pshufd ($c,$c,0b01001110); + &pshufd ($b,$b,0b10010011); + &pshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_ssse3"); + +$code.=<<___; + paddd 0x00(%rsp),$a + paddd 0x10(%rsp),$b + paddd 0x20(%rsp),$c + paddd 0x30(%rsp),$d + + cmp \$64,$len + jb .Ltail_ssse3 + + movdqu 0x00($inp),$t + movdqu 0x10($inp),$t1 + pxor $t,$a # xor with input + movdqu 0x20($inp),$t + pxor $t1,$b + movdqu 0x30($inp),$t1 + lea 0x40($inp),$inp # inp+=64 + pxor $t,$c + pxor $t1,$d + + movdqu $a,0x00($out) # write output + movdqu $b,0x10($out) + movdqu $c,0x20($out) + movdqu $d,0x30($out) + lea 0x40($out),$out # out+=64 + + sub \$64,$len + jnz .Loop_outer_ssse3 + + jmp .Ldone_ssse3 + +.align 16 +.Ltail_ssse3: + movdqa $a,0x00(%rsp) + movdqa $b,0x10(%rsp) + movdqa $c,0x20(%rsp) + movdqa $d,0x30(%rsp) + xor $counter,$counter + +.Loop_tail_ssse3: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_ssse3 + +.Ldone_ssse3: +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lssse3_epilogue: + ret +.cfi_endproc +.size chacha20_ssse3,.-chacha20_ssse3 +___ +} + +######################################################################## +# SSSE3 code path that handles longer messages. 
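+# Four 64-byte blocks are processed in parallel: state word x[i] of every block occupies one dword lane of an xmm register (key and counter material is smashed by lanes), and .Linc/.Lfour step all four block counters together.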
+{ +# assign variables to favor Atom front-end +my ($xd0,$xd1,$xd2,$xd3, $xt0,$xt1,$xt2,$xt3, + $xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3)=map("%xmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + "%nox","%nox","%nox","%nox", $xd0,$xd1,$xd2,$xd3); + +sub SSSE3_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_,$t0,$t1)=map("\"$_\"",$xt0,$xt1,$xt2,$xt3); +my @x=map("\"$_\"",@xx); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 + # + # 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. + + ( + "&paddd (@x[$a0],@x[$b0])", # Q1 + "&paddd (@x[$a1],@x[$b1])", # Q2 + "&pxor (@x[$d0],@x[$a0])", + "&pxor (@x[$d1],@x[$a1])", + "&pshufb (@x[$d0],$t1)", + "&pshufb (@x[$d1],$t1)", + + "&paddd ($xc,@x[$d0])", + "&paddd ($xc_,@x[$d1])", + "&pxor (@x[$b0],$xc)", + "&pxor (@x[$b1],$xc_)", + "&movdqa ($t0,@x[$b0])", + "&pslld (@x[$b0],12)", + "&psrld ($t0,20)", + "&movdqa ($t1,@x[$b1])", + "&pslld (@x[$b1],12)", + "&por (@x[$b0],$t0)", + "&psrld ($t1,20)", + "&movdqa ($t0,'(%r11)')", # .Lrot24(%rip) + "&por (@x[$b1],$t1)", + + "&paddd (@x[$a0],@x[$b0])", + "&paddd (@x[$a1],@x[$b1])", + "&pxor (@x[$d0],@x[$a0])", + "&pxor (@x[$d1],@x[$a1])", + "&pshufb (@x[$d0],$t0)", + "&pshufb (@x[$d1],$t0)", + + "&paddd ($xc,@x[$d0])", + "&paddd ($xc_,@x[$d1])", + "&pxor (@x[$b0],$xc)", + "&pxor (@x[$b1],$xc_)", + "&movdqa ($t1,@x[$b0])", + "&pslld (@x[$b0],7)", + "&psrld ($t1,25)", + "&movdqa ($t0,@x[$b1])", + "&pslld (@x[$b1],7)", + "&por (@x[$b0],$t1)", + "&psrld ($t0,25)", + "&movdqa ($t1,'(%r10)')", # .Lrot16(%rip) + "&por (@x[$b1],$t0)", + + "&movdqa (\"`16*($c0-8)`(%rsp)\",$xc)", # reload pair of 'c's + "&movdqa (\"`16*($c1-8)`(%rsp)\",$xc_)", + "&movdqa ($xc,\"`16*($c2-8)`(%rsp)\")", + "&movdqa ($xc_,\"`16*($c3-8)`(%rsp)\")", + + "&paddd (@x[$a2],@x[$b2])", # Q3 + "&paddd (@x[$a3],@x[$b3])", # Q4 + "&pxor (@x[$d2],@x[$a2])", + "&pxor (@x[$d3],@x[$a3])", + "&pshufb (@x[$d2],$t1)", + "&pshufb (@x[$d3],$t1)", + + "&paddd ($xc,@x[$d2])", + "&paddd ($xc_,@x[$d3])", + "&pxor (@x[$b2],$xc)", + "&pxor (@x[$b3],$xc_)", + "&movdqa ($t0,@x[$b2])", + "&pslld (@x[$b2],12)", + "&psrld ($t0,20)", + "&movdqa ($t1,@x[$b3])", + "&pslld (@x[$b3],12)", + "&por (@x[$b2],$t0)", + "&psrld ($t1,20)", + "&movdqa ($t0,'(%r11)')", # .Lrot24(%rip) + "&por (@x[$b3],$t1)", + + "&paddd (@x[$a2],@x[$b2])", + "&paddd (@x[$a3],@x[$b3])", + "&pxor (@x[$d2],@x[$a2])", + "&pxor (@x[$d3],@x[$a3])", + "&pshufb (@x[$d2],$t0)", + "&pshufb (@x[$d3],$t0)", + + "&paddd ($xc,@x[$d2])", + "&paddd ($xc_,@x[$d3])", + "&pxor (@x[$b2],$xc)", + "&pxor (@x[$b3],$xc_)", + "&movdqa ($t1,@x[$b2])", + "&pslld (@x[$b2],7)", + "&psrld ($t1,25)", + "&movdqa ($t0,@x[$b3])", + "&pslld (@x[$b3],7)", + "&por (@x[$b2],$t1)", + "&psrld ($t0,25)", + "&movdqa ($t1,'(%r10)')", # .Lrot16(%rip) + "&por (@x[$b3],$t0)" + ); +} + +my $xframe = $win64 ? 
0xa8 : 8; + +$code.=<<___; +.global chacha20_4x +.type chacha20_4x,\@function,5 +.align 32 +chacha20_4x: +.cfi_startproc +.Lchacha20_4x: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 +# mov %r10,%r11 +___ +$code.=<<___ if ($avx>1); +# shr \$32,%r10 # OPENSSL_ia32cap_P+8 +# test \$`1<<5`,%r10 # test AVX2 +# jnz .Lchacha20_avx2 +___ +$code.=<<___; +# cmp \$192,$len +# ja .Lproceed4x + +# and \$`1<<26|1<<22`,%r11 # isolate XSAVE+MOVBE +# cmp \$`1<<22`,%r11 # check for MOVBE without XSAVE +# je .Ldo_sse3_after_all # to detect Atom + +.Lproceed4x: + sub \$0x140+$xframe,%rsp +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x40 constant copy of key[0-2] smashed by lanes + # ... + # +0x100 SIMD counters (with nonce smashed by lanes) + # ... + # +0x140 +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L4x_body: +___ +$code.=<<___; + movdqa .Lsigma(%rip),$xa3 # key[0] + movdqu ($key),$xb3 # key[1] + movdqu 16($key),$xt3 # key[2] + movdqu ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + lea .Lrot16(%rip),%r10 + lea .Lrot24(%rip),%r11 + + pshufd \$0x00,$xa3,$xa0 # smash key by lanes... + pshufd \$0x55,$xa3,$xa1 + movdqa $xa0,0x40(%rsp) # ... and offload + pshufd \$0xaa,$xa3,$xa2 + movdqa $xa1,0x50(%rsp) + pshufd \$0xff,$xa3,$xa3 + movdqa $xa2,0x60(%rsp) + movdqa $xa3,0x70(%rsp) + + pshufd \$0x00,$xb3,$xb0 + pshufd \$0x55,$xb3,$xb1 + movdqa $xb0,0x80-0x100(%rcx) + pshufd \$0xaa,$xb3,$xb2 + movdqa $xb1,0x90-0x100(%rcx) + pshufd \$0xff,$xb3,$xb3 + movdqa $xb2,0xa0-0x100(%rcx) + movdqa $xb3,0xb0-0x100(%rcx) + + pshufd \$0x00,$xt3,$xt0 # "$xc0" + pshufd \$0x55,$xt3,$xt1 # "$xc1" + movdqa $xt0,0xc0-0x100(%rcx) + pshufd \$0xaa,$xt3,$xt2 # "$xc2" + movdqa $xt1,0xd0-0x100(%rcx) + pshufd \$0xff,$xt3,$xt3 # "$xc3" + movdqa $xt2,0xe0-0x100(%rcx) + movdqa $xt3,0xf0-0x100(%rcx) + + pshufd \$0x00,$xd3,$xd0 + pshufd \$0x55,$xd3,$xd1 + paddd .Linc(%rip),$xd0 # don't save counters yet + pshufd \$0xaa,$xd3,$xd2 + movdqa $xd1,0x110-0x100(%rcx) + pshufd \$0xff,$xd3,$xd3 + movdqa $xd2,0x120-0x100(%rcx) + movdqa $xd3,0x130-0x100(%rcx) + + jmp .Loop_enter4x + +.align 32 +.Loop_outer4x: + movdqa 0x40(%rsp),$xa0 # re-load smashed key + movdqa 0x50(%rsp),$xa1 + movdqa 0x60(%rsp),$xa2 + movdqa 0x70(%rsp),$xa3 + movdqa 0x80-0x100(%rcx),$xb0 + movdqa 0x90-0x100(%rcx),$xb1 + movdqa 0xa0-0x100(%rcx),$xb2 + movdqa 0xb0-0x100(%rcx),$xb3 + movdqa 0xc0-0x100(%rcx),$xt0 # "$xc0" + movdqa 0xd0-0x100(%rcx),$xt1 # "$xc1" + movdqa 0xe0-0x100(%rcx),$xt2 # "$xc2" + movdqa 0xf0-0x100(%rcx),$xt3 # "$xc3" + movdqa 0x100-0x100(%rcx),$xd0 + movdqa 0x110-0x100(%rcx),$xd1 + movdqa 0x120-0x100(%rcx),$xd2 + movdqa 0x130-0x100(%rcx),$xd3 + paddd .Lfour(%rip),$xd0 # next SIMD counters + +.Loop_enter4x: + movdqa $xt2,0x20(%rsp) # SIMD equivalent of "@x[10]" + movdqa $xt3,0x30(%rsp) # SIMD equivalent of "@x[11]" + movdqa (%r10),$xt3 # .Lrot16(%rip) + mov \$10,%eax + movdqa $xd0,0x100-0x100(%rcx) # save SIMD counters + jmp .Loop4x + +.align 32 +.Loop4x: +___ + foreach (&SSSE3_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&SSSE3_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop4x + + paddd 0x40(%rsp),$xa0 # accumulate key material + paddd 0x50(%rsp),$xa1 + paddd 0x60(%rsp),$xa2 + paddd 0x70(%rsp),$xa3 + + movdqa $xa0,$xt2 # 
"de-interlace" data + punpckldq $xa1,$xa0 + movdqa $xa2,$xt3 + punpckldq $xa3,$xa2 + punpckhdq $xa1,$xt2 + punpckhdq $xa3,$xt3 + movdqa $xa0,$xa1 + punpcklqdq $xa2,$xa0 # "a0" + movdqa $xt2,$xa3 + punpcklqdq $xt3,$xt2 # "a2" + punpckhqdq $xa2,$xa1 # "a1" + punpckhqdq $xt3,$xa3 # "a3" +___ + ($xa2,$xt2)=($xt2,$xa2); +$code.=<<___; + paddd 0x80-0x100(%rcx),$xb0 + paddd 0x90-0x100(%rcx),$xb1 + paddd 0xa0-0x100(%rcx),$xb2 + paddd 0xb0-0x100(%rcx),$xb3 + + movdqa $xa0,0x00(%rsp) # offload $xaN + movdqa $xa1,0x10(%rsp) + movdqa 0x20(%rsp),$xa0 # "xc2" + movdqa 0x30(%rsp),$xa1 # "xc3" + + movdqa $xb0,$xt2 + punpckldq $xb1,$xb0 + movdqa $xb2,$xt3 + punpckldq $xb3,$xb2 + punpckhdq $xb1,$xt2 + punpckhdq $xb3,$xt3 + movdqa $xb0,$xb1 + punpcklqdq $xb2,$xb0 # "b0" + movdqa $xt2,$xb3 + punpcklqdq $xt3,$xt2 # "b2" + punpckhqdq $xb2,$xb1 # "b1" + punpckhqdq $xt3,$xb3 # "b3" +___ + ($xb2,$xt2)=($xt2,$xb2); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + paddd 0xc0-0x100(%rcx),$xc0 + paddd 0xd0-0x100(%rcx),$xc1 + paddd 0xe0-0x100(%rcx),$xc2 + paddd 0xf0-0x100(%rcx),$xc3 + + movdqa $xa2,0x20(%rsp) # keep offloading $xaN + movdqa $xa3,0x30(%rsp) + + movdqa $xc0,$xt2 + punpckldq $xc1,$xc0 + movdqa $xc2,$xt3 + punpckldq $xc3,$xc2 + punpckhdq $xc1,$xt2 + punpckhdq $xc3,$xt3 + movdqa $xc0,$xc1 + punpcklqdq $xc2,$xc0 # "c0" + movdqa $xt2,$xc3 + punpcklqdq $xt3,$xt2 # "c2" + punpckhqdq $xc2,$xc1 # "c1" + punpckhqdq $xt3,$xc3 # "c3" +___ + ($xc2,$xt2)=($xt2,$xc2); + ($xt0,$xt1)=($xa2,$xa3); # use $xaN as temporary +$code.=<<___; + paddd 0x100-0x100(%rcx),$xd0 + paddd 0x110-0x100(%rcx),$xd1 + paddd 0x120-0x100(%rcx),$xd2 + paddd 0x130-0x100(%rcx),$xd3 + + movdqa $xd0,$xt2 + punpckldq $xd1,$xd0 + movdqa $xd2,$xt3 + punpckldq $xd3,$xd2 + punpckhdq $xd1,$xt2 + punpckhdq $xd3,$xt3 + movdqa $xd0,$xd1 + punpcklqdq $xd2,$xd0 # "d0" + movdqa $xt2,$xd3 + punpcklqdq $xt3,$xt2 # "d2" + punpckhqdq $xd2,$xd1 # "d1" + punpckhqdq $xt3,$xd3 # "d3" +___ + ($xd2,$xt2)=($xt2,$xd2); +$code.=<<___; + cmp \$64*4,$len + jb .Ltail4x + + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? 
+ pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # size optimization + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + + movdqu $xt0,0x40($out) + movdqu 0x00($inp),$xt0 + movdqu $xt1,0x50($out) + movdqu 0x10($inp),$xt1 + movdqu $xt2,0x60($out) + movdqu 0x20($inp),$xt2 + movdqu $xt3,0x70($out) + lea 0x80($out),$out # size optimization + movdqu 0x30($inp),$xt3 + pxor 0x20(%rsp),$xt0 + pxor $xb2,$xt1 + pxor $xc2,$xt2 + pxor $xd2,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # inp+=64*4 + pxor 0x30(%rsp),$xt0 + pxor $xb3,$xt1 + pxor $xc3,$xt2 + pxor $xd3,$xt3 + movdqu $xt0,0x40($out) + movdqu $xt1,0x50($out) + movdqu $xt2,0x60($out) + movdqu $xt3,0x70($out) + lea 0x80($out),$out # out+=64*4 + + sub \$64*4,$len + jnz .Loop_outer4x + + jmp .Ldone4x + +.Ltail4x: + cmp \$192,$len + jae .L192_or_more4x + cmp \$128,$len + jae .L128_or_more4x + cmp \$64,$len + jae .L64_or_more4x + + #movdqa 0x00(%rsp),$xt0 # $xaN is offloaded, remember? + xor %r10,%r10 + #movdqa $xt0,0x00(%rsp) + movdqa $xb0,0x10(%rsp) + movdqa $xc0,0x20(%rsp) + movdqa $xd0,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L64_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaxN is offloaded, remember? + pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + movdqu $xt0,0x00($out) + movdqu $xt1,0x10($out) + movdqu $xt2,0x20($out) + movdqu $xt3,0x30($out) + je .Ldone4x + + movdqa 0x10(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x40($inp),$inp # inp+=64*1 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb1,0x10(%rsp) + lea 0x40($out),$out # out+=64*1 + movdqa $xc1,0x20(%rsp) + sub \$64,$len # len-=64*1 + movdqa $xd1,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L128_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? + pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + movdqu $xt0,0x40($out) + movdqu $xt1,0x50($out) + movdqu $xt2,0x60($out) + movdqu $xt3,0x70($out) + je .Ldone4x + + movdqa 0x20(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x80($inp),$inp # inp+=64*2 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb2,0x10(%rsp) + lea 0x80($out),$out # out+=64*2 + movdqa $xc2,0x20(%rsp) + sub \$128,$len # len-=64*2 + movdqa $xd2,0x30(%rsp) + jmp .Loop_tail4x + +.align 32 +.L192_or_more4x: + movdqu 0x00($inp),$xt0 # xor with input + movdqu 0x10($inp),$xt1 + movdqu 0x20($inp),$xt2 + movdqu 0x30($inp),$xt3 + pxor 0x00(%rsp),$xt0 # $xaN is offloaded, remember? 
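+	# Tail strategy: .L64/.L128/.L192_or_more4x each xor as many whole
+	# 64-byte blocks as remain, take je .Ldone4x on an exact multiple
+	# (flags are still valid from the corresponding cmp in the .Ltail4x
+	# dispatch), and otherwise stage the next block's keystream on the
+	# stack so .Loop_tail4x can finish the final 1..63 bytes one at a
+	# time.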
+ pxor $xb0,$xt1 + pxor $xc0,$xt2 + pxor $xd0,$xt3 + + movdqu $xt0,0x00($out) + movdqu 0x40($inp),$xt0 + movdqu $xt1,0x10($out) + movdqu 0x50($inp),$xt1 + movdqu $xt2,0x20($out) + movdqu 0x60($inp),$xt2 + movdqu $xt3,0x30($out) + movdqu 0x70($inp),$xt3 + lea 0x80($inp),$inp # size optimization + pxor 0x10(%rsp),$xt0 + pxor $xb1,$xt1 + pxor $xc1,$xt2 + pxor $xd1,$xt3 + + movdqu $xt0,0x40($out) + movdqu 0x00($inp),$xt0 + movdqu $xt1,0x50($out) + movdqu 0x10($inp),$xt1 + movdqu $xt2,0x60($out) + movdqu 0x20($inp),$xt2 + movdqu $xt3,0x70($out) + lea 0x80($out),$out # size optimization + movdqu 0x30($inp),$xt3 + pxor 0x20(%rsp),$xt0 + pxor $xb2,$xt1 + pxor $xc2,$xt2 + pxor $xd2,$xt3 + movdqu $xt0,0x00($out) + movdqu $xt1,0x10($out) + movdqu $xt2,0x20($out) + movdqu $xt3,0x30($out) + je .Ldone4x + + movdqa 0x30(%rsp),$xt0 # $xaN is offloaded, remember? + lea 0x40($inp),$inp # inp+=64*3 + xor %r10,%r10 + movdqa $xt0,0x00(%rsp) + movdqa $xb3,0x10(%rsp) + lea 0x40($out),$out # out+=64*3 + movdqa $xc3,0x20(%rsp) + sub \$192,$len # len-=64*3 + movdqa $xd3,0x30(%rsp) + +.Loop_tail4x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail4x + +.Ldone4x: +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4x_epilogue: + ret +.cfi_endproc +.size chacha20_4x,.-chacha20_4x +___ +} + +######################################################################## +# XOP code path that handles all lengths. +if ($avx && 0) { +# There is some "anomaly" observed depending on instructions' size or +# alignment. If you look closely at below code you'll notice that +# sometimes argument order varies. The order affects instruction +# encoding by making it larger, and such fiddling gives 5% performance +# improvement. This is on FX-4100... 
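+# Note: every *_lane_ROUND in this file derives Q2..Q4 from Q1 with the
+# same index arithmetic, which simply advances an index to the next
+# member of its quartet.  A minimal self-check (the sub name is an
+# assumption of this note, unused by the generator):
+sub chacha_next_quartet { map { ($_ & ~3) + (($_ + 1) & 3) } @_; }
+# chacha_next_quartet(0,4,8,12)  returns (1,5,9,13)  - column rounds, Q2
+# chacha_next_quartet(0,5,10,15) returns (1,6,11,12) - diagonal rounds, Q2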
+ +my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3, + $xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%xmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xt0,$xt1,$xt2,$xt3, $xd0,$xd1,$xd2,$xd3); + +sub XOP_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my @x=map("\"$_\"",@xx); + + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vprotd (@x[$d0],@x[$d0],16)", + "&vprotd (@x[$d1],@x[$d1],16)", + "&vprotd (@x[$d2],@x[$d2],16)", + "&vprotd (@x[$d3],@x[$d3],16)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxor (@x[$b0],@x[$c0],@x[$b0])", + "&vpxor (@x[$b1],@x[$c1],@x[$b1])", + "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip + "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip + "&vprotd (@x[$b0],@x[$b0],12)", + "&vprotd (@x[$b1],@x[$b1],12)", + "&vprotd (@x[$b2],@x[$b2],12)", + "&vprotd (@x[$b3],@x[$b3],12)", + + "&vpaddd (@x[$a0],@x[$b0],@x[$a0])", # flip + "&vpaddd (@x[$a1],@x[$b1],@x[$a1])", # flip + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vprotd (@x[$d0],@x[$d0],8)", + "&vprotd (@x[$d1],@x[$d1],8)", + "&vprotd (@x[$d2],@x[$d2],8)", + "&vprotd (@x[$d3],@x[$d3],8)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxor (@x[$b0],@x[$c0],@x[$b0])", + "&vpxor (@x[$b1],@x[$c1],@x[$b1])", + "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip + "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip + "&vprotd (@x[$b0],@x[$b0],7)", + "&vprotd (@x[$b1],@x[$b1],7)", + "&vprotd (@x[$b2],@x[$b2],7)", + "&vprotd (@x[$b3],@x[$b3],7)" + ); +} + +my $xframe = $win64 ? 0xa8 : 8; + +$code.=<<___; +.global chacha20_4xop +.type chacha20_4xop,\@function,5 +.align 32 +chacha20_4xop: +.cfi_startproc +.Lchacha20_4xop: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + sub \$0x140+$xframe,%rsp +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x40 constant copy of key[0-2] smashed by lanes + # ... + # +0x100 SIMD counters (with nonce smashed by lanes) + # ... + # +0x140 +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L4xop_body: +___ +$code.=<<___; + vzeroupper + + vmovdqa .Lsigma(%rip),$xa3 # key[0] + vmovdqu ($key),$xb3 # key[1] + vmovdqu 16($key),$xt3 # key[2] + vmovdqu ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vmovdqa $xa0,0x40(%rsp) # ... 
and offload + vpshufd \$0xaa,$xa3,$xa2 + vmovdqa $xa1,0x50(%rsp) + vpshufd \$0xff,$xa3,$xa3 + vmovdqa $xa2,0x60(%rsp) + vmovdqa $xa3,0x70(%rsp) + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vmovdqa $xb0,0x80-0x100(%rcx) + vpshufd \$0xaa,$xb3,$xb2 + vmovdqa $xb1,0x90-0x100(%rcx) + vpshufd \$0xff,$xb3,$xb3 + vmovdqa $xb2,0xa0-0x100(%rcx) + vmovdqa $xb3,0xb0-0x100(%rcx) + + vpshufd \$0x00,$xt3,$xt0 # "$xc0" + vpshufd \$0x55,$xt3,$xt1 # "$xc1" + vmovdqa $xt0,0xc0-0x100(%rcx) + vpshufd \$0xaa,$xt3,$xt2 # "$xc2" + vmovdqa $xt1,0xd0-0x100(%rcx) + vpshufd \$0xff,$xt3,$xt3 # "$xc3" + vmovdqa $xt2,0xe0-0x100(%rcx) + vmovdqa $xt3,0xf0-0x100(%rcx) + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpaddd .Linc(%rip),$xd0,$xd0 # don't save counters yet + vpshufd \$0xaa,$xd3,$xd2 + vmovdqa $xd1,0x110-0x100(%rcx) + vpshufd \$0xff,$xd3,$xd3 + vmovdqa $xd2,0x120-0x100(%rcx) + vmovdqa $xd3,0x130-0x100(%rcx) + + jmp .Loop_enter4xop + +.align 32 +.Loop_outer4xop: + vmovdqa 0x40(%rsp),$xa0 # re-load smashed key + vmovdqa 0x50(%rsp),$xa1 + vmovdqa 0x60(%rsp),$xa2 + vmovdqa 0x70(%rsp),$xa3 + vmovdqa 0x80-0x100(%rcx),$xb0 + vmovdqa 0x90-0x100(%rcx),$xb1 + vmovdqa 0xa0-0x100(%rcx),$xb2 + vmovdqa 0xb0-0x100(%rcx),$xb3 + vmovdqa 0xc0-0x100(%rcx),$xt0 # "$xc0" + vmovdqa 0xd0-0x100(%rcx),$xt1 # "$xc1" + vmovdqa 0xe0-0x100(%rcx),$xt2 # "$xc2" + vmovdqa 0xf0-0x100(%rcx),$xt3 # "$xc3" + vmovdqa 0x100-0x100(%rcx),$xd0 + vmovdqa 0x110-0x100(%rcx),$xd1 + vmovdqa 0x120-0x100(%rcx),$xd2 + vmovdqa 0x130-0x100(%rcx),$xd3 + vpaddd .Lfour(%rip),$xd0,$xd0 # next SIMD counters + +.Loop_enter4xop: + mov \$10,%eax + vmovdqa $xd0,0x100-0x100(%rcx) # save SIMD counters + jmp .Loop4xop + +.align 32 +.Loop4xop: +___ + foreach (&XOP_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&XOP_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop4xop + + vpaddd 0x40(%rsp),$xa0,$xa0 # accumulate key material + vpaddd 0x50(%rsp),$xa1,$xa1 + vpaddd 0x60(%rsp),$xa2,$xa2 + vpaddd 0x70(%rsp),$xa3,$xa3 + + vmovdqa $xt2,0x20(%rsp) # offload $xc2,3 + vmovdqa $xt3,0x30(%rsp) + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd 0x80-0x100(%rcx),$xb0,$xb0 + vpaddd 0x90-0x100(%rcx),$xb1,$xb1 + vpaddd 0xa0-0x100(%rcx),$xb2,$xb2 + vpaddd 0xb0-0x100(%rcx),$xb3,$xb3 + + vmovdqa $xa0,0x00(%rsp) # offload $xa0,1 + vmovdqa $xa1,0x10(%rsp) + vmovdqa 0x20(%rsp),$xa0 # "xc2" + vmovdqa 0x30(%rsp),$xa1 # "xc3" + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + vpaddd 0xc0-0x100(%rcx),$xc0,$xc0 + vpaddd 0xd0-0x100(%rcx),$xc1,$xc1 + vpaddd 0xe0-0x100(%rcx),$xc2,$xc2 + vpaddd 0xf0-0x100(%rcx),$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + 
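+# Note: the list assignment below (like its siblings after every transpose
+# in this file) emits no instruction; it only permutes which Perl variable
+# names which register, so subsequent code keeps addressing the transposed
+# results by their logical "c0".."c3" names.  The idiom in isolation, with
+# throwaway names assumed by this note:
+#	my ($r0,$r1) = ("%xmm0","%xmm1");
+#	($r0,$r1) = ($r1,$r0);	# swaps the names only; no code is generated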
($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd 0x100-0x100(%rcx),$xd0,$xd0 + vpaddd 0x110-0x100(%rcx),$xd1,$xd1 + vpaddd 0x120-0x100(%rcx),$xd2,$xd2 + vpaddd 0x130-0x100(%rcx),$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); + ($xa0,$xa1)=($xt2,$xt3); +$code.=<<___; + vmovdqa 0x00(%rsp),$xa0 # restore $xa0,1 + vmovdqa 0x10(%rsp),$xa1 + + cmp \$64*4,$len + jb .Ltail4xop + + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x10($inp),$xb2,$xb2 + vpxor 0x20($inp),$xc2,$xc2 + vpxor 0x30($inp),$xd2,$xd2 + vpxor 0x40($inp),$xa3,$xa3 + vpxor 0x50($inp),$xb3,$xb3 + vpxor 0x60($inp),$xc3,$xc3 + vpxor 0x70($inp),$xd3,$xd3 + lea 0x80($inp),$inp # inp+=64*4 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + lea 0x80($out),$out # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x10($out) + vmovdqu $xc2,0x20($out) + vmovdqu $xd2,0x30($out) + vmovdqu $xa3,0x40($out) + vmovdqu $xb3,0x50($out) + vmovdqu $xc3,0x60($out) + vmovdqu $xd3,0x70($out) + lea 0x80($out),$out # out+=64*4 + + sub \$64*4,$len + jnz .Loop_outer4xop + + jmp .Ldone4xop + +.align 32 +.Ltail4xop: + cmp \$192,$len + jae .L192_or_more4xop + cmp \$128,$len + jae .L128_or_more4xop + cmp \$64,$len + jae .L64_or_more4xop + + xor %r10,%r10 + vmovdqa $xa0,0x00(%rsp) + vmovdqa $xb0,0x10(%rsp) + vmovdqa $xc0,0x20(%rsp) + vmovdqa $xd0,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L64_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + je .Ldone4xop + + lea 0x40($inp),$inp # inp+=64*1 + vmovdqa $xa1,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb1,0x10(%rsp) + lea 0x40($out),$out # out+=64*1 + vmovdqa $xc1,0x20(%rsp) + sub \$64,$len # len-=64*1 + vmovdqa $xd1,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L128_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + je .Ldone4xop + + lea 0x80($inp),$inp # inp+=64*2 + vmovdqa $xa2,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb2,0x10(%rsp) + lea 0x80($out),$out # out+=64*2 + vmovdqa $xc2,0x20(%rsp) + sub \$128,$len # len-=64*2 + vmovdqa $xd2,0x30(%rsp) + jmp .Loop_tail4xop + +.align 32 +.L192_or_more4xop: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x10($inp),$xb0,$xb0 + vpxor 0x20($inp),$xc0,$xc0 + vpxor 0x30($inp),$xd0,$xd0 + vpxor 
0x40($inp),$xa1,$xa1 + vpxor 0x50($inp),$xb1,$xb1 + vpxor 0x60($inp),$xc1,$xc1 + vpxor 0x70($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x10($inp),$xb2,$xb2 + vpxor 0x20($inp),$xc2,$xc2 + vpxor 0x30($inp),$xd2,$xd2 + + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x10($out) + vmovdqu $xc0,0x20($out) + vmovdqu $xd0,0x30($out) + vmovdqu $xa1,0x40($out) + vmovdqu $xb1,0x50($out) + vmovdqu $xc1,0x60($out) + vmovdqu $xd1,0x70($out) + lea 0x80($out),$out # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x10($out) + vmovdqu $xc2,0x20($out) + vmovdqu $xd2,0x30($out) + je .Ldone4xop + + lea 0x40($inp),$inp # inp+=64*3 + vmovdqa $xa3,0x00(%rsp) + xor %r10,%r10 + vmovdqa $xb3,0x10(%rsp) + lea 0x40($out),$out # out+=64*3 + vmovdqa $xc3,0x20(%rsp) + sub \$192,$len # len-=64*3 + vmovdqa $xd3,0x30(%rsp) + +.Loop_tail4xop: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail4xop + +.Ldone4xop: + vzeroupper +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L4xop_epilogue: + ret +.cfi_endproc +.size chacha20_4xop,.-chacha20_4xop +___ +} + +######################################################################## +# AVX2 code path +if ($avx>1) { +my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3, + $xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%ymm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + "%nox","%nox","%nox","%nox", $xd0,$xd1,$xd2,$xd3); + +sub AVX2_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my ($xc,$xc_,$t0,$t1)=map("\"$_\"",$xt0,$xt1,$xt2,$xt3); +my @x=map("\"$_\"",@xx); + + # Consider order in which variables are addressed by their + # index: + # + # a b c d + # + # 0 4 8 12 < even round + # 1 5 9 13 + # 2 6 10 14 + # 3 7 11 15 + # 0 5 10 15 < odd round + # 1 6 11 12 + # 2 7 8 13 + # 3 4 9 14 + # + # 'a', 'b' and 'd's are permanently allocated in registers, + # @x[0..7,12..15], while 'c's are maintained in memory. If + # you observe 'c' column, you'll notice that pair of 'c's is + # invariant between rounds. This means that we have to reload + # them once per round, in the middle. This is why you'll see + # bunch of 'c' stores and loads in the middle, but none in + # the beginning or end. 
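+	# Register-pressure note: the byte-rotate shuffle masks (.Lrot16 and
+	# .Lrot24, addressed via %r10/%r11) share registers with the shift
+	# temporaries, so each mask is re-broadcast from memory right after
+	# the temporary that held it is clobbered.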
+ + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpshufb (@x[$d0],@x[$d0],$t1)", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpshufb (@x[$d1],@x[$d1],$t1)", + + "&vpaddd ($xc,$xc,@x[$d0])", + "&vpxor (@x[$b0],$xc,@x[$b0])", + "&vpslld ($t0,@x[$b0],12)", + "&vpsrld (@x[$b0],@x[$b0],20)", + "&vpor (@x[$b0],$t0,@x[$b0])", + "&vbroadcasti128($t0,'(%r11)')", # .Lrot24(%rip) + "&vpaddd ($xc_,$xc_,@x[$d1])", + "&vpxor (@x[$b1],$xc_,@x[$b1])", + "&vpslld ($t1,@x[$b1],12)", + "&vpsrld (@x[$b1],@x[$b1],20)", + "&vpor (@x[$b1],$t1,@x[$b1])", + + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", + "&vpxor (@x[$d0],@x[$a0],@x[$d0])", + "&vpshufb (@x[$d0],@x[$d0],$t0)", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", + "&vpxor (@x[$d1],@x[$a1],@x[$d1])", + "&vpshufb (@x[$d1],@x[$d1],$t0)", + + "&vpaddd ($xc,$xc,@x[$d0])", + "&vpxor (@x[$b0],$xc,@x[$b0])", + "&vpslld ($t1,@x[$b0],7)", + "&vpsrld (@x[$b0],@x[$b0],25)", + "&vpor (@x[$b0],$t1,@x[$b0])", + "&vbroadcasti128($t1,'(%r10)')", # .Lrot16(%rip) + "&vpaddd ($xc_,$xc_,@x[$d1])", + "&vpxor (@x[$b1],$xc_,@x[$b1])", + "&vpslld ($t0,@x[$b1],7)", + "&vpsrld (@x[$b1],@x[$b1],25)", + "&vpor (@x[$b1],$t0,@x[$b1])", + + "&vmovdqa (\"`32*($c0-8)`(%rsp)\",$xc)", # reload pair of 'c's + "&vmovdqa (\"`32*($c1-8)`(%rsp)\",$xc_)", + "&vmovdqa ($xc,\"`32*($c2-8)`(%rsp)\")", + "&vmovdqa ($xc_,\"`32*($c3-8)`(%rsp)\")", + + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpshufb (@x[$d2],@x[$d2],$t1)", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vpshufb (@x[$d3],@x[$d3],$t1)", + + "&vpaddd ($xc,$xc,@x[$d2])", + "&vpxor (@x[$b2],$xc,@x[$b2])", + "&vpslld ($t0,@x[$b2],12)", + "&vpsrld (@x[$b2],@x[$b2],20)", + "&vpor (@x[$b2],$t0,@x[$b2])", + "&vbroadcasti128($t0,'(%r11)')", # .Lrot24(%rip) + "&vpaddd ($xc_,$xc_,@x[$d3])", + "&vpxor (@x[$b3],$xc_,@x[$b3])", + "&vpslld ($t1,@x[$b3],12)", + "&vpsrld (@x[$b3],@x[$b3],20)", + "&vpor (@x[$b3],$t1,@x[$b3])", + + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpxor (@x[$d2],@x[$a2],@x[$d2])", + "&vpshufb (@x[$d2],@x[$d2],$t0)", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxor (@x[$d3],@x[$a3],@x[$d3])", + "&vpshufb (@x[$d3],@x[$d3],$t0)", + + "&vpaddd ($xc,$xc,@x[$d2])", + "&vpxor (@x[$b2],$xc,@x[$b2])", + "&vpslld ($t1,@x[$b2],7)", + "&vpsrld (@x[$b2],@x[$b2],25)", + "&vpor (@x[$b2],$t1,@x[$b2])", + "&vbroadcasti128($t1,'(%r10)')", # .Lrot16(%rip) + "&vpaddd ($xc_,$xc_,@x[$d3])", + "&vpxor (@x[$b3],$xc_,@x[$b3])", + "&vpslld ($t0,@x[$b3],7)", + "&vpsrld (@x[$b3],@x[$b3],25)", + "&vpor (@x[$b3],$t0,@x[$b3])" + ); +} + +my $xframe = $win64 ? 0xa8 : 8; + +$code.=<<___; +.global chacha20_avx2 +.type chacha20_avx2,\@function,5 +.align 32 +chacha20_avx2: +.cfi_startproc +.Lchacha20_avx2: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$0x280+$xframe,%rsp + and \$-32,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L8x_body: +___ + ################ stack layout + # +0x00 SIMD equivalent of @x[8-12] + # ... + # +0x80 constant copy of key[0-2] smashed by lanes + # ... + # +0x200 SIMD counters (with nonce smashed by lanes) + # ... 
+ # +0x280 + +$code.=<<___; + vzeroupper + + vbroadcasti128 .Lsigma(%rip),$xa3 # key[0] + vbroadcasti128 ($key),$xb3 # key[1] + vbroadcasti128 16($key),$xt3 # key[2] + vbroadcasti128 ($counter),$xd3 # key[3] + lea 0x100(%rsp),%rcx # size optimization + lea 0x200(%rsp),%rax # size optimization + lea .Lrot16(%rip),%r10 + lea .Lrot24(%rip),%r11 + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vmovdqa $xa0,0x80-0x100(%rcx) # ... and offload + vpshufd \$0xaa,$xa3,$xa2 + vmovdqa $xa1,0xa0-0x100(%rcx) + vpshufd \$0xff,$xa3,$xa3 + vmovdqa $xa2,0xc0-0x100(%rcx) + vmovdqa $xa3,0xe0-0x100(%rcx) + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vmovdqa $xb0,0x100-0x100(%rcx) + vpshufd \$0xaa,$xb3,$xb2 + vmovdqa $xb1,0x120-0x100(%rcx) + vpshufd \$0xff,$xb3,$xb3 + vmovdqa $xb2,0x140-0x100(%rcx) + vmovdqa $xb3,0x160-0x100(%rcx) + + vpshufd \$0x00,$xt3,$xt0 # "xc0" + vpshufd \$0x55,$xt3,$xt1 # "xc1" + vmovdqa $xt0,0x180-0x200(%rax) + vpshufd \$0xaa,$xt3,$xt2 # "xc2" + vmovdqa $xt1,0x1a0-0x200(%rax) + vpshufd \$0xff,$xt3,$xt3 # "xc3" + vmovdqa $xt2,0x1c0-0x200(%rax) + vmovdqa $xt3,0x1e0-0x200(%rax) + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpaddd .Lincy(%rip),$xd0,$xd0 # don't save counters yet + vpshufd \$0xaa,$xd3,$xd2 + vmovdqa $xd1,0x220-0x200(%rax) + vpshufd \$0xff,$xd3,$xd3 + vmovdqa $xd2,0x240-0x200(%rax) + vmovdqa $xd3,0x260-0x200(%rax) + + jmp .Loop_enter8x + +.align 32 +.Loop_outer8x: + vmovdqa 0x80-0x100(%rcx),$xa0 # re-load smashed key + vmovdqa 0xa0-0x100(%rcx),$xa1 + vmovdqa 0xc0-0x100(%rcx),$xa2 + vmovdqa 0xe0-0x100(%rcx),$xa3 + vmovdqa 0x100-0x100(%rcx),$xb0 + vmovdqa 0x120-0x100(%rcx),$xb1 + vmovdqa 0x140-0x100(%rcx),$xb2 + vmovdqa 0x160-0x100(%rcx),$xb3 + vmovdqa 0x180-0x200(%rax),$xt0 # "xc0" + vmovdqa 0x1a0-0x200(%rax),$xt1 # "xc1" + vmovdqa 0x1c0-0x200(%rax),$xt2 # "xc2" + vmovdqa 0x1e0-0x200(%rax),$xt3 # "xc3" + vmovdqa 0x200-0x200(%rax),$xd0 + vmovdqa 0x220-0x200(%rax),$xd1 + vmovdqa 0x240-0x200(%rax),$xd2 + vmovdqa 0x260-0x200(%rax),$xd3 + vpaddd .Leight(%rip),$xd0,$xd0 # next SIMD counters + +.Loop_enter8x: + vmovdqa $xt2,0x40(%rsp) # SIMD equivalent of "@x[10]" + vmovdqa $xt3,0x60(%rsp) # SIMD equivalent of "@x[11]" + vbroadcasti128 (%r10),$xt3 + vmovdqa $xd0,0x200-0x200(%rax) # save SIMD counters + mov \$10,%eax + jmp .Loop8x + +.align 32 +.Loop8x: +___ + foreach (&AVX2_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX2_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop8x + + lea 0x200(%rsp),%rax # size optimization + vpaddd 0x80-0x100(%rcx),$xa0,$xa0 # accumulate key + vpaddd 0xa0-0x100(%rcx),$xa1,$xa1 + vpaddd 0xc0-0x100(%rcx),$xa2,$xa2 + vpaddd 0xe0-0x100(%rcx),$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd 0x100-0x100(%rcx),$xb0,$xb0 + vpaddd 0x120-0x100(%rcx),$xb1,$xb1 + vpaddd 0x140-0x100(%rcx),$xb2,$xb2 + vpaddd 0x160-0x100(%rcx),$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + 
vperm2i128 \$0x20,$xb0,$xa0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xb0,$xa0,$xb0 + vperm2i128 \$0x20,$xb1,$xa1,$xa0 + vperm2i128 \$0x31,$xb1,$xa1,$xb1 + vperm2i128 \$0x20,$xb2,$xa2,$xa1 + vperm2i128 \$0x31,$xb2,$xa2,$xb2 + vperm2i128 \$0x20,$xb3,$xa3,$xa2 + vperm2i128 \$0x31,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); + my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1); +$code.=<<___; + vmovdqa $xa0,0x00(%rsp) # offload $xaN + vmovdqa $xa1,0x20(%rsp) + vmovdqa 0x40(%rsp),$xc2 # $xa0 + vmovdqa 0x60(%rsp),$xc3 # $xa1 + + vpaddd 0x180-0x200(%rax),$xc0,$xc0 + vpaddd 0x1a0-0x200(%rax),$xc1,$xc1 + vpaddd 0x1c0-0x200(%rax),$xc2,$xc2 + vpaddd 0x1e0-0x200(%rax),$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd 0x200-0x200(%rax),$xd0,$xd0 + vpaddd 0x220-0x200(%rax),$xd1,$xd1 + vpaddd 0x240-0x200(%rax),$xd2,$xd2 + vpaddd 0x260-0x200(%rax),$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vperm2i128 \$0x20,$xd0,$xc0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xd0,$xc0,$xd0 + vperm2i128 \$0x20,$xd1,$xc1,$xc0 + vperm2i128 \$0x31,$xd1,$xc1,$xd1 + vperm2i128 \$0x20,$xd2,$xc2,$xc1 + vperm2i128 \$0x31,$xd2,$xc2,$xd2 + vperm2i128 \$0x20,$xd3,$xc3,$xc2 + vperm2i128 \$0x31,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); + ($xb0,$xb1,$xb2,$xb3,$xc0,$xc1,$xc2,$xc3)= + ($xc0,$xc1,$xc2,$xc3,$xb0,$xb1,$xb2,$xb3); + ($xa0,$xa1)=($xt2,$xt3); +$code.=<<___; + vmovdqa 0x00(%rsp),$xa0 # $xaN was offloaded, remember? 
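+	# Eight 64-byte blocks (512 bytes) are produced per outer iteration.
+	# After the vperm2i128 de-interlacing above, each ymm register holds
+	# 32 consecutive keystream bytes, so every (a,b,c,d) quartet covers
+	# two adjacent blocks; rows a0/a1 are restored from their stack
+	# slots here.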
+ vmovdqa 0x20(%rsp),$xa1 + + cmp \$64*8,$len + jb .Ltail8x + + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa1,$xa1 + vpxor 0x20($inp),$xb1,$xb1 + vpxor 0x40($inp),$xc1,$xc1 + vpxor 0x60($inp),$xd1,$xd1 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa1,0x00($out) + vmovdqu $xb1,0x20($out) + vmovdqu $xc1,0x40($out) + vmovdqu $xd1,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa2,$xa2 + vpxor 0x20($inp),$xb2,$xb2 + vpxor 0x40($inp),$xc2,$xc2 + vpxor 0x60($inp),$xd2,$xd2 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa2,0x00($out) + vmovdqu $xb2,0x20($out) + vmovdqu $xc2,0x40($out) + vmovdqu $xd2,0x60($out) + lea 0x80($out),$out # size optimization + + vpxor 0x00($inp),$xa3,$xa3 + vpxor 0x20($inp),$xb3,$xb3 + vpxor 0x40($inp),$xc3,$xc3 + vpxor 0x60($inp),$xd3,$xd3 + lea 0x80($inp),$inp # size optimization + vmovdqu $xa3,0x00($out) + vmovdqu $xb3,0x20($out) + vmovdqu $xc3,0x40($out) + vmovdqu $xd3,0x60($out) + lea 0x80($out),$out # size optimization + + sub \$64*8,$len + jnz .Loop_outer8x + + jmp .Ldone8x + +.Ltail8x: + cmp \$448,$len + jae .L448_or_more8x + cmp \$384,$len + jae .L384_or_more8x + cmp \$320,$len + jae .L320_or_more8x + cmp \$256,$len + jae .L256_or_more8x + cmp \$192,$len + jae .L192_or_more8x + cmp \$128,$len + jae .L128_or_more8x + cmp \$64,$len + jae .L64_or_more8x + + xor %r10,%r10 + vmovdqa $xa0,0x00(%rsp) + vmovdqa $xb0,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L64_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + je .Ldone8x + + lea 0x40($inp),$inp # inp+=64*1 + xor %r10,%r10 + vmovdqa $xc0,0x00(%rsp) + lea 0x40($out),$out # out+=64*1 + sub \$64,$len # len-=64*1 + vmovdqa $xd0,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L128_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + je .Ldone8x + + lea 0x80($inp),$inp # inp+=64*2 + xor %r10,%r10 + vmovdqa $xa1,0x00(%rsp) + lea 0x80($out),$out # out+=64*2 + sub \$128,$len # len-=64*2 + vmovdqa $xb1,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L192_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + je .Ldone8x + + lea 0xc0($inp),$inp # inp+=64*3 + xor %r10,%r10 + vmovdqa $xc1,0x00(%rsp) + lea 0xc0($out),$out # out+=64*3 + sub \$192,$len # len-=64*3 + vmovdqa $xd1,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L256_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu 
$xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + je .Ldone8x + + lea 0x100($inp),$inp # inp+=64*4 + xor %r10,%r10 + vmovdqa $xa2,0x00(%rsp) + lea 0x100($out),$out # out+=64*4 + sub \$256,$len # len-=64*4 + vmovdqa $xb2,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L320_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + je .Ldone8x + + lea 0x140($inp),$inp # inp+=64*5 + xor %r10,%r10 + vmovdqa $xc2,0x00(%rsp) + lea 0x140($out),$out # out+=64*5 + sub \$320,$len # len-=64*5 + vmovdqa $xd2,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L384_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vpxor 0x140($inp),$xc2,$xc2 + vpxor 0x160($inp),$xd2,$xd2 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + vmovdqu $xc2,0x140($out) + vmovdqu $xd2,0x160($out) + je .Ldone8x + + lea 0x180($inp),$inp # inp+=64*6 + xor %r10,%r10 + vmovdqa $xa3,0x00(%rsp) + lea 0x180($out),$out # out+=64*6 + sub \$384,$len # len-=64*6 + vmovdqa $xb3,0x20(%rsp) + jmp .Loop_tail8x + +.align 32 +.L448_or_more8x: + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + vpxor 0x80($inp),$xa1,$xa1 + vpxor 0xa0($inp),$xb1,$xb1 + vpxor 0xc0($inp),$xc1,$xc1 + vpxor 0xe0($inp),$xd1,$xd1 + vpxor 0x100($inp),$xa2,$xa2 + vpxor 0x120($inp),$xb2,$xb2 + vpxor 0x140($inp),$xc2,$xc2 + vpxor 0x160($inp),$xd2,$xd2 + vpxor 0x180($inp),$xa3,$xa3 + vpxor 0x1a0($inp),$xb3,$xb3 + vmovdqu $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + vmovdqu $xa1,0x80($out) + vmovdqu $xb1,0xa0($out) + vmovdqu $xc1,0xc0($out) + vmovdqu $xd1,0xe0($out) + vmovdqu $xa2,0x100($out) + vmovdqu $xb2,0x120($out) + vmovdqu $xc2,0x140($out) + vmovdqu $xd2,0x160($out) + vmovdqu $xa3,0x180($out) + vmovdqu $xb3,0x1a0($out) + je .Ldone8x + + lea 0x1c0($inp),$inp # inp+=64*7 + xor %r10,%r10 + vmovdqa $xc3,0x00(%rsp) + lea 0x1c0($out),$out # out+=64*7 + sub \$448,$len # len-=64*7 + vmovdqa $xd3,0x20(%rsp) + +.Loop_tail8x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail8x + +.Ldone8x: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp 
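+	# Epilogue: the vzeroall above both avoids AVX-to-SSE transition
+	# penalties and scrubs keystream material from the vector registers;
+	# the stack pointer is restored from the frame register with one lea.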
+.L8x_epilogue: + ret +.cfi_endproc +.size chacha20_avx2,.-chacha20_avx2 +___ +} + +######################################################################## +# AVX512 code paths +if ($avx>2) { +# This one handles shorter inputs... + +my ($a,$b,$c,$d, $a_,$b_,$c_,$d_,$fourz) = map("%zmm$_",(0..3,16..20)); +my ($t0,$t1,$t2,$t3) = map("%xmm$_",(4..7)); + +sub vpxord() # size optimization +{ my $opcode = "vpxor"; # adhere to vpxor when possible + + foreach (@_) { + if (/%([zy])mm([0-9]+)/ && ($1 eq "z" || $2>=16)) { + $opcode = "vpxord"; + last; + } + } + + $code .= "\t$opcode\t".join(',',reverse @_)."\n"; +} + +sub AVX512ROUND { # critical path is 14 "SIMD ticks" per round + &vpaddd ($a,$a,$b); + &vpxord ($d,$d,$a); + &vprold ($d,$d,16); + + &vpaddd ($c,$c,$d); + &vpxord ($b,$b,$c); + &vprold ($b,$b,12); + + &vpaddd ($a,$a,$b); + &vpxord ($d,$d,$a); + &vprold ($d,$d,8); + + &vpaddd ($c,$c,$d); + &vpxord ($b,$b,$c); + &vprold ($b,$b,7); +} + +my $xframe = $win64 ? 32+8 : 8; + +$code.=<<___; +.global chacha20_avx512 +.type chacha20_avx512,\@function,5 +.align 32 +chacha20_avx512: +.cfi_startproc +.Lchacha20_avx512: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + cmp \$512,$len + ja .Lchacha20_16x + + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lavx512_body: +___ +$code.=<<___; + vbroadcasti32x4 .Lsigma(%rip),$a + vbroadcasti32x4 ($key),$b + vbroadcasti32x4 16($key),$c + vbroadcasti32x4 ($counter),$d + + vmovdqa32 $a,$a_ + vmovdqa32 $b,$b_ + vmovdqa32 $c,$c_ + vpaddd .Lzeroz(%rip),$d,$d + vmovdqa32 .Lfourz(%rip),$fourz + mov \$10,$counter # reuse $counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512 + +.align 16 +.Loop_outer_avx512: + vmovdqa32 $a_,$a + vmovdqa32 $b_,$b + vmovdqa32 $c_,$c + vpaddd $fourz,$d_,$d + mov \$10,$counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: +___ + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b00111001); + &vpshufd ($d,$d,0b10010011); + + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b10010011); + &vpshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_avx512"); + +$code.=<<___; + vpaddd $a_,$a,$a + vpaddd $b_,$b,$b + vpaddd $c_,$c,$c + vpaddd $d_,$d,$d + + sub \$64,$len + jb .Ltail64_avx512 + + vpxor 0x00($inp),%x#$a,$t0 # xor with input + vpxor 0x10($inp),%x#$b,$t1 + vpxor 0x20($inp),%x#$c,$t2 + vpxor 0x30($inp),%x#$d,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$1,$a,$t0 + vextracti32x4 \$1,$b,$t1 + vextracti32x4 \$1,$c,$t2 + vextracti32x4 \$1,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$2,$a,$t0 + vextracti32x4 \$2,$b,$t1 + vextracti32x4 \$2,$c,$t2 + vextracti32x4 \$2,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out 
# out+=64 + + jz .Ldone_avx512 + + vextracti32x4 \$3,$a,$t0 + vextracti32x4 \$3,$b,$t1 + vextracti32x4 \$3,$c,$t2 + vextracti32x4 \$3,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512 + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jnz .Loop_outer_avx512 + + jmp .Ldone_avx512 + +.align 16 +.Ltail64_avx512: + vmovdqa %x#$a,0x00(%rsp) + vmovdqa %x#$b,0x10(%rsp) + vmovdqa %x#$c,0x20(%rsp) + vmovdqa %x#$d,0x30(%rsp) + add \$64,$len + jmp .Loop_tail_avx512 + +.align 16 +.Ltail_avx512: + vmovdqa $t0,0x00(%rsp) + vmovdqa $t1,0x10(%rsp) + vmovdqa $t2,0x20(%rsp) + vmovdqa $t3,0x30(%rsp) + add \$64,$len + +.Loop_tail_avx512: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_avx512 + + vmovdqu32 $a_,0x00(%rsp) + +.Ldone_avx512: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512_epilogue: + ret +.cfi_endproc +.size chacha20_avx512,.-chacha20_avx512 +___ + +map(s/%z/%y/, $a,$b,$c,$d, $a_,$b_,$c_,$d_,$fourz); + +$code.=<<___; +.global chacha20_avx512vl +.type chacha20_avx512vl,\@function,5 +.align 32 +chacha20_avx512vl: +.cfi_startproc +.Lchacha20_avx512vl: + mov %rsp,%r9 # frame pointer +.cfi_def_cfa_register %r9 + cmp \$128,$len + ja .Lchacha20_8xvl + + sub \$64+$xframe,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0x28(%r9) + movaps %xmm7,-0x18(%r9) +.Lavx512vl_body: +___ +$code.=<<___; + vbroadcasti128 .Lsigma(%rip),$a + vbroadcasti128 ($key),$b + vbroadcasti128 16($key),$c + vbroadcasti128 ($counter),$d + + vmovdqa32 $a,$a_ + vmovdqa32 $b,$b_ + vmovdqa32 $c,$c_ + vpaddd .Lzeroz(%rip),$d,$d + vmovdqa32 .Ltwoy(%rip),$fourz + mov \$10,$counter # reuse $counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512vl + +.align 16 +.Loop_outer_avx512vl: + vmovdqa32 $c_,$c + vpaddd $fourz,$d_,$d + mov \$10,$counter + vmovdqa32 $d,$d_ + jmp .Loop_avx512vl + +.align 32 +.Loop_avx512vl: +___ + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b00111001); + &vpshufd ($d,$d,0b10010011); + + &AVX512ROUND(); + &vpshufd ($c,$c,0b01001110); + &vpshufd ($b,$b,0b10010011); + &vpshufd ($d,$d,0b00111001); + + &dec ($counter); + &jnz (".Loop_avx512vl"); + +$code.=<<___; + vpaddd $a_,$a,$a + vpaddd $b_,$b,$b + vpaddd $c_,$c,$c + vpaddd $d_,$d,$d + + sub \$64,$len + jb .Ltail64_avx512vl + + vpxor 0x00($inp),%x#$a,$t0 # xor with input + vpxor 0x10($inp),%x#$b,$t1 + vpxor 0x20($inp),%x#$c,$t2 + vpxor 0x30($inp),%x#$d,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + jz .Ldone_avx512vl + + vextracti128 \$1,$a,$t0 + vextracti128 \$1,$b,$t1 + vextracti128 \$1,$c,$t2 + vextracti128 \$1,$d,$t3 + + sub \$64,$len + jb .Ltail_avx512vl + + vpxor 0x00($inp),$t0,$t0 # xor with input + vpxor 0x10($inp),$t1,$t1 + vpxor 0x20($inp),$t2,$t2 + vpxor 0x30($inp),$t3,$t3 + lea 0x40($inp),$inp # inp+=64 + + vmovdqu $t0,0x00($out) # write output + vmovdqu $t1,0x10($out) + vmovdqu $t2,0x20($out) + vmovdqu $t3,0x30($out) + lea 0x40($out),$out # out+=64 + + vmovdqa32 $a_,$a + vmovdqa32 $b_,$b + jnz .Loop_outer_avx512vl + + jmp 
.Ldone_avx512vl + +.align 16 +.Ltail64_avx512vl: + vmovdqa %x#$a,0x00(%rsp) + vmovdqa %x#$b,0x10(%rsp) + vmovdqa %x#$c,0x20(%rsp) + vmovdqa %x#$d,0x30(%rsp) + add \$64,$len + jmp .Loop_tail_avx512vl + +.align 16 +.Ltail_avx512vl: + vmovdqa $t0,0x00(%rsp) + vmovdqa $t1,0x10(%rsp) + vmovdqa $t2,0x20(%rsp) + vmovdqa $t3,0x30(%rsp) + add \$64,$len + +.Loop_tail_avx512vl: + movzb ($inp,$counter),%eax + movzb (%rsp,$counter),%ecx + lea 1($counter),$counter + xor %ecx,%eax + mov %al,-1($out,$counter) + dec $len + jnz .Loop_tail_avx512vl + + vmovdqu32 $a_,0x00(%rsp) + vmovdqu32 $a_,0x20(%rsp) + +.Ldone_avx512vl: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0x28(%r9),%xmm6 + movaps -0x18(%r9),%xmm7 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.Lavx512vl_epilogue: + ret +.cfi_endproc +.size chacha20_avx512vl,.-chacha20_avx512vl +___ +} +if ($avx>2) { +# This one handles longer inputs... + +my ($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3)=map("%zmm$_",(0..15)); +my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +my @key=map("%zmm$_",(16..31)); +my ($xt0,$xt1,$xt2,$xt3)=@key[0..3]; + +sub AVX512_lane_ROUND { +my ($a0,$b0,$c0,$d0)=@_; +my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0)); +my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1)); +my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2)); +my @x=map("\"$_\"",@xx); + + ( + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1 + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2 + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3 + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4 + "&vpxord (@x[$d0],@x[$d0],@x[$a0])", + "&vpxord (@x[$d1],@x[$d1],@x[$a1])", + "&vpxord (@x[$d2],@x[$d2],@x[$a2])", + "&vpxord (@x[$d3],@x[$d3],@x[$a3])", + "&vprold (@x[$d0],@x[$d0],16)", + "&vprold (@x[$d1],@x[$d1],16)", + "&vprold (@x[$d2],@x[$d2],16)", + "&vprold (@x[$d3],@x[$d3],16)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxord (@x[$b0],@x[$b0],@x[$c0])", + "&vpxord (@x[$b1],@x[$b1],@x[$c1])", + "&vpxord (@x[$b2],@x[$b2],@x[$c2])", + "&vpxord (@x[$b3],@x[$b3],@x[$c3])", + "&vprold (@x[$b0],@x[$b0],12)", + "&vprold (@x[$b1],@x[$b1],12)", + "&vprold (@x[$b2],@x[$b2],12)", + "&vprold (@x[$b3],@x[$b3],12)", + + "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", + "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", + "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", + "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", + "&vpxord (@x[$d0],@x[$d0],@x[$a0])", + "&vpxord (@x[$d1],@x[$d1],@x[$a1])", + "&vpxord (@x[$d2],@x[$d2],@x[$a2])", + "&vpxord (@x[$d3],@x[$d3],@x[$a3])", + "&vprold (@x[$d0],@x[$d0],8)", + "&vprold (@x[$d1],@x[$d1],8)", + "&vprold (@x[$d2],@x[$d2],8)", + "&vprold (@x[$d3],@x[$d3],8)", + + "&vpaddd (@x[$c0],@x[$c0],@x[$d0])", + "&vpaddd (@x[$c1],@x[$c1],@x[$d1])", + "&vpaddd (@x[$c2],@x[$c2],@x[$d2])", + "&vpaddd (@x[$c3],@x[$c3],@x[$d3])", + "&vpxord (@x[$b0],@x[$b0],@x[$c0])", + "&vpxord (@x[$b1],@x[$b1],@x[$c1])", + "&vpxord (@x[$b2],@x[$b2],@x[$c2])", + "&vpxord (@x[$b3],@x[$b3],@x[$c3])", + "&vprold (@x[$b0],@x[$b0],7)", + "&vprold (@x[$b1],@x[$b1],7)", + "&vprold (@x[$b2],@x[$b2],7)", + "&vprold (@x[$b3],@x[$b3],7)" + ); +} + +my $xframe = $win64 ? 
0xa8 : 8; + +$code.=<<___; +.global chacha20_16x +.type chacha20_16x,\@function,5 +.align 32 +chacha20_16x: +.cfi_startproc +.Lchacha20_16x: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$64+$xframe,%rsp + and \$-64,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L16x_body: +___ +$code.=<<___; + vzeroupper + + lea .Lsigma(%rip),%r10 + vbroadcasti32x4 (%r10),$xa3 # key[0] + vbroadcasti32x4 ($key),$xb3 # key[1] + vbroadcasti32x4 16($key),$xc3 # key[2] + vbroadcasti32x4 ($counter),$xd3 # key[3] + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vpshufd \$0xaa,$xa3,$xa2 + vpshufd \$0xff,$xa3,$xa3 + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vpshufd \$0xaa,$xb3,$xb2 + vpshufd \$0xff,$xb3,$xb3 + vmovdqa64 $xb0,@key[4] + vmovdqa64 $xb1,@key[5] + vmovdqa64 $xb2,@key[6] + vmovdqa64 $xb3,@key[7] + + vpshufd \$0x00,$xc3,$xc0 + vpshufd \$0x55,$xc3,$xc1 + vpshufd \$0xaa,$xc3,$xc2 + vpshufd \$0xff,$xc3,$xc3 + vmovdqa64 $xc0,@key[8] + vmovdqa64 $xc1,@key[9] + vmovdqa64 $xc2,@key[10] + vmovdqa64 $xc3,@key[11] + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpshufd \$0xaa,$xd3,$xd2 + vpshufd \$0xff,$xd3,$xd3 + vpaddd .Lincz(%rip),$xd0,$xd0 # don't save counters yet + vmovdqa64 $xd0,@key[12] + vmovdqa64 $xd1,@key[13] + vmovdqa64 $xd2,@key[14] + vmovdqa64 $xd3,@key[15] + + mov \$10,%eax + jmp .Loop16x + +.align 32 +.Loop_outer16x: + vpbroadcastd 0(%r10),$xa0 # reload key + vpbroadcastd 4(%r10),$xa1 + vpbroadcastd 8(%r10),$xa2 + vpbroadcastd 12(%r10),$xa3 + vpaddd .Lsixteen(%rip),@key[12],@key[12] # next SIMD counters + vmovdqa64 @key[4],$xb0 + vmovdqa64 @key[5],$xb1 + vmovdqa64 @key[6],$xb2 + vmovdqa64 @key[7],$xb3 + vmovdqa64 @key[8],$xc0 + vmovdqa64 @key[9],$xc1 + vmovdqa64 @key[10],$xc2 + vmovdqa64 @key[11],$xc3 + vmovdqa64 @key[12],$xd0 + vmovdqa64 @key[13],$xd1 + vmovdqa64 @key[14],$xd2 + vmovdqa64 @key[15],$xd3 + + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + mov \$10,%eax + jmp .Loop16x + +.align 32 +.Loop16x: +___ + foreach (&AVX512_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX512_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop16x + + vpaddd @key[0],$xa0,$xa0 # accumulate key + vpaddd @key[1],$xa1,$xa1 + vpaddd @key[2],$xa2,$xa2 + vpaddd @key[3],$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd @key[4],$xb0,$xb0 + vpaddd @key[5],$xb1,$xb1 + vpaddd @key[6],$xb2,$xb2 + vpaddd @key[7],$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + vshufi32x4 \$0x44,$xb0,$xa0,$xt3 # "de-interlace" 
further + vshufi32x4 \$0xee,$xb0,$xa0,$xb0 + vshufi32x4 \$0x44,$xb1,$xa1,$xa0 + vshufi32x4 \$0xee,$xb1,$xa1,$xb1 + vshufi32x4 \$0x44,$xb2,$xa2,$xa1 + vshufi32x4 \$0xee,$xb2,$xa2,$xb2 + vshufi32x4 \$0x44,$xb3,$xa3,$xa2 + vshufi32x4 \$0xee,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); +$code.=<<___; + vpaddd @key[8],$xc0,$xc0 + vpaddd @key[9],$xc1,$xc1 + vpaddd @key[10],$xc2,$xc2 + vpaddd @key[11],$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd @key[12],$xd0,$xd0 + vpaddd @key[13],$xd1,$xd1 + vpaddd @key[14],$xd2,$xd2 + vpaddd @key[15],$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vshufi32x4 \$0x44,$xd0,$xc0,$xt3 # "de-interlace" further + vshufi32x4 \$0xee,$xd0,$xc0,$xd0 + vshufi32x4 \$0x44,$xd1,$xc1,$xc0 + vshufi32x4 \$0xee,$xd1,$xc1,$xd1 + vshufi32x4 \$0x44,$xd2,$xc2,$xc1 + vshufi32x4 \$0xee,$xd2,$xc2,$xd2 + vshufi32x4 \$0x44,$xd3,$xc3,$xc2 + vshufi32x4 \$0xee,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); +$code.=<<___; + vshufi32x4 \$0x88,$xc0,$xa0,$xt0 # "de-interlace" further + vshufi32x4 \$0xdd,$xc0,$xa0,$xa0 + vshufi32x4 \$0x88,$xd0,$xb0,$xc0 + vshufi32x4 \$0xdd,$xd0,$xb0,$xd0 + vshufi32x4 \$0x88,$xc1,$xa1,$xt1 + vshufi32x4 \$0xdd,$xc1,$xa1,$xa1 + vshufi32x4 \$0x88,$xd1,$xb1,$xc1 + vshufi32x4 \$0xdd,$xd1,$xb1,$xd1 + vshufi32x4 \$0x88,$xc2,$xa2,$xt2 + vshufi32x4 \$0xdd,$xc2,$xa2,$xa2 + vshufi32x4 \$0x88,$xd2,$xb2,$xc2 + vshufi32x4 \$0xdd,$xd2,$xb2,$xd2 + vshufi32x4 \$0x88,$xc3,$xa3,$xt3 + vshufi32x4 \$0xdd,$xc3,$xa3,$xa3 + vshufi32x4 \$0x88,$xd3,$xb3,$xc3 + vshufi32x4 \$0xdd,$xd3,$xb3,$xd3 +___ + ($xa0,$xa1,$xa2,$xa3,$xb0,$xb1,$xb2,$xb3)= + ($xt0,$xt1,$xt2,$xt3,$xa0,$xa1,$xa2,$xa3); + + ($xa0,$xb0,$xc0,$xd0, $xa1,$xb1,$xc1,$xd1, + $xa2,$xb2,$xc2,$xd2, $xa3,$xb3,$xc3,$xd3) = + ($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +$code.=<<___; + cmp \$64*16,$len + jb .Ltail16x + + vpxord 0x00($inp),$xa0,$xa0 # xor with input + vpxord 0x40($inp),$xb0,$xb0 + vpxord 0x80($inp),$xc0,$xc0 + vpxord 0xc0($inp),$xd0,$xd0 + vmovdqu32 $xa0,0x00($out) + vmovdqu32 $xb0,0x40($out) + vmovdqu32 $xc0,0x80($out) + vmovdqu32 $xd0,0xc0($out) + + vpxord 0x100($inp),$xa1,$xa1 + vpxord 0x140($inp),$xb1,$xb1 + vpxord 0x180($inp),$xc1,$xc1 + vpxord 0x1c0($inp),$xd1,$xd1 + vmovdqu32 $xa1,0x100($out) + vmovdqu32 $xb1,0x140($out) + vmovdqu32 $xc1,0x180($out) + vmovdqu32 $xd1,0x1c0($out) + + vpxord 0x200($inp),$xa2,$xa2 + vpxord 0x240($inp),$xb2,$xb2 + vpxord 0x280($inp),$xc2,$xc2 + vpxord 0x2c0($inp),$xd2,$xd2 + vmovdqu32 $xa2,0x200($out) + vmovdqu32 $xb2,0x240($out) + vmovdqu32 $xc2,0x280($out) + vmovdqu32 $xd2,0x2c0($out) + + vpxord 0x300($inp),$xa3,$xa3 + vpxord 0x340($inp),$xb3,$xb3 + vpxord 0x380($inp),$xc3,$xc3 + vpxord 0x3c0($inp),$xd3,$xd3 + lea 0x400($inp),$inp + vmovdqu32 $xa3,0x300($out) + vmovdqu32 $xb3,0x340($out) + vmovdqu32 $xc3,0x380($out) + vmovdqu32 $xd3,0x3c0($out) + lea 0x400($out),$out + + sub 
\$64*16,$len + jnz .Loop_outer16x + + jmp .Ldone16x + +.align 32 +.Ltail16x: + xor %r10,%r10 + sub $inp,$out + cmp \$64*1,$len + jb .Less_than_64_16x + vpxord ($inp),$xa0,$xa0 # xor with input + vmovdqu32 $xa0,($out,$inp) + je .Ldone16x + vmovdqa32 $xb0,$xa0 + lea 64($inp),$inp + + cmp \$64*2,$len + jb .Less_than_64_16x + vpxord ($inp),$xb0,$xb0 + vmovdqu32 $xb0,($out,$inp) + je .Ldone16x + vmovdqa32 $xc0,$xa0 + lea 64($inp),$inp + + cmp \$64*3,$len + jb .Less_than_64_16x + vpxord ($inp),$xc0,$xc0 + vmovdqu32 $xc0,($out,$inp) + je .Ldone16x + vmovdqa32 $xd0,$xa0 + lea 64($inp),$inp + + cmp \$64*4,$len + jb .Less_than_64_16x + vpxord ($inp),$xd0,$xd0 + vmovdqu32 $xd0,($out,$inp) + je .Ldone16x + vmovdqa32 $xa1,$xa0 + lea 64($inp),$inp + + cmp \$64*5,$len + jb .Less_than_64_16x + vpxord ($inp),$xa1,$xa1 + vmovdqu32 $xa1,($out,$inp) + je .Ldone16x + vmovdqa32 $xb1,$xa0 + lea 64($inp),$inp + + cmp \$64*6,$len + jb .Less_than_64_16x + vpxord ($inp),$xb1,$xb1 + vmovdqu32 $xb1,($out,$inp) + je .Ldone16x + vmovdqa32 $xc1,$xa0 + lea 64($inp),$inp + + cmp \$64*7,$len + jb .Less_than_64_16x + vpxord ($inp),$xc1,$xc1 + vmovdqu32 $xc1,($out,$inp) + je .Ldone16x + vmovdqa32 $xd1,$xa0 + lea 64($inp),$inp + + cmp \$64*8,$len + jb .Less_than_64_16x + vpxord ($inp),$xd1,$xd1 + vmovdqu32 $xd1,($out,$inp) + je .Ldone16x + vmovdqa32 $xa2,$xa0 + lea 64($inp),$inp + + cmp \$64*9,$len + jb .Less_than_64_16x + vpxord ($inp),$xa2,$xa2 + vmovdqu32 $xa2,($out,$inp) + je .Ldone16x + vmovdqa32 $xb2,$xa0 + lea 64($inp),$inp + + cmp \$64*10,$len + jb .Less_than_64_16x + vpxord ($inp),$xb2,$xb2 + vmovdqu32 $xb2,($out,$inp) + je .Ldone16x + vmovdqa32 $xc2,$xa0 + lea 64($inp),$inp + + cmp \$64*11,$len + jb .Less_than_64_16x + vpxord ($inp),$xc2,$xc2 + vmovdqu32 $xc2,($out,$inp) + je .Ldone16x + vmovdqa32 $xd2,$xa0 + lea 64($inp),$inp + + cmp \$64*12,$len + jb .Less_than_64_16x + vpxord ($inp),$xd2,$xd2 + vmovdqu32 $xd2,($out,$inp) + je .Ldone16x + vmovdqa32 $xa3,$xa0 + lea 64($inp),$inp + + cmp \$64*13,$len + jb .Less_than_64_16x + vpxord ($inp),$xa3,$xa3 + vmovdqu32 $xa3,($out,$inp) + je .Ldone16x + vmovdqa32 $xb3,$xa0 + lea 64($inp),$inp + + cmp \$64*14,$len + jb .Less_than_64_16x + vpxord ($inp),$xb3,$xb3 + vmovdqu32 $xb3,($out,$inp) + je .Ldone16x + vmovdqa32 $xc3,$xa0 + lea 64($inp),$inp + + cmp \$64*15,$len + jb .Less_than_64_16x + vpxord ($inp),$xc3,$xc3 + vmovdqu32 $xc3,($out,$inp) + je .Ldone16x + vmovdqa32 $xd3,$xa0 + lea 64($inp),$inp + +.Less_than_64_16x: + vmovdqa32 $xa0,0x00(%rsp) + lea ($out,$inp),$out + and \$63,$len + +.Loop_tail16x: + movzb ($inp,%r10),%eax + movzb (%rsp,%r10),%ecx + lea 1(%r10),%r10 + xor %ecx,%eax + mov %al,-1($out,%r10) + dec $len + jnz .Loop_tail16x + + vpxord $xa0,$xa0,$xa0 + vmovdqa32 $xa0,0(%rsp) + +.Ldone16x: + vzeroall +___ +$code.=<<___ if ($win64); + movaps -0xa8(%r9),%xmm6 + movaps -0x98(%r9),%xmm7 + movaps -0x88(%r9),%xmm8 + movaps -0x78(%r9),%xmm9 + movaps -0x68(%r9),%xmm10 + movaps -0x58(%r9),%xmm11 + movaps -0x48(%r9),%xmm12 + movaps -0x38(%r9),%xmm13 + movaps -0x28(%r9),%xmm14 + movaps -0x18(%r9),%xmm15 +___ +$code.=<<___; + lea (%r9),%rsp +.cfi_def_cfa_register %rsp +.L16x_epilogue: + ret +.cfi_endproc +.size chacha20_16x,.-chacha20_16x +___ + +# switch to %ymm domain +($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3)=map("%ymm$_",(0..15)); +@xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3, + $xc0,$xc1,$xc2,$xc3, $xd0,$xd1,$xd2,$xd3); +@key=map("%ymm$_",(16..31)); +($xt0,$xt1,$xt2,$xt3)=@key[0..3]; + +$code.=<<___; +.global chacha20_8xvl 
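+# Both chacha20_16x above and chacha20_8xvl below drive AVX512_lane_ROUND,
+# which applies the standard ChaCha20 quarter-round to 16 (%zmm) or 8 (%ymm)
+# independent blocks at once, one state word per register; .Lincz/.Lincy give
+# each lane its own block counter and .Lsixteen/.Leight step all counters per
+# outer iteration. For reference, a commented-out scalar model of that
+# quarter-round (hypothetical helper, not part of the generated code):
+#
+#	sub quarter_round_ref {
+#	    my ($a, $b, $c, $d) = @_;
+#	    my $M = 0xffffffff;		# stay within 32 bits
+#	    my $rotl = sub { my ($x, $n) = @_; (($x << $n) | ($x >> (32 - $n))) & $M };
+#	    $a = ($a + $b) & $M; $d = $rotl->($d ^ $a, 16);
+#	    $c = ($c + $d) & $M; $b = $rotl->($b ^ $c, 12);
+#	    $a = ($a + $b) & $M; $d = $rotl->($d ^ $a, 8);
+#	    $c = ($c + $d) & $M; $b = $rotl->($b ^ $c, 7);
+#	    return ($a, $b, $c, $d);
+#	}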
+.type chacha20_8xvl,\@function,5 +.align 32 +chacha20_8xvl: +.cfi_startproc +.Lchacha20_8xvl: + mov %rsp,%r9 # frame register +.cfi_def_cfa_register %r9 + sub \$64+$xframe,%rsp + and \$-64,%rsp +___ +$code.=<<___ if ($win64); + movaps %xmm6,-0xa8(%r9) + movaps %xmm7,-0x98(%r9) + movaps %xmm8,-0x88(%r9) + movaps %xmm9,-0x78(%r9) + movaps %xmm10,-0x68(%r9) + movaps %xmm11,-0x58(%r9) + movaps %xmm12,-0x48(%r9) + movaps %xmm13,-0x38(%r9) + movaps %xmm14,-0x28(%r9) + movaps %xmm15,-0x18(%r9) +.L8xvl_body: +___ +$code.=<<___; + vzeroupper + + lea .Lsigma(%rip),%r10 + vbroadcasti128 (%r10),$xa3 # key[0] + vbroadcasti128 ($key),$xb3 # key[1] + vbroadcasti128 16($key),$xc3 # key[2] + vbroadcasti128 ($counter),$xd3 # key[3] + + vpshufd \$0x00,$xa3,$xa0 # smash key by lanes... + vpshufd \$0x55,$xa3,$xa1 + vpshufd \$0xaa,$xa3,$xa2 + vpshufd \$0xff,$xa3,$xa3 + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + vpshufd \$0x00,$xb3,$xb0 + vpshufd \$0x55,$xb3,$xb1 + vpshufd \$0xaa,$xb3,$xb2 + vpshufd \$0xff,$xb3,$xb3 + vmovdqa64 $xb0,@key[4] + vmovdqa64 $xb1,@key[5] + vmovdqa64 $xb2,@key[6] + vmovdqa64 $xb3,@key[7] + + vpshufd \$0x00,$xc3,$xc0 + vpshufd \$0x55,$xc3,$xc1 + vpshufd \$0xaa,$xc3,$xc2 + vpshufd \$0xff,$xc3,$xc3 + vmovdqa64 $xc0,@key[8] + vmovdqa64 $xc1,@key[9] + vmovdqa64 $xc2,@key[10] + vmovdqa64 $xc3,@key[11] + + vpshufd \$0x00,$xd3,$xd0 + vpshufd \$0x55,$xd3,$xd1 + vpshufd \$0xaa,$xd3,$xd2 + vpshufd \$0xff,$xd3,$xd3 + vpaddd .Lincy(%rip),$xd0,$xd0 # don't save counters yet + vmovdqa64 $xd0,@key[12] + vmovdqa64 $xd1,@key[13] + vmovdqa64 $xd2,@key[14] + vmovdqa64 $xd3,@key[15] + + mov \$10,%eax + jmp .Loop8xvl + +.align 32 +.Loop_outer8xvl: + #vpbroadcastd 0(%r10),$xa0 # reload key + #vpbroadcastd 4(%r10),$xa1 + vpbroadcastd 8(%r10),$xa2 + vpbroadcastd 12(%r10),$xa3 + vpaddd .Leight(%rip),@key[12],@key[12] # next SIMD counters + vmovdqa64 @key[4],$xb0 + vmovdqa64 @key[5],$xb1 + vmovdqa64 @key[6],$xb2 + vmovdqa64 @key[7],$xb3 + vmovdqa64 @key[8],$xc0 + vmovdqa64 @key[9],$xc1 + vmovdqa64 @key[10],$xc2 + vmovdqa64 @key[11],$xc3 + vmovdqa64 @key[12],$xd0 + vmovdqa64 @key[13],$xd1 + vmovdqa64 @key[14],$xd2 + vmovdqa64 @key[15],$xd3 + + vmovdqa64 $xa0,@key[0] + vmovdqa64 $xa1,@key[1] + vmovdqa64 $xa2,@key[2] + vmovdqa64 $xa3,@key[3] + + mov \$10,%eax + jmp .Loop8xvl + +.align 32 +.Loop8xvl: +___ + foreach (&AVX512_lane_ROUND(0, 4, 8,12)) { eval; } + foreach (&AVX512_lane_ROUND(0, 5,10,15)) { eval; } +$code.=<<___; + dec %eax + jnz .Loop8xvl + + vpaddd @key[0],$xa0,$xa0 # accumulate key + vpaddd @key[1],$xa1,$xa1 + vpaddd @key[2],$xa2,$xa2 + vpaddd @key[3],$xa3,$xa3 + + vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data + vpunpckldq $xa3,$xa2,$xt3 + vpunpckhdq $xa1,$xa0,$xa0 + vpunpckhdq $xa3,$xa2,$xa2 + vpunpcklqdq $xt3,$xt2,$xa1 # "a0" + vpunpckhqdq $xt3,$xt2,$xt2 # "a1" + vpunpcklqdq $xa2,$xa0,$xa3 # "a2" + vpunpckhqdq $xa2,$xa0,$xa0 # "a3" +___ + ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2); +$code.=<<___; + vpaddd @key[4],$xb0,$xb0 + vpaddd @key[5],$xb1,$xb1 + vpaddd @key[6],$xb2,$xb2 + vpaddd @key[7],$xb3,$xb3 + + vpunpckldq $xb1,$xb0,$xt2 + vpunpckldq $xb3,$xb2,$xt3 + vpunpckhdq $xb1,$xb0,$xb0 + vpunpckhdq $xb3,$xb2,$xb2 + vpunpcklqdq $xt3,$xt2,$xb1 # "b0" + vpunpckhqdq $xt3,$xt2,$xt2 # "b1" + vpunpcklqdq $xb2,$xb0,$xb3 # "b2" + vpunpckhqdq $xb2,$xb0,$xb0 # "b3" +___ + ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2); +$code.=<<___; + vshufi32x4 \$0,$xb0,$xa0,$xt3 # "de-interlace" further + vshufi32x4 \$3,$xb0,$xa0,$xb0 + 
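+# Note: with 256-bit operands vshufi32x4 selects one 128-bit lane per source,
+# so immediate \$0 gathers the low lanes of both sources and immediate \$3 the
+# high ones -- the AVX512VL counterpart of the vperm2i128 \$0x20/\$0x31 pair
+# used for the c/d rows below, and of the \$0x44/\$0xee immediates that pick
+# lane pairs out of the four-lane %zmm registers in chacha20_16x.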
vshufi32x4 \$0,$xb1,$xa1,$xa0 + vshufi32x4 \$3,$xb1,$xa1,$xb1 + vshufi32x4 \$0,$xb2,$xa2,$xa1 + vshufi32x4 \$3,$xb2,$xa2,$xb2 + vshufi32x4 \$0,$xb3,$xa3,$xa2 + vshufi32x4 \$3,$xb3,$xa3,$xb3 +___ + ($xa0,$xa1,$xa2,$xa3,$xt3)=($xt3,$xa0,$xa1,$xa2,$xa3); +$code.=<<___; + vpaddd @key[8],$xc0,$xc0 + vpaddd @key[9],$xc1,$xc1 + vpaddd @key[10],$xc2,$xc2 + vpaddd @key[11],$xc3,$xc3 + + vpunpckldq $xc1,$xc0,$xt2 + vpunpckldq $xc3,$xc2,$xt3 + vpunpckhdq $xc1,$xc0,$xc0 + vpunpckhdq $xc3,$xc2,$xc2 + vpunpcklqdq $xt3,$xt2,$xc1 # "c0" + vpunpckhqdq $xt3,$xt2,$xt2 # "c1" + vpunpcklqdq $xc2,$xc0,$xc3 # "c2" + vpunpckhqdq $xc2,$xc0,$xc0 # "c3" +___ + ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2); +$code.=<<___; + vpaddd @key[12],$xd0,$xd0 + vpaddd @key[13],$xd1,$xd1 + vpaddd @key[14],$xd2,$xd2 + vpaddd @key[15],$xd3,$xd3 + + vpunpckldq $xd1,$xd0,$xt2 + vpunpckldq $xd3,$xd2,$xt3 + vpunpckhdq $xd1,$xd0,$xd0 + vpunpckhdq $xd3,$xd2,$xd2 + vpunpcklqdq $xt3,$xt2,$xd1 # "d0" + vpunpckhqdq $xt3,$xt2,$xt2 # "d1" + vpunpcklqdq $xd2,$xd0,$xd3 # "d2" + vpunpckhqdq $xd2,$xd0,$xd0 # "d3" +___ + ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2); +$code.=<<___; + vperm2i128 \$0x20,$xd0,$xc0,$xt3 # "de-interlace" further + vperm2i128 \$0x31,$xd0,$xc0,$xd0 + vperm2i128 \$0x20,$xd1,$xc1,$xc0 + vperm2i128 \$0x31,$xd1,$xc1,$xd1 + vperm2i128 \$0x20,$xd2,$xc2,$xc1 + vperm2i128 \$0x31,$xd2,$xc2,$xd2 + vperm2i128 \$0x20,$xd3,$xc3,$xc2 + vperm2i128 \$0x31,$xd3,$xc3,$xd3 +___ + ($xc0,$xc1,$xc2,$xc3,$xt3)=($xt3,$xc0,$xc1,$xc2,$xc3); + ($xb0,$xb1,$xb2,$xb3,$xc0,$xc1,$xc2,$xc3)= + ($xc0,$xc1,$xc2,$xc3,$xb0,$xb1,$xb2,$xb3); +$code.=<<___; + cmp \$64*8,$len + jb .Ltail8xvl + + mov \$0x80,%eax # size optimization + vpxord 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vpxor 0x40($inp),$xc0,$xc0 + vpxor 0x60($inp),$xd0,$xd0 + lea ($inp,%rax),$inp # size optimization + vmovdqu32 $xa0,0x00($out) + vmovdqu $xb0,0x20($out) + vmovdqu $xc0,0x40($out) + vmovdqu $xd0,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxor 0x00($inp),$xa1,$xa1 + vpxor 0x20($inp),$xb1,$xb1 + vpxor 0x40($inp),$xc1,$xc1 + vpxor 0x60($inp),$xd1,$xd1 + lea ($inp,%rax),$inp # size optimization + vmovdqu $xa1,0x00($out) + vmovdqu $xb1,0x20($out) + vmovdqu $xc1,0x40($out) + vmovdqu $xd1,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxord 0x00($inp),$xa2,$xa2 + vpxor 0x20($inp),$xb2,$xb2 + vpxor 0x40($inp),$xc2,$xc2 + vpxor 0x60($inp),$xd2,$xd2 + lea ($inp,%rax),$inp # size optimization + vmovdqu32 $xa2,0x00($out) + vmovdqu $xb2,0x20($out) + vmovdqu $xc2,0x40($out) + vmovdqu $xd2,0x60($out) + lea ($out,%rax),$out # size optimization + + vpxor 0x00($inp),$xa3,$xa3 + vpxor 0x20($inp),$xb3,$xb3 + vpxor 0x40($inp),$xc3,$xc3 + vpxor 0x60($inp),$xd3,$xd3 + lea ($inp,%rax),$inp # size optimization + vmovdqu $xa3,0x00($out) + vmovdqu $xb3,0x20($out) + vmovdqu $xc3,0x40($out) + vmovdqu $xd3,0x60($out) + lea ($out,%rax),$out # size optimization + + vpbroadcastd 0(%r10),%ymm0 # reload key + vpbroadcastd 4(%r10),%ymm1 + + sub \$64*8,$len + jnz .Loop_outer8xvl + + jmp .Ldone8xvl + +.align 32 +.Ltail8xvl: + vmovdqa64 $xa0,%ymm8 # size optimization +___ +$xa0 = "%ymm8"; +$code.=<<___; + xor %r10,%r10 + sub $inp,$out + cmp \$64*1,$len + jb .Less_than_64_8xvl + vpxor 0x00($inp),$xa0,$xa0 # xor with input + vpxor 0x20($inp),$xb0,$xb0 + vmovdqu $xa0,0x00($out,$inp) + vmovdqu $xb0,0x20($out,$inp) + je .Ldone8xvl + vmovdqa $xc0,$xa0 + vmovdqa $xd0,$xb0 + lea 64($inp),$inp + + cmp \$64*2,$len + jb .Less_than_64_8xvl + vpxor 
0x00($inp),$xc0,$xc0
+	vpxor	0x20($inp),$xd0,$xd0
+	vmovdqu	$xc0,0x00($out,$inp)
+	vmovdqu	$xd0,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xa1,$xa0
+	vmovdqa	$xb1,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*3,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xa1,$xa1
+	vpxor	0x20($inp),$xb1,$xb1
+	vmovdqu	$xa1,0x00($out,$inp)
+	vmovdqu	$xb1,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc1,$xa0
+	vmovdqa	$xd1,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*4,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xc1,$xc1
+	vpxor	0x20($inp),$xd1,$xd1
+	vmovdqu	$xc1,0x00($out,$inp)
+	vmovdqu	$xd1,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa32	$xa2,$xa0
+	vmovdqa	$xb2,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*5,$len
+	jb	.Less_than_64_8xvl
+	vpxord	0x00($inp),$xa2,$xa2
+	vpxor	0x20($inp),$xb2,$xb2
+	vmovdqu32	$xa2,0x00($out,$inp)
+	vmovdqu	$xb2,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc2,$xa0
+	vmovdqa	$xd2,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*6,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xc2,$xc2
+	vpxor	0x20($inp),$xd2,$xd2
+	vmovdqu	$xc2,0x00($out,$inp)
+	vmovdqu	$xd2,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xa3,$xa0
+	vmovdqa	$xb3,$xb0
+	lea	64($inp),$inp
+
+	cmp	\$64*7,$len
+	jb	.Less_than_64_8xvl
+	vpxor	0x00($inp),$xa3,$xa3
+	vpxor	0x20($inp),$xb3,$xb3
+	vmovdqu	$xa3,0x00($out,$inp)
+	vmovdqu	$xb3,0x20($out,$inp)
+	je	.Ldone8xvl
+	vmovdqa	$xc3,$xa0
+	vmovdqa	$xd3,$xb0
+	lea	64($inp),$inp
+
+.Less_than_64_8xvl:
+	vmovdqa	$xa0,0x00(%rsp)
+	vmovdqa	$xb0,0x20(%rsp)
+	lea	($out,$inp),$out
+	and	\$63,$len
+
+.Loop_tail8xvl:
+	movzb	($inp,%r10),%eax
+	movzb	(%rsp,%r10),%ecx
+	lea	1(%r10),%r10
+	xor	%ecx,%eax
+	mov	%al,-1($out,%r10)
+	dec	$len
+	jnz	.Loop_tail8xvl
+
+	vpxor	$xa0,$xa0,$xa0
+	vmovdqa	$xa0,0x00(%rsp)
+	vmovdqa	$xa0,0x20(%rsp)
+
+.Ldone8xvl:
+	vzeroall
+___
+$code.=<<___	if ($win64);
+	movaps	-0xa8(%r9),%xmm6
+	movaps	-0x98(%r9),%xmm7
+	movaps	-0x88(%r9),%xmm8
+	movaps	-0x78(%r9),%xmm9
+	movaps	-0x68(%r9),%xmm10
+	movaps	-0x58(%r9),%xmm11
+	movaps	-0x48(%r9),%xmm12
+	movaps	-0x38(%r9),%xmm13
+	movaps	-0x28(%r9),%xmm14
+	movaps	-0x18(%r9),%xmm15
+___
+$code.=<<___;
+	lea	(%r9),%rsp
+.cfi_def_cfa_register	%rsp
+.L8xvl_epilogue:
+	ret
+.cfi_endproc
+.size	chacha20_8xvl,.-chacha20_8xvl
+___
+}
+
+# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
+#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
+if ($win64) {
+$rec="%rcx";
+$frame="%rdx";
+$context="%r8";
+$disp="%r9";
+
+$code.=<<___;
+.extern	__imp_RtlVirtualUnwind
+.type	ssse3_handler,\@abi-omnipotent
+.align	16
+ssse3_handler:
+	push	%rsi
+	push	%rdi
+	push	%rbx
+	push	%rbp
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+	pushfq
+	sub	\$64,%rsp
+
+	mov	120($context),%rax	# pull context->Rax
+	mov	248($context),%rbx	# pull context->Rip
+
+	mov	8($disp),%rsi		# disp->ImageBase
+	mov	56($disp),%r11		# disp->HandlerData
+
+	mov	0(%r11),%r10d		# HandlerData[0]
+	lea	(%rsi,%r10),%r10	# prologue label
+	cmp	%r10,%rbx		# context->Rip<prologue label
+	jb	.Lcommon_seh_tail
+
+	mov	192($context),%rax	# pull context->R9
+
+	mov	4(%r11),%r10d		# HandlerData[1]
+	lea	(%rsi,%r10),%r10	# epilogue label
+	cmp	%r10,%rbx		# context->Rip>=epilogue label
+	jae	.Lcommon_seh_tail
+
+	lea	-0x28(%rax),%rsi
+	lea	512($context),%rdi	# &context.Xmm6
+	mov	\$4,%ecx
+	.long	0xa548f3fc		# cld; rep movsq
+
+.Lcommon_seh_tail:
+	mov	8(%rax),%rdi
+	mov	16(%rax),%rsi
+	mov	%rax,152($context)	# restore context->Rsp
+	mov	%rsi,168($context)	# restore context->Rsi
+	mov	%rdi,176($context)	# restore context->Rdi
+
+	mov	40($disp),%rdi		# disp->ContextRecord
+	mov	$context,%rsi		# context
+	mov	\$154,%ecx		# sizeof(CONTEXT)
+	.long	0xa548f3fc		# cld; rep movsq
+
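+	# The copied CONTEXT is handed to RtlVirtualUnwind so the OS can keep
+	# unwinding in our caller. Per the Win64 calling convention the first
+	# four arguments travel in %rcx/%rdx/%r8/%r9; arguments five to eight
+	# go in the 64-byte stack area reserved by the "sub \$64,%rsp" above.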
+	mov	$disp,%rsi
+	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
+	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
+	mov	0(%rsi),%r8		# arg3, disp->ControlPc
+	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
+	mov	40(%rsi),%r10		# disp->ContextRecord
+	lea	56(%rsi),%r11		# &disp->HandlerData
+	lea	24(%rsi),%r12		# &disp->EstablisherFrame
+	mov	%r10,32(%rsp)		# arg5
+	mov	%r11,40(%rsp)		# arg6
+	mov	%r12,48(%rsp)		# arg7
+	mov	%rcx,56(%rsp)		# arg8, (NULL)
+	call	*__imp_RtlVirtualUnwind(%rip)
+
+	mov	\$1,%eax		# ExceptionContinueSearch
+	add	\$64,%rsp
+	popfq
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+	pop	%rbp
+	pop	%rbx
+	pop	%rdi
+	pop	%rsi
+	ret
+.size	ssse3_handler,.-ssse3_handler
+
+.type	full_handler,\@abi-omnipotent
+.align	16
+full_handler:
+	push	%rsi
+	push	%rdi
+	push	%rbx
+	push	%rbp
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+	pushfq
+	sub	\$64,%rsp
+
+	mov	120($context),%rax	# pull context->Rax
+	mov	248($context),%rbx	# pull context->Rip
+
+	mov	8($disp),%rsi		# disp->ImageBase
+	mov	56($disp),%r11		# disp->HandlerData
+
+	mov	0(%r11),%r10d		# HandlerData[0]
+	lea	(%rsi,%r10),%r10	# prologue label
+	cmp	%r10,%rbx		# context->Rip<prologue label
+	jb	.Lcommon_seh_tail
+
+	mov	192($context),%rax	# pull context->R9
+
+	mov	4(%r11),%r10d		# HandlerData[1]
+	lea	(%rsi,%r10),%r10	# epilogue label
+	cmp	%r10,%rbx		# context->Rip>=epilogue label
+	jae	.Lcommon_seh_tail
+
+	lea	-0xa8(%rax),%rsi
+	lea	512($context),%rdi	# &context.Xmm6
+	mov	\$20,%ecx
+	.long	0xa548f3fc		# cld; rep movsq
+
+	jmp	.Lcommon_seh_tail
+.size	full_handler,.-full_handler
+
+.section	.pdata
+.align	4
+	.rva	.LSEH_begin_chacha20_ssse3
+	.rva	.LSEH_end_chacha20_ssse3
+	.rva	.LSEH_info_chacha20_ssse3
+
+	.rva	.LSEH_begin_chacha20_4x
+	.rva	.LSEH_end_chacha20_4x
+	.rva	.LSEH_info_chacha20_4x
+___
+$code.=<<___ if ($avx && 0);
+	.rva	.LSEH_begin_chacha20_4xop
+	.rva	.LSEH_end_chacha20_4xop
+	.rva	.LSEH_info_chacha20_4xop
+___
+$code.=<<___ if ($avx>1);
+	.rva	.LSEH_begin_chacha20_avx2
+	.rva	.LSEH_end_chacha20_avx2
+	.rva	.LSEH_info_chacha20_avx2
+___
+$code.=<<___ if ($avx>2);
+	.rva	.LSEH_begin_chacha20_avx512
+	.rva	.LSEH_end_chacha20_avx512
+	.rva	.LSEH_info_chacha20_avx512
+
+	.rva	.LSEH_begin_chacha20_avx512vl
+	.rva	.LSEH_end_chacha20_avx512vl
+	.rva	.LSEH_info_chacha20_avx512vl
+
+	.rva	.LSEH_begin_chacha20_16x
+	.rva	.LSEH_end_chacha20_16x
+	.rva	.LSEH_info_chacha20_16x
+
+	.rva	.LSEH_begin_chacha20_8xvl
+	.rva	.LSEH_end_chacha20_8xvl
+	.rva	.LSEH_info_chacha20_8xvl
+___
+$code.=<<___;
+.section	.xdata
+.align	8
+.LSEH_info_chacha20_ssse3:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lssse3_body,.Lssse3_epilogue
+
+.LSEH_info_chacha20_4x:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L4x_body,.L4x_epilogue
+___
+$code.=<<___ if ($avx&&0);
+.LSEH_info_chacha20_4xop:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L4xop_body,.L4xop_epilogue	# HandlerData[]
+___
+$code.=<<___ if ($avx>1);
+.LSEH_info_chacha20_avx2:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L8x_body,.L8x_epilogue		# HandlerData[]
+___
+$code.=<<___ if ($avx>2);
+.LSEH_info_chacha20_avx512:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lavx512_body,.Lavx512_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_avx512vl:
+	.byte	9,0,0,0
+	.rva	ssse3_handler
+	.rva	.Lavx512vl_body,.Lavx512vl_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_16x:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L16x_body,.L16x_epilogue	# HandlerData[]
+
+.LSEH_info_chacha20_8xvl:
+	.byte	9,0,0,0
+	.rva	full_handler
+	.rva	.L8xvl_body,.L8xvl_epilogue	# HandlerData[]
+___
+}
+
+foreach (split("\n",$code)) {
+	s/\`([^\`]*)\`/eval $1/ge;
+
+	s/%x#%[yz]/%x/g;	# "down-shift"
+
+	print $_,"\n";
+}
+
+close STDOUT;
diff --git a/crypto/make_poly1305_x64.pl b/crypto/make_poly1305_x64.pl
new file mode 100644
index 0000000..f7a2ab7
--- /dev/null
+++ b/crypto/make_poly1305_x64.pl
@@ -0,0 +1,4719 @@
+#! /usr/bin/env perl
+# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License"). You may not use
+# this file except in compliance with the License. You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
+#
+# ====================================================================
+# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+#
+# This module implements Poly1305 hash for x86_64.
+#
+# March 2015
+#
+# Initial release.
+#
+# December 2016
+#
+# Add AVX512F+VL+BW code path.
+#
+# November 2017
+#
+# Convert AVX512F+VL+BW code path to pure AVX512F, so that it can be
+# executed even on Knights Landing. Trigger for modification was
+# observation that AVX512 code paths can negatively affect overall
+# Skylake-X system performance. Since we are likely to suppress
+# AVX512F capability flag [at least on Skylake-X], conversion serves
+# as kind of "investment protection". Note that next *lake processor,
+# Cannonlake, has AVX512IFMA code path to execute...
+#
+# Numbers are cycles per processed byte with poly1305_blocks_x86_64 alone,
+# measured with rdtsc at fixed clock frequency.
+#
+#		IALU/gcc-4.8(*)	AVX(**)		AVX2	AVX-512
+# P4		4.46/+120%	-
+# Core 2	2.41/+90%	-
+# Westmere	1.88/+120%	-
+# Sandy Bridge	1.39/+140%	1.10
+# Haswell	1.14/+175%	1.11		0.65
+# Skylake[-X]	1.13/+120%	0.96		0.51	[0.35]
+# Silvermont	2.83/+95%	-
+# Knights L	3.60/?		1.65		1.10	0.41(***)
+# Goldmont	1.70/+180%	-
+# VIA Nano	1.82/+150%	-
+# Sledgehammer	1.38/+160%	-
+# Bulldozer	2.30/+130%	0.97
+# Ryzen		1.15/+200%	1.08		1.18
+#
+# (*)	improvement coefficients relative to clang are more modest and
+#	are ~50% on most processors; in both cases we are comparing to
+#	__int128 code;
+# (**)	an SSE2 implementation was attempted, but among non-AVX processors
+#	it was faster than integer-only code only on older Intel P4 and
+#	Core processors, by 30-50%, with the gain shrinking the newer the
+#	processor; on contemporary ones it is slower, for example almost
+#	2x slower on Atom, and as the former are naturally disappearing,
+#	SSE2 is deemed unnecessary;
+# (***)	strangely enough performance seems to vary from core to core;
+#	listed result is best case;
+
+$flavour = shift;
+$output = shift;
+if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
+
+$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
+die "can't locate x86_64-xlate.pl";
+
+$avx = 3;
+$avx = 2 if ($flavour =~ /macosx/);
+
+open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
+*STDOUT=*OUT;
+
+my ($ctx,$inp,$len,$padbit)=("%rdi","%rsi","%rdx","%rcx");
+my ($mac,$nonce)=($inp,$len);	# *_emit arguments
+my ($d1,$d2,$d3, $r0,$r1,$s1)=map("%r$_",(8..13));
+my ($h0,$h1,$h2)=("%r14","%rbx","%rbp");
+
+sub poly1305_iteration {
+# input:	copy of $r1 in %rax, $h0-$h2, $r0-$r1
+# output:	$h0-$h2 *= $r0-$r1
+$code.=<<___;
+	mulq	$h0			# h0*r1
+	mov	%rax,$d2
+	mov	$r0,%rax
+	mov	%rdx,$d3
+
+	mulq	$h0			# h0*r0
+	mov	%rax,$h0		# future $h0
+	mov	$r0,%rax
+	mov	%rdx,$d1
+
+	mulq	$h1			# h1*r0
+	add	%rax,$d2
+	mov	$s1,%rax
+	adc	%rdx,$d3
+
+	mulq	$h1			# h1*s1
+	mov	$h2,$h1			# borrow $h1
+	add	%rax,$h0
+	adc	%rdx,$d1
+
+	imulq	$s1,$h1			# h2*s1
+	add	$h1,$d2
+	mov	$d1,$h1
+	adc	\$0,$d3
+
+	imulq	$r0,$h2			# h2*r0
+	add	$d2,$h1
+	mov	\$-4,%rax		# mask value
+	adc	$h2,$d3
+
+	and	$d3,%rax		# last reduction step
+	mov	$d3,$h2
+	shr	\$2,$d3
+	and	\$3,$h2
+	add	$d3,%rax
+	add	%rax,$h0
+	adc	\$0,$h1
+	adc	\$0,$h2
+___
+}
+
+########################################################################
+# Layout of opaque area is following.
+# +# unsigned __int64 h[3]; # current hash value base 2^64 +# unsigned __int64 r[2]; # key value base 2^64 + + +$code.=<<___; +.align 64 +.Lconst: +.Lmask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.L129: +.long `1<<24`,0,`1<<24`,0,`1<<24`,0,`1<<24`,0 +.Lmask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.Lpermd_avx2: +.long 2,2,2,3,2,0,2,1 +.Lpermd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +.L2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +.L2_44_inp_shift: +.quad 0,12,24,64 +.L2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +.L2_44_shift_rgt: +.quad 44,44,42,64 +.L2_44_shift_lft: +.quad 8,8,10,64 + +.align 64 +.Lx_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.Lx_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global poly1305_init_x86_64 +.global poly1305_blocks_x86_64 +.global poly1305_emit_x86_64 +.global poly1305_emit_avx +.global poly1305_blocks_avx +.global poly1305_blocks_avx2 +.global poly1305_blocks_avx512 + + +.type poly1305_init_x86_64,\@function,3 +.align 32 +poly1305_init_x86_64: + xor %rax,%rax + mov %rax,0($ctx) # initialize hash value + mov %rax,8($ctx) + mov %rax,16($ctx) + + cmp \$0,$inp + je .Lno_key + +# lea poly1305_blocks_x86_64(%rip),%r10 +# lea poly1305_emit_x86_64(%rip),%r11 +___ +#$code.=<<___ if ($avx); +# mov OPENSSL_ia32cap_P+4(%rip),%r9 +# lea poly1305_blocks_avx(%rip),%rax +# lea poly1305_emit_avx(%rip),%rcx +# bt \$`60-32`,%r9 # AVX? +# cmovc %rax,%r10 +# cmovc %rcx,%r11 +#___ +#$code.=<<___ if ($avx>1); +# lea poly1305_blocks_avx2(%rip),%rax +# bt \$`5+32`,%r9 # AVX2? 
+# cmovc %rax,%r10 +#___ +#$code.=<<___ if ($avx>3); +# mov \$`(1<<31|1<<21|1<<16)`,%rax +# shr \$32,%r9 +# and %rax,%r9 +# cmp %rax,%r9 +# je .Linit_base2_44 +#___ +$code.=<<___; + mov \$0x0ffffffc0fffffff,%rax + mov \$0x0ffffffc0ffffffc,%rcx + and 0($inp),%rax + and 8($inp),%rcx + mov %rax,24($ctx) + mov %rcx,32($ctx) +___ +#$code.=<<___ if ($flavour !~ /elf32/); +# mov %r10,0(%rdx) +# mov %r11,8(%rdx) +#___ +#$code.=<<___ if ($flavour =~ /elf32/); +# mov %r10d,0(%rdx) +# mov %r11d,4(%rdx) +#___ +$code.=<<___; + mov \$1,%eax +.Lno_key: + ret +.size poly1305_init_x86_64,.-poly1305_init_x86_64 + +.type poly1305_blocks_x86_64,\@function,4 +.align 32 +poly1305_blocks_x86_64: +.cfi_startproc +.Lblocks: + shr \$4,$len + jz .Lno_data # too short + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2 + + mov $s1,$r1 + shr \$2,$s1 + mov $r1,%rax + add $r1,$s1 # s1 = r1 + (r1 >> 2) + jmp .Loop + +.align 32 +.Loop: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 +___ + &poly1305_iteration(); +$code.=<<___; + mov $r1,%rax + dec %r15 # len-=16 + jnz .Loop + + mov $h0,0($ctx) # store hash value + mov $h1,8($ctx) + mov $h2,16($ctx) + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data: +.Lblocks_epilogue: + ret +.cfi_endproc +.size poly1305_blocks_x86_64,.-poly1305_blocks_x86_64 + +.type poly1305_emit_x86_64,\@function,3 +.align 32 +poly1305_emit_x86_64: +.Lemit: + mov 0($ctx),%r8 # load hash value + mov 8($ctx),%r9 + mov 16($ctx),%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? + cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_x86_64,.-poly1305_emit_x86_64 +___ +if ($avx) { + +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^26 +# unsigned __int32 is_base2_26; +# unsigned __int64 r[2]; # key value base 2^64 +# unsigned __int64 pad; +# struct { unsigned __int32 r^2, r^1, r^4, r^3; } r[9]; +# +# where r^n are base 2^26 digits of degrees of multiplier key. There are +# 5 digits, but last four are interleaved with multiples of 5, totalling +# in 9 elements: r0, r1, 5*r1, r2, 5*r2, r3, 5*r3, r4, 5*r4. 
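+
+# The interleaved 5*r digits above exist because 2^130 is congruent to 5
+# modulo 2^130-5: any partial product that spills past bit 130 can be folded
+# back in by multiplying it with 5 instead of doing a full reduction. As a
+# cross-check, a hypothetical Math::BigInt reference model of the scalar
+# pipeline (key clamping, one block update, final tag) -- illustration only,
+# not used by the generated code:
+
+use Math::BigInt;
+my $P1305 = Math::BigInt->new(2)->bpow(130)->bsub(5);	# p = 2^130-5
+
+sub poly1305_clamp_ref {	# the masks applied by poly1305_init_x86_64
+    my ($k0, $k1) = @_;		# two little-endian 64-bit key words
+    return ($k0 & 0x0ffffffc0fffffff, $k1 & 0x0ffffffc0ffffffc);
+}
+
+sub poly1305_block_ref {	# h = (h + m + padbit*2^128) * r mod p
+    my ($h, $r, $m, $padbit) = @_;	# Math::BigInt except $padbit
+    my $t = $m->copy->badd(Math::BigInt->new($padbit)->blsft(128));
+    return $h->copy->badd($t)->bmul($r)->bmod($P1305);
+}
+
+sub poly1305_tag_ref {		# what the add-5/cmovnz sequence in *_emit computes
+    my ($h, $s) = @_;		# running hash plus encrypted nonce
+    return $h->copy->bmod($P1305)->badd($s)->bmod(Math::BigInt->new(2)->bpow(128));
+}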
+ +my ($H0,$H1,$H2,$H3,$H4, $T0,$T1,$T2,$T3,$T4, $D0,$D1,$D2,$D3,$D4, $MASK) = + map("%xmm$_",(0..15)); + +$code.=<<___; +.type __poly1305_block,\@abi-omnipotent +.align 32 +__poly1305_block: +___ + &poly1305_iteration(); +$code.=<<___; + ret +.size __poly1305_block,.-__poly1305_block + +.type __poly1305_init_avx,\@abi-omnipotent +.align 32 +__poly1305_init_avx: + mov $r0,$h0 + mov $r1,$h1 + xor $h2,$h2 + + lea 48+64($ctx),$ctx # size optimization + + mov $r1,%rax + call __poly1305_block # r^2 + + mov \$0x3ffffff,%eax # save interleaved r^2 and r base 2^26 + mov \$0x3ffffff,%edx + mov $h0,$d1 + and $h0#d,%eax + mov $r0,$d2 + and $r0#d,%edx + mov %eax,`16*0+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*0+4-64`($ctx) + shr \$26,$d2 + + mov \$0x3ffffff,%eax + mov \$0x3ffffff,%edx + and $d1#d,%eax + and $d2#d,%edx + mov %eax,`16*1+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*1+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*2+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*2+4-64`($ctx) + shr \$26,$d2 + + mov $h1,%rax + mov $r1,%rdx + shl \$12,%rax + shl \$12,%rdx + or $d1,%rax + or $d2,%rdx + and \$0x3ffffff,%eax + and \$0x3ffffff,%edx + mov %eax,`16*3+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*3+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*4+0-64`($ctx) + mov $h1,$d1 + mov %edx,`16*4+4-64`($ctx) + mov $r1,$d2 + + mov \$0x3ffffff,%eax + mov \$0x3ffffff,%edx + shr \$14,$d1 + shr \$14,$d2 + and $d1#d,%eax + and $d2#d,%edx + mov %eax,`16*5+0-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov %edx,`16*5+4-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + mov %eax,`16*6+0-64`($ctx) + shr \$26,$d1 + mov %edx,`16*6+4-64`($ctx) + shr \$26,$d2 + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+0-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d2#d,`16*7+4-64`($ctx) + lea ($d2,$d2,4),$d2 # *5 + mov $d1#d,`16*8+0-64`($ctx) + mov $d2#d,`16*8+4-64`($ctx) + + mov $r1,%rax + call __poly1305_block # r^3 + + mov \$0x3ffffff,%eax # save r^3 base 2^26 + mov $h0,$d1 + and $h0#d,%eax + shr \$26,$d1 + mov %eax,`16*0+12-64`($ctx) + + mov \$0x3ffffff,%edx + and $d1#d,%edx + mov %edx,`16*1+12-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*2+12-64`($ctx) + + mov $h1,%rax + shl \$12,%rax + or $d1,%rax + and \$0x3ffffff,%eax + mov %eax,`16*3+12-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov $h1,$d1 + mov %eax,`16*4+12-64`($ctx) + + mov \$0x3ffffff,%edx + shr \$14,$d1 + and $d1#d,%edx + mov %edx,`16*5+12-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*6+12-64`($ctx) + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+12-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d1#d,`16*8+12-64`($ctx) + + mov $r1,%rax + call __poly1305_block # r^4 + + mov \$0x3ffffff,%eax # save r^4 base 2^26 + mov $h0,$d1 + and $h0#d,%eax + shr \$26,$d1 + mov %eax,`16*0+8-64`($ctx) + + mov \$0x3ffffff,%edx + and $d1#d,%edx + mov %edx,`16*1+8-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*2+8-64`($ctx) + + mov $h1,%rax + shl \$12,%rax + or $d1,%rax + and \$0x3ffffff,%eax + mov %eax,`16*3+8-64`($ctx) + lea (%rax,%rax,4),%eax # *5 + mov $h1,$d1 + mov %eax,`16*4+8-64`($ctx) + + mov \$0x3ffffff,%edx + shr \$14,$d1 + and $d1#d,%edx + mov %edx,`16*5+8-64`($ctx) + lea (%rdx,%rdx,4),%edx # *5 + shr \$26,$d1 + mov %edx,`16*6+8-64`($ctx) + + mov $h2,%rax + shl \$24,%rax + or %rax,$d1 + mov $d1#d,`16*7+8-64`($ctx) + lea ($d1,$d1,4),$d1 # *5 + mov $d1#d,`16*8+8-64`($ctx) + + lea -48-64($ctx),$ctx # size [de-]optimization + ret +.size 
__poly1305_init_avx,.-__poly1305_init_avx + +.type poly1305_blocks_avx,\@function,4 +.align 32 +poly1305_blocks_avx: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx: + and \$-16,$len + jz .Lno_data_avx + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx + + test \$31,$len + jz .Leven_avx + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx_body: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + + call __poly1305_block + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + sub \$16,%r15 + jz .Lstore_base2_26_avx + + vmovd %rax#d,$H0 + vmovd %rdx#d,$H1 + vmovd $h0#d,$H2 + vmovd $h1#d,$H3 + vmovd $h2#d,$H4 + jmp .Lproceed_avx + +.align 32 +.Lstore_base2_64_avx: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx + +.align 16 +.Lstore_base2_26_avx: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx: +.Lblocks_avx_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$31,$len + jz .Linit_avx + + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + +.Linit_avx: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov 
$h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,$H0 + vmovd %rdx#d,$H1 + vmovd $h0#d,$H2 + vmovd $h1#d,$H3 + vmovd $h2#d,$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx: + mov %r15,$len + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx_epilogue: + jmp .Ldo_avx +.cfi_endproc + +.align 32 +.Leven_avx: +.cfi_startproc + vmovd 4*0($ctx),$H0 # load hash value + vmovd 4*1($ctx),$H1 + vmovd 4*2($ctx),$H2 + vmovd 4*3($ctx),$H3 + vmovd 4*4($ctx),$H4 + +.Ldo_avx: +___ +$code.=<<___ if (!$win64); + lea -0x58(%rsp),%r11 +.cfi_def_cfa %r11,0x60 + sub \$0x178,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x218,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx_body: +___ +$code.=<<___; + sub \$64,$len + lea -32($inp),%rax + cmovc %rax,$inp + + vmovdqu `16*3`($ctx),$D4 # preload r0^2 + lea `16*3+64`($ctx),$ctx # size optimization + lea .Lconst(%rip),%rcx + + ################################################################ + # load input + vmovdqu 16*2($inp),$T0 + vmovdqu 16*3($inp),$T1 + vmovdqa 64(%rcx),$MASK # .Lmask26 + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + + vpsrlq \$40,$T4,$T4 # 4 + vpsrlq \$26,$T0,$T1 + vpand $MASK,$T0,$T0 # 0 + vpsrlq \$4,$T3,$T2 + vpand $MASK,$T1,$T1 # 1 + vpsrlq \$30,$T3,$T3 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + jbe .Lskip_loop_avx + + # expand and copy pre-calculated table to stack + vmovdqu `16*1-64`($ctx),$D1 + vmovdqu `16*2-64`($ctx),$D2 + vpshufd \$0xEE,$D4,$D3 # 34xx -> 3434 + vpshufd \$0x44,$D4,$D0 # xx12 -> 1212 + vmovdqa $D3,-0x90(%r11) + vmovdqa $D0,0x00(%rsp) + vpshufd \$0xEE,$D1,$D4 + vmovdqu `16*3-64`($ctx),$D0 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D4,-0x80(%r11) + vmovdqa $D1,0x10(%rsp) + vpshufd \$0xEE,$D2,$D3 + vmovdqu `16*4-64`($ctx),$D1 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D3,-0x70(%r11) + vmovdqa $D2,0x20(%rsp) + vpshufd \$0xEE,$D0,$D4 + vmovdqu `16*5-64`($ctx),$D2 + vpshufd \$0x44,$D0,$D0 + vmovdqa $D4,-0x60(%r11) + vmovdqa $D0,0x30(%rsp) + vpshufd \$0xEE,$D1,$D3 + vmovdqu `16*6-64`($ctx),$D0 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D3,-0x50(%r11) + vmovdqa $D1,0x40(%rsp) + vpshufd \$0xEE,$D2,$D4 + vmovdqu `16*7-64`($ctx),$D1 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D4,-0x40(%r11) + vmovdqa $D2,0x50(%rsp) + vpshufd \$0xEE,$D0,$D3 + vmovdqu `16*8-64`($ctx),$D2 + vpshufd \$0x44,$D0,$D0 + vmovdqa $D3,-0x30(%r11) + vmovdqa $D0,0x60(%rsp) + vpshufd \$0xEE,$D1,$D4 + vpshufd \$0x44,$D1,$D1 + vmovdqa $D4,-0x20(%r11) + vmovdqa $D1,0x70(%rsp) + vpshufd \$0xEE,$D2,$D3 + vmovdqa 0x00(%rsp),$D4 # preload r0^2 + vpshufd \$0x44,$D2,$D2 + vmovdqa $D3,-0x10(%r11) + vmovdqa $D2,0x80(%rsp) + + jmp .Loop_avx + +.align 32 +.Loop_avx: + 
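+	# The vpmuludq/vpaddq chains in this loop evaluate, limb by limb, the
+	# d0..d4 sums quoted in the comment block that follows. A commented-out
+	# scalar model (hypothetical, for cross-checking only) of one such
+	# multiply plus a lazy carry sweep -- the vector code interleaves an
+	# equivalent carry chain with the input loads:
+	#
+	#	sub poly1305_mul26_ref {
+	#	    my ($h, $r) = @_;		# refs to five 26-bit limbs each
+	#	    my @d = (0) x 5;
+	#	    for my $i (0 .. 4) {
+	#	        for my $j (0 .. 4) {	# wrap-around terms pick up a *5
+	#	            my $k = $i - $j;
+	#	            $d[$i] += $k >= 0 ? $h->[$j] * $r->[$k]
+	#	                              : 5 * $h->[$j] * $r->[$k + 5];
+	#	        }
+	#	    }
+	#	    for my $i (0 .. 4) {	# partial carry; limbs may stay >26 bits
+	#	        my $c = $d[$i] >> 26;  $d[$i] &= 0x3ffffff;
+	#	        $i < 4 ? ($d[$i+1] += $c) : ($d[0] += 5 * $c);
+	#	    }
+	#	    return @d;
+	#	}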
################################################################ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + # \___________________/ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + # \___________________/ \____________________/ + # + # Note that we start with inp[2:3]*r^2. This is because it + # doesn't depend on reduction in previous iteration. + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # though note that $Tx and $Hx are "reversed" in this section, + # and $D4 is preloaded with r0^2... + + vpmuludq $T0,$D4,$D0 # d0 = h0*r0 + vpmuludq $T1,$D4,$D1 # d1 = h1*r0 + vmovdqa $H2,0x20(%r11) # offload hash + vpmuludq $T2,$D4,$D2 # d3 = h2*r0 + vmovdqa 0x10(%rsp),$H2 # r1^2 + vpmuludq $T3,$D4,$D3 # d3 = h3*r0 + vpmuludq $T4,$D4,$D4 # d4 = h4*r0 + + vmovdqa $H0,0x00(%r11) # + vpmuludq 0x20(%rsp),$T4,$H0 # h4*s1 + vmovdqa $H1,0x10(%r11) # + vpmuludq $T3,$H2,$H1 # h3*r1 + vpaddq $H0,$D0,$D0 # d0 += h4*s1 + vpaddq $H1,$D4,$D4 # d4 += h3*r1 + vmovdqa $H3,0x30(%r11) # + vpmuludq $T2,$H2,$H0 # h2*r1 + vpmuludq $T1,$H2,$H1 # h1*r1 + vpaddq $H0,$D3,$D3 # d3 += h2*r1 + vmovdqa 0x30(%rsp),$H3 # r2^2 + vpaddq $H1,$D2,$D2 # d2 += h1*r1 + vmovdqa $H4,0x40(%r11) # + vpmuludq $T0,$H2,$H2 # h0*r1 + vpmuludq $T2,$H3,$H0 # h2*r2 + vpaddq $H2,$D1,$D1 # d1 += h0*r1 + + vmovdqa 0x40(%rsp),$H4 # s2^2 + vpaddq $H0,$D4,$D4 # d4 += h2*r2 + vpmuludq $T1,$H3,$H1 # h1*r2 + vpmuludq $T0,$H3,$H3 # h0*r2 + vpaddq $H1,$D3,$D3 # d3 += h1*r2 + vmovdqa 0x50(%rsp),$H2 # r3^2 + vpaddq $H3,$D2,$D2 # d2 += h0*r2 + vpmuludq $T4,$H4,$H0 # h4*s2 + vpmuludq $T3,$H4,$H4 # h3*s2 + vpaddq $H0,$D1,$D1 # d1 += h4*s2 + vmovdqa 0x60(%rsp),$H3 # s3^2 + vpaddq $H4,$D0,$D0 # d0 += h3*s2 + + vmovdqa 0x80(%rsp),$H4 # s4^2 + vpmuludq $T1,$H2,$H1 # h1*r3 + vpmuludq $T0,$H2,$H2 # h0*r3 + vpaddq $H1,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $T4,$H3,$H0 # h4*s3 + vpmuludq $T3,$H3,$H1 # h3*s3 + vpaddq $H0,$D2,$D2 # d2 += h4*s3 + vmovdqu 16*0($inp),$H0 # load input + vpaddq $H1,$D1,$D1 # d1 += h3*s3 + vpmuludq $T2,$H3,$H3 # h2*s3 + vpmuludq $T2,$H4,$T2 # h2*s4 + vpaddq $H3,$D0,$D0 # d0 += h2*s3 + + vmovdqu 16*1($inp),$H1 # + vpaddq $T2,$D1,$D1 # d1 += h2*s4 + vpmuludq $T3,$H4,$T3 # h3*s4 + vpmuludq $T4,$H4,$T4 # h4*s4 + vpsrldq \$6,$H0,$H2 # splat input + vpaddq $T3,$D2,$D2 # d2 += h3*s4 + vpaddq $T4,$D3,$D3 # d3 += h4*s4 + vpsrldq \$6,$H1,$H3 # + vpmuludq 0x70(%rsp),$T0,$T4 # h0*r4 + vpmuludq $T1,$H4,$T0 # h1*s4 + vpunpckhqdq $H1,$H0,$H4 # 4 + vpaddq $T4,$D4,$D4 # d4 += h0*r4 + vmovdqa -0x90(%r11),$T4 # r0^4 + vpaddq $T0,$D0,$D0 # d0 += h1*s4 + + vpunpcklqdq $H1,$H0,$H0 # 0:1 + vpunpcklqdq $H3,$H2,$H3 # 2:3 + + #vpsrlq \$40,$H4,$H4 # 4 + vpsrldq \$`40/8`,$H4,$H4 # 4 + vpsrlq \$26,$H0,$H1 + vpand $MASK,$H0,$H0 # 0 + vpsrlq \$4,$H3,$H2 + vpand $MASK,$H1,$H1 # 1 + vpand 0(%rcx),$H4,$H4 # .Lmask24 + vpsrlq \$30,$H3,$H3 + vpand $MASK,$H2,$H2 # 2 + vpand $MASK,$H3,$H3 # 3 + vpor 32(%rcx),$H4,$H4 # padbit, yes, always + + vpaddq 0x00(%r11),$H0,$H0 # add hash value + vpaddq 0x10(%r11),$H1,$H1 + vpaddq 0x20(%r11),$H2,$H2 + vpaddq 0x30(%r11),$H3,$H3 + vpaddq 0x40(%r11),$H4,$H4 + + lea 16*2($inp),%rax + lea 16*4($inp),$inp + sub 
\$64,$len + cmovc %rax,$inp + + ################################################################ + # Now we accumulate (inp[0:1]+hash)*r^4 + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + vpmuludq $H0,$T4,$T0 # h0*r0 + vpmuludq $H1,$T4,$T1 # h1*r0 + vpaddq $T0,$D0,$D0 + vpaddq $T1,$D1,$D1 + vmovdqa -0x80(%r11),$T2 # r1^4 + vpmuludq $H2,$T4,$T0 # h2*r0 + vpmuludq $H3,$T4,$T1 # h3*r0 + vpaddq $T0,$D2,$D2 + vpaddq $T1,$D3,$D3 + vpmuludq $H4,$T4,$T4 # h4*r0 + vpmuludq -0x70(%r11),$H4,$T0 # h4*s1 + vpaddq $T4,$D4,$D4 + + vpaddq $T0,$D0,$D0 # d0 += h4*s1 + vpmuludq $H2,$T2,$T1 # h2*r1 + vpmuludq $H3,$T2,$T0 # h3*r1 + vpaddq $T1,$D3,$D3 # d3 += h2*r1 + vmovdqa -0x60(%r11),$T3 # r2^4 + vpaddq $T0,$D4,$D4 # d4 += h3*r1 + vpmuludq $H1,$T2,$T1 # h1*r1 + vpmuludq $H0,$T2,$T2 # h0*r1 + vpaddq $T1,$D2,$D2 # d2 += h1*r1 + vpaddq $T2,$D1,$D1 # d1 += h0*r1 + + vmovdqa -0x50(%r11),$T4 # s2^4 + vpmuludq $H2,$T3,$T0 # h2*r2 + vpmuludq $H1,$T3,$T1 # h1*r2 + vpaddq $T0,$D4,$D4 # d4 += h2*r2 + vpaddq $T1,$D3,$D3 # d3 += h1*r2 + vmovdqa -0x40(%r11),$T2 # r3^4 + vpmuludq $H0,$T3,$T3 # h0*r2 + vpmuludq $H4,$T4,$T0 # h4*s2 + vpaddq $T3,$D2,$D2 # d2 += h0*r2 + vpaddq $T0,$D1,$D1 # d1 += h4*s2 + vmovdqa -0x30(%r11),$T3 # s3^4 + vpmuludq $H3,$T4,$T4 # h3*s2 + vpmuludq $H1,$T2,$T1 # h1*r3 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + + vmovdqa -0x10(%r11),$T4 # s4^4 + vpaddq $T1,$D4,$D4 # d4 += h1*r3 + vpmuludq $H0,$T2,$T2 # h0*r3 + vpmuludq $H4,$T3,$T0 # h4*s3 + vpaddq $T2,$D3,$D3 # d3 += h0*r3 + vpaddq $T0,$D2,$D2 # d2 += h4*s3 + vmovdqu 16*2($inp),$T0 # load input + vpmuludq $H3,$T3,$T2 # h3*s3 + vpmuludq $H2,$T3,$T3 # h2*s3 + vpaddq $T2,$D1,$D1 # d1 += h3*s3 + vmovdqu 16*3($inp),$T1 # + vpaddq $T3,$D0,$D0 # d0 += h2*s3 + + vpmuludq $H2,$T4,$H2 # h2*s4 + vpmuludq $H3,$T4,$H3 # h3*s4 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $H2,$D1,$D1 # d1 += h2*s4 + vpmuludq $H4,$T4,$H4 # h4*s4 + vpsrldq \$6,$T1,$T3 # + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*s4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*s4 + vpmuludq -0x20(%r11),$H0,$H4 # h0*r4 + vpmuludq $H1,$T4,$H0 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + + #vpsrlq \$40,$T4,$T4 # 4 + vpsrldq \$`40/8`,$T4,$T4 # 4 + vpsrlq \$26,$T0,$T1 + vmovdqa 0x00(%rsp),$D4 # preload r0^2 + vpand $MASK,$T0,$T0 # 0 + vpsrlq \$4,$T3,$T2 + vpand $MASK,$T1,$T1 # 1 + vpand 0(%rcx),$T4,$T4 # .Lmask24 + vpsrlq \$30,$T3,$T3 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + ################################################################ + # lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + # and P. 
Schwabe + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D0 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D0,$H0,$H0 + vpsllq \$2,$D0,$D0 + vpaddq $D0,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + ja .Loop_avx + +.Lskip_loop_avx: + ################################################################ + # multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + vpshufd \$0x10,$D4,$D4 # r0^n, xx12 -> x1x2 + add \$32,$len + jnz .Long_tail_avx + + vpaddq $H2,$T2,$T2 + vpaddq $H0,$T0,$T0 + vpaddq $H1,$T1,$T1 + vpaddq $H3,$T3,$T3 + vpaddq $H4,$T4,$T4 + +.Long_tail_avx: + vmovdqa $H2,0x20(%r11) + vmovdqa $H0,0x00(%r11) + vmovdqa $H1,0x10(%r11) + vmovdqa $H3,0x30(%r11) + vmovdqa $H4,0x40(%r11) + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + vpmuludq $T2,$D4,$D2 # d2 = h2*r0 + vpmuludq $T0,$D4,$D0 # d0 = h0*r0 + vpshufd \$0x10,`16*1-64`($ctx),$H2 # r1^n + vpmuludq $T1,$D4,$D1 # d1 = h1*r0 + vpmuludq $T3,$D4,$D3 # d3 = h3*r0 + vpmuludq $T4,$D4,$D4 # d4 = h4*r0 + + vpmuludq $T3,$H2,$H0 # h3*r1 + vpaddq $H0,$D4,$D4 # d4 += h3*r1 + vpshufd \$0x10,`16*2-64`($ctx),$H3 # s1^n + vpmuludq $T2,$H2,$H1 # h2*r1 + vpaddq $H1,$D3,$D3 # d3 += h2*r1 + vpshufd \$0x10,`16*3-64`($ctx),$H4 # r2^n + vpmuludq $T1,$H2,$H0 # h1*r1 + vpaddq $H0,$D2,$D2 # d2 += h1*r1 + vpmuludq $T0,$H2,$H2 # h0*r1 + vpaddq $H2,$D1,$D1 # d1 += h0*r1 + vpmuludq $T4,$H3,$H3 # h4*s1 + vpaddq $H3,$D0,$D0 # d0 += h4*s1 + + vpshufd \$0x10,`16*4-64`($ctx),$H2 # s2^n + vpmuludq $T2,$H4,$H1 # h2*r2 + vpaddq $H1,$D4,$D4 # d4 += h2*r2 + vpmuludq $T1,$H4,$H0 # h1*r2 + vpaddq $H0,$D3,$D3 # d3 += h1*r2 + vpshufd \$0x10,`16*5-64`($ctx),$H3 # r3^n + vpmuludq $T0,$H4,$H4 # h0*r2 + vpaddq $H4,$D2,$D2 # d2 += h0*r2 + vpmuludq $T4,$H2,$H1 # h4*s2 + vpaddq $H1,$D1,$D1 # d1 += h4*s2 + vpshufd \$0x10,`16*6-64`($ctx),$H4 # s3^n + vpmuludq $T3,$H2,$H2 # h3*s2 + vpaddq $H2,$D0,$D0 # d0 += h3*s2 + + vpmuludq $T1,$H3,$H0 # h1*r3 + vpaddq $H0,$D4,$D4 # d4 += h1*r3 + vpmuludq $T0,$H3,$H3 # h0*r3 + vpaddq $H3,$D3,$D3 # d3 += h0*r3 + vpshufd \$0x10,`16*7-64`($ctx),$H2 # r4^n + vpmuludq $T4,$H4,$H1 # h4*s3 + vpaddq $H1,$D2,$D2 # d2 += h4*s3 + vpshufd \$0x10,`16*8-64`($ctx),$H3 # s4^n + vpmuludq $T3,$H4,$H0 # h3*s3 + vpaddq $H0,$D1,$D1 # d1 += h3*s3 + vpmuludq $T2,$H4,$H4 # h2*s3 + vpaddq $H4,$D0,$D0 # d0 += h2*s3 + + vpmuludq $T0,$H2,$H2 # h0*r4 + vpaddq $H2,$D4,$D4 # h4 = d4 + h0*r4 + vpmuludq $T4,$H3,$H1 # h4*s4 + vpaddq $H1,$D3,$D3 # h3 = d3 + h4*s4 + vpmuludq $T3,$H3,$H0 # h3*s4 + vpaddq $H0,$D2,$D2 # h2 = d2 + h3*s4 + vpmuludq $T2,$H3,$H1 # h2*s4 + vpaddq $H1,$D1,$D1 # h1 = d1 + h2*s4 + vpmuludq $T1,$H3,$H3 # h1*s4 + vpaddq $H3,$D0,$D0 # h0 = d0 + h1*s4 + + jz .Lshort_tail_avx + + vmovdqu 16*0($inp),$H0 # load input + vmovdqu 16*1($inp),$H1 + + vpsrldq \$6,$H0,$H2 # splat input + vpsrldq \$6,$H1,$H3 + vpunpckhqdq $H1,$H0,$H4 # 4 + vpunpcklqdq $H1,$H0,$H0 # 0:1 + vpunpcklqdq $H3,$H2,$H3 # 2:3 + + vpsrlq \$40,$H4,$H4 # 4 + vpsrlq \$26,$H0,$H1 + vpand $MASK,$H0,$H0 # 0 + 
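+	# (Input splat: the shift/mask steps around this point slice each
+	# 16-byte block into five 26-bit limbs -- bits 0-25, 26-51, 52-77,
+	# 78-103 and 104-129 -- and the vpor with .L129 plants the padbit as
+	# bit 24 of the top limb, i.e. bit 128 of the padded message word.)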
vpsrlq \$4,$H3,$H2 + vpand $MASK,$H1,$H1 # 1 + vpsrlq \$30,$H3,$H3 + vpand $MASK,$H2,$H2 # 2 + vpand $MASK,$H3,$H3 # 3 + vpor 32(%rcx),$H4,$H4 # padbit, yes, always + + vpshufd \$0x32,`16*0-64`($ctx),$T4 # r0^n, 34xx -> x3x4 + vpaddq 0x00(%r11),$H0,$H0 + vpaddq 0x10(%r11),$H1,$H1 + vpaddq 0x20(%r11),$H2,$H2 + vpaddq 0x30(%r11),$H3,$H3 + vpaddq 0x40(%r11),$H4,$H4 + + ################################################################ + # multiply (inp[0:1]+hash) by r^4:r^3 and accumulate + + vpmuludq $H0,$T4,$T0 # h0*r0 + vpaddq $T0,$D0,$D0 # d0 += h0*r0 + vpmuludq $H1,$T4,$T1 # h1*r0 + vpaddq $T1,$D1,$D1 # d1 += h1*r0 + vpmuludq $H2,$T4,$T0 # h2*r0 + vpaddq $T0,$D2,$D2 # d2 += h2*r0 + vpshufd \$0x32,`16*1-64`($ctx),$T2 # r1^n + vpmuludq $H3,$T4,$T1 # h3*r0 + vpaddq $T1,$D3,$D3 # d3 += h3*r0 + vpmuludq $H4,$T4,$T4 # h4*r0 + vpaddq $T4,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T2,$T0 # h3*r1 + vpaddq $T0,$D4,$D4 # d4 += h3*r1 + vpshufd \$0x32,`16*2-64`($ctx),$T3 # s1 + vpmuludq $H2,$T2,$T1 # h2*r1 + vpaddq $T1,$D3,$D3 # d3 += h2*r1 + vpshufd \$0x32,`16*3-64`($ctx),$T4 # r2 + vpmuludq $H1,$T2,$T0 # h1*r1 + vpaddq $T0,$D2,$D2 # d2 += h1*r1 + vpmuludq $H0,$T2,$T2 # h0*r1 + vpaddq $T2,$D1,$D1 # d1 += h0*r1 + vpmuludq $H4,$T3,$T3 # h4*s1 + vpaddq $T3,$D0,$D0 # d0 += h4*s1 + + vpshufd \$0x32,`16*4-64`($ctx),$T2 # s2 + vpmuludq $H2,$T4,$T1 # h2*r2 + vpaddq $T1,$D4,$D4 # d4 += h2*r2 + vpmuludq $H1,$T4,$T0 # h1*r2 + vpaddq $T0,$D3,$D3 # d3 += h1*r2 + vpshufd \$0x32,`16*5-64`($ctx),$T3 # r3 + vpmuludq $H0,$T4,$T4 # h0*r2 + vpaddq $T4,$D2,$D2 # d2 += h0*r2 + vpmuludq $H4,$T2,$T1 # h4*s2 + vpaddq $T1,$D1,$D1 # d1 += h4*s2 + vpshufd \$0x32,`16*6-64`($ctx),$T4 # s3 + vpmuludq $H3,$T2,$T2 # h3*s2 + vpaddq $T2,$D0,$D0 # d0 += h3*s2 + + vpmuludq $H1,$T3,$T0 # h1*r3 + vpaddq $T0,$D4,$D4 # d4 += h1*r3 + vpmuludq $H0,$T3,$T3 # h0*r3 + vpaddq $T3,$D3,$D3 # d3 += h0*r3 + vpshufd \$0x32,`16*7-64`($ctx),$T2 # r4 + vpmuludq $H4,$T4,$T1 # h4*s3 + vpaddq $T1,$D2,$D2 # d2 += h4*s3 + vpshufd \$0x32,`16*8-64`($ctx),$T3 # s4 + vpmuludq $H3,$T4,$T0 # h3*s3 + vpaddq $T0,$D1,$D1 # d1 += h3*s3 + vpmuludq $H2,$T4,$T4 # h2*s3 + vpaddq $T4,$D0,$D0 # d0 += h2*s3 + + vpmuludq $H0,$T2,$T2 # h0*r4 + vpaddq $T2,$D4,$D4 # d4 += h0*r4 + vpmuludq $H4,$T3,$T1 # h4*s4 + vpaddq $T1,$D3,$D3 # d3 += h4*s4 + vpmuludq $H3,$T3,$T0 # h3*s4 + vpaddq $T0,$D2,$D2 # d2 += h3*s4 + vpmuludq $H2,$T3,$T1 # h2*s4 + vpaddq $T1,$D1,$D1 # d1 += h2*s4 + vpmuludq $H1,$T3,$T3 # h1*s4 + vpaddq $T3,$D0,$D0 # d0 += h1*s4 + +.Lshort_tail_avx: + ################################################################ + # horizontal addition + + vpsrldq \$8,$D4,$T4 + vpsrldq \$8,$D3,$T3 + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$D0,$T0 + vpsrldq \$8,$D2,$T2 + vpaddq $T3,$D3,$D3 + vpaddq $T4,$D4,$D4 + vpaddq $T0,$D0,$D0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$D2,$D2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$D3,$H3 + vpand $MASK,$D3,$D3 + vpaddq $H3,$D4,$D4 # h3 -> h4 + + vpsrlq \$26,$D0,$H0 + vpand $MASK,$D0,$D0 + vpaddq $H0,$D1,$D1 # h0 -> h1 + + vpsrlq \$26,$D4,$H4 + vpand $MASK,$D4,$D4 + + vpsrlq \$26,$D1,$H1 + vpand $MASK,$D1,$D1 + vpaddq $H1,$D2,$D2 # h1 -> h2 + + vpaddq $H4,$D0,$D0 + vpsllq \$2,$H4,$H4 + vpaddq $H4,$D0,$D0 # h4 -> h0 + + vpsrlq \$26,$D2,$H2 + vpand $MASK,$D2,$D2 + vpaddq $H2,$D3,$D3 # h2 -> h3 + + vpsrlq \$26,$D0,$H0 + vpand $MASK,$D0,$D0 + vpaddq $H0,$D1,$D1 # h0 -> h1 + + vpsrlq \$26,$D3,$H3 + vpand $MASK,$D3,$D3 + vpaddq $H3,$D4,$D4 # h3 -> h4 + + vmovd $D0,`4*0-48-64`($ctx) # save partially reduced + 
vmovd $D1,`4*1-48-64`($ctx) + vmovd $D2,`4*2-48-64`($ctx) + vmovd $D3,`4*3-48-64`($ctx) + vmovd $D4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx_epilogue: +___ +$code.=<<___ if (!$win64); + lea 0x58(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx,.-poly1305_blocks_avx + +.type poly1305_emit_avx,\@function,3 +.align 32 +poly1305_emit_avx: + cmpl \$0,20($ctx) # is_base2_26? + je .Lemit + + mov 0($ctx),%eax # load hash value base 2^26 + mov 4($ctx),%ecx + mov 8($ctx),%r8d + mov 12($ctx),%r11d + mov 16($ctx),%r10d + + shl \$26,%rcx # base 2^26 -> base 2^64 + mov %r8,%r9 + shl \$52,%r8 + add %rcx,%rax + shr \$12,%r9 + add %rax,%r8 # h0 + adc \$0,%r9 + + shl \$14,%r11 + mov %r10,%rax + shr \$24,%r10 + add %r11,%r9 + shl \$40,%rax + add %rax,%r9 # h1 + adc \$0,%r10 # h2 + + mov %r10,%rax # could be partially reduced, so reduce + mov %r10,%rcx + and \$3,%r10 + shr \$2,%rax + and \$-4,%rcx + add %rcx,%rax + add %rax,%r8 + adc \$0,%r9 + adc \$0,%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? + cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_avx,.-poly1305_emit_avx +___ + +if ($avx>1) { +my ($H0,$H1,$H2,$H3,$H4, $MASK, $T4,$T0,$T1,$T2,$T3, $D0,$D1,$D2,$D3,$D4) = + map("%ymm$_",(0..15)); +my $S4=$MASK; + +$code.=<<___; +.type poly1305_blocks_avx2,\@function,4 +.align 32 +poly1305_blocks_avx2: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx2 + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2: + and \$-16,$len + jz .Lno_data_avx2 + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx2 + + test \$63,$len + jz .Leven_avx2 + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx2_body: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... 
so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + +.Lbase2_26_pre_avx2: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_26_pre_avx2 + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx2 # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + test %r15,%r15 + jz .Lstore_base2_26_avx2 + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + jmp .Lproceed_avx2 + +.align 32 +.Lstore_base2_64_avx2: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx2 + +.align 16 +.Lstore_base2_26_avx2: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx2: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2: +.Lblocks_avx2_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx2_body: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$63,$len + jz .Linit_avx2 + +.Lbase2_64_pre_avx2: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_64_pre_avx2 + +.Linit_avx2: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov $h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx2: + mov %r15,$len # restore $len +# mov OPENSSL_ia32cap_P+8(%rip),%r10d +# mov \$`(1<<31|1<<30|1<<16)`,%r11d + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue: + jmp .Ldo_avx2 +.cfi_endproc + +.align 32 +.Leven_avx2: 
+.cfi_startproc +# mov OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 4*0($ctx),%x#$H0 # load hash value base 2^26 + vmovd 4*1($ctx),%x#$H1 + vmovd 4*2($ctx),%x#$H2 + vmovd 4*3($ctx),%x#$H3 + vmovd 4*4($ctx),%x#$H4 + +.Ldo_avx2: +___ +#$code.=<<___ if ($avx>2); +# cmp \$512,$len +# jb .Lskip_avx512 +# and %r11d,%r10d +# test \$`1<<16`,%r10d # check for AVX512F +# jnz .Lblocks_avx512 +#.Lskip_avx512: +#___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx2_body: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),$T0 # .Lpermd_avx2 + + # expand and copy pre-calculated table to stack + vmovdqu `16*0-64`($ctx),%x#$T2 + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$T3 + vmovdqu `16*2-64`($ctx),%x#$T4 + vmovdqu `16*3-64`($ctx),%x#$D0 + vmovdqu `16*4-64`($ctx),%x#$D1 + vmovdqu `16*5-64`($ctx),%x#$D2 + lea 0x90(%rsp),%rax # size optimization + vmovdqu `16*6-64`($ctx),%x#$D3 + vpermd $T2,$T0,$T2 # 00003412 -> 14243444 + vmovdqu `16*7-64`($ctx),%x#$D4 + vpermd $T3,$T0,$T3 + vmovdqu `16*8-64`($ctx),%x#$MASK + vpermd $T4,$T0,$T4 + vmovdqa $T2,0x00(%rsp) + vpermd $D0,$T0,$D0 + vmovdqa $T3,0x20-0x90(%rax) + vpermd $D1,$T0,$D1 + vmovdqa $T4,0x40-0x90(%rax) + vpermd $D2,$T0,$D2 + vmovdqa $D0,0x60-0x90(%rax) + vpermd $D3,$T0,$D3 + vmovdqa $D1,0x80-0x90(%rax) + vpermd $D4,$T0,$D4 + vmovdqa $D2,0xa0-0x90(%rax) + vpermd $MASK,$T0,$MASK + vmovdqa $D3,0xc0-0x90(%rax) + vmovdqa $D4,0xe0-0x90(%rax) + vmovdqa $MASK,0x100-0x90(%rax) + vmovdqa 64(%rcx),$MASK # .Lmask26 + + ################################################################ + # load input + vmovdqu 16*0($inp),%x#$T0 + vmovdqu 16*1($inp),%x#$T1 + vinserti128 \$1,16*2($inp),$T0,$T0 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T3,$T2,$T2 # 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$64,$len + jz .Ltail_avx2 + jmp .Loop_avx2 + +.align 32 +.Loop_avx2: + ################################################################ + # ((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4 + # ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3 + # ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2 + # ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1 + # \________/\__________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqa `32*0`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqa `32*1`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqa `32*3`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqa `32*6-0x90`(%rax),$T3 # s3^4 + vmovdqa `32*8-0x90`(%rax),$S4 # s4^4 + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + 
h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d4 = h2*r2 + h4*r0 + h3*r1 + h1*r3 + h0*r4 + # d3 = h2*r1 + h3*r0 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h2*5*r4 + h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + # d0 = h2*5*r3 + h0*r0 + h4*5*r1 + h3*5*r2 + h1*5*r4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1, borrow $H2 as temp + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + vmovdqa `32*4-0x90`(%rax),$T1 # s2 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vmovdqu 16*0($inp),%x#$T0 # load input + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + vinserti128 \$1,16*2($inp),$T0,$T0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vmovdqu 16*1($inp),%x#$T1 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqa `32*5-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpsrldq \$6,$T1,$T3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + vpunpckhqdq $T1,$T0,$T4 # 4 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + vpmuludq `32*7-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # lazy reduction (interleaved with tail of input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$4,$T3,$T2 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpand $MASK,$T2,$T2 # 2 + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$30,$T3,$T3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + sub \$64,$len + jnz .Loop_avx2 + + .byte 0x66,0x90 +.Ltail_avx2: + ################################################################ + # while above multiplications 
were by r^4 in all lanes, in last + # iteration we multiply least significant lane by r^4 and most + # significant one by r, so copy of above except that references + # to the precomputed table are displaced by 4... + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqu `32*0+4`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqu `32*1+4`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqu `32*3+4`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqu `32*6+4-0x90`(%rax),$T3 # s3^4 + vmovdqu `32*8+4-0x90`(%rax),$S4 # s4^4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1 + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2+4`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vmovdqu `32*4+4-0x90`(%rax),$T1 # s2 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqu `32*5+4-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpmuludq `32*7+4-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # horizontal addition + + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$H2,$T2 + vpsrldq \$8,$H3,$T3 + vpsrldq \$8,$H4,$T4 + vpsrldq \$8,$H0,$T0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + + vpermq \$0x2,$H3,$T3 + vpermq \$0x2,$H4,$T4 + vpermq \$0x2,$H0,$T0 + vpermq \$0x2,$D1,$T1 + vpermq \$0x2,$H2,$T2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd 
%x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx2_epilogue: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +___ +####################################################################### +if ($avx>2) { +# On entry we have input length divisible by 64. But since inner loop +# processes 128 bytes per iteration, cases when length is not divisible +# by 128 are handled by passing tail 64 bytes to .Ltail_avx2. For this +# reason stack layout is kept identical to poly1305_blocks_avx2. If not +# for this tail, we wouldn't have to even allocate stack frame... + + +$code.=<<___; +.type poly1305_blocks_avx512,\@function,4 +.align 32 +poly1305_blocks_avx512: +.cfi_startproc + mov 20($ctx),%r8d # is_base2_26 + cmp \$128,$len + jae .Lblocks_avx2_512 + test %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2_512: + and \$-16,$len + jz .Lno_data_avx2_512 + + vzeroupper + + test %r8d,%r8d + jz .Lbase2_64_avx2_512 + + test \$63,$len + jz .Leven_avx2_512 + + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lblocks_avx2_body_512: + + mov $len,%r15 # reassign $len + + mov 0($ctx),$d1 # load hash value + mov 8($ctx),$d2 + mov 16($ctx),$h2#d + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + ################################# base 2^26 -> base 2^64 + mov $d1#d,$h0#d + and \$`-1*(1<<31)`,$d1 + mov $d2,$r1 # borrow $r1 + mov $d2#d,$h1#d + and \$`-1*(1<<31)`,$d2 + + shr \$6,$d1 + shl \$52,$r1 + add $d1,$h0 + shr \$12,$h1 + shr \$18,$d2 + add $r1,$h0 + adc $d2,$h1 + + mov $h2,$d1 + shl \$40,$d1 + shr \$24,$h2 + add $d1,$h1 + adc \$0,$h2 # can be partially reduced... + + mov \$-4,$d2 # ... 
so reduce + mov $h2,$d1 + and $h2,$d2 + shr \$2,$d1 + and \$3,$h2 + add $d2,$d1 # =*5 + add $d1,$h0 + adc \$0,$h1 + adc \$0,$h2 + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + +.Lbase2_26_pre_avx2_512: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_26_pre_avx2_512 + + test $padbit,$padbit # if $padbit is zero, + jz .Lstore_base2_64_avx2_512 # store hash in base 2^64 format + + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$r0 + mov $h1,$r1 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$r0 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $r0,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$r1 + and \$0x3ffffff,$h1 # h[3] + or $r1,$h2 # h[4] + + test %r15,%r15 + jz .Lstore_base2_26_avx2_512 + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + jmp .Lproceed_avx2_512 + +.align 32 +.Lstore_base2_64_avx2_512: + mov $h0,0($ctx) + mov $h1,8($ctx) + mov $h2,16($ctx) # note that is_base2_26 is zeroed + jmp .Ldone_avx2_512 + +.align 16 +.Lstore_base2_26_avx2_512: + mov %rax#d,0($ctx) # store hash value base 2^26 + mov %rdx#d,4($ctx) + mov $h0#d,8($ctx) + mov $h1#d,12($ctx) + mov $h2#d,16($ctx) +.align 16 +.Ldone_avx2_512: + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2_512: +.Lblocks_avx2_epilogue_512: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2_512: +.cfi_startproc + push %rbx +.cfi_push %rbx + push %rbp +.cfi_push %rbp + push %r12 +.cfi_push %r12 + push %r13 +.cfi_push %r13 + push %r14 +.cfi_push %r14 + push %r15 +.cfi_push %r15 +.Lbase2_64_avx2_body_512: + + mov $len,%r15 # reassign $len + + mov 24($ctx),$r0 # load r + mov 32($ctx),$s1 + + mov 0($ctx),$h0 # load hash value + mov 8($ctx),$h1 + mov 16($ctx),$h2#d + + mov $s1,$r1 + mov $s1,%rax + shr \$2,$s1 + add $r1,$s1 # s1 = r1 + (r1 >> 2) + + test \$63,$len + jz .Linit_avx2_512 + +.Lbase2_64_pre_avx2_512: + add 0($inp),$h0 # accumulate input + adc 8($inp),$h1 + lea 16($inp),$inp + adc $padbit,$h2 + sub \$16,%r15 + + call __poly1305_block + mov $r1,%rax + + test \$63,%r15 + jnz .Lbase2_64_pre_avx2_512 + +.Linit_avx2_512: + ################################# base 2^64 -> base 2^26 + mov $h0,%rax + mov $h0,%rdx + shr \$52,$h0 + mov $h1,$d1 + mov $h1,$d2 + shr \$26,%rdx + and \$0x3ffffff,%rax # h[0] + shl \$12,$d1 + and \$0x3ffffff,%rdx # h[1] + shr \$14,$h1 + or $d1,$h0 + shl \$24,$h2 + and \$0x3ffffff,$h0 # h[2] + shr \$40,$d2 + and \$0x3ffffff,$h1 # h[3] + or $d2,$h2 # h[4] + + vmovd %rax#d,%x#$H0 + vmovd %rdx#d,%x#$H1 + vmovd $h0#d,%x#$H2 + vmovd $h1#d,%x#$H3 + vmovd $h2#d,%x#$H4 + movl \$1,20($ctx) # set is_base2_26 + + call __poly1305_init_avx + +.Lproceed_avx2_512: + mov %r15,$len # restore $len +# mov OPENSSL_ia32cap_P+8(%rip),%r10d +# mov \$`(1<<31|1<<30|1<<16)`,%r11d + + mov 0(%rsp),%r15 +.cfi_restore %r15 + mov 8(%rsp),%r14 +.cfi_restore %r14 + mov 16(%rsp),%r13 +.cfi_restore %r13 + mov 24(%rsp),%r12 +.cfi_restore %r12 + mov 32(%rsp),%rbp +.cfi_restore %rbp + mov 40(%rsp),%rbx +.cfi_restore %rbx + lea 48(%rsp),%rax + lea 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 
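The .Linit_avx2_512 path above performs the same base 2^64 -> base 2^26 limb split that every vector entry point needs: the 130-bit accumulator, held as 64+64+2 bits in $h0/$h1/$h2, is re-expressed as five 26-bit limbs so that limb products leave headroom in 64-bit vector lanes. A minimal pure-Python model of that round trip, for illustration only (the helper names are invented, not part of this source):

    M26 = (1 << 26) - 1

    def to_base26(h0, h1, h2):
        # five 26-bit limbs, least significant first
        h = h0 | (h1 << 64) | (h2 << 128)
        return [(h >> (26 * i)) & M26 for i in range(5)]

    def to_base64(limbs):
        h = sum(v << (26 * i) for i, v in enumerate(limbs))
        return h & ((1 << 64) - 1), (h >> 64) & ((1 << 64) - 1), h >> 128

    h0, h1, h2 = 0x0123456789abcdef, 0xfedcba9876543210, 0x3
    assert to_base64(to_base26(h0, h1, h2)) == (h0, h1, h2)

The assembly does the split with the shr/shl/or/and sequence seen above rather than one wide shift, but the resulting limbs are the same.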
+.Lbase2_64_avx2_epilogue_512: + jmp .Ldo_avx2_512 +.cfi_endproc + +.align 32 +.Leven_avx2_512: +.cfi_startproc +# mov OPENSSL_ia32cap_P+8(%rip),%r10d + vmovd 4*0($ctx),%x#$H0 # load hash value base 2^26 + vmovd 4*1($ctx),%x#$H1 + vmovd 4*2($ctx),%x#$H2 + vmovd 4*3($ctx),%x#$H3 + vmovd 4*4($ctx),%x#$H4 + +.Ldo_avx2_512: + cmp \$512,$len + jae .Lblocks_avx512 +.Lskip_avx512: +___ +#$code.=<<___ if ($avx>2); +# cmp \$512,$len +# jb .Lskip_avx512 +# and %r11d,%r10d +# test \$`1<<16`,%r10d # check for AVX512F +# jnz .Lblocks_avx512 +#.Lskip_avx512: +#___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx2_body_512: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),$T0 # .Lpermd_avx2 + + # expand and copy pre-calculated table to stack + vmovdqu `16*0-64`($ctx),%x#$T2 + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$T3 + vmovdqu `16*2-64`($ctx),%x#$T4 + vmovdqu `16*3-64`($ctx),%x#$D0 + vmovdqu `16*4-64`($ctx),%x#$D1 + vmovdqu `16*5-64`($ctx),%x#$D2 + lea 0x90(%rsp),%rax # size optimization + vmovdqu `16*6-64`($ctx),%x#$D3 + vpermd $T2,$T0,$T2 # 00003412 -> 14243444 + vmovdqu `16*7-64`($ctx),%x#$D4 + vpermd $T3,$T0,$T3 + vmovdqu `16*8-64`($ctx),%x#$MASK + vpermd $T4,$T0,$T4 + vmovdqa $T2,0x00(%rsp) + vpermd $D0,$T0,$D0 + vmovdqa $T3,0x20-0x90(%rax) + vpermd $D1,$T0,$D1 + vmovdqa $T4,0x40-0x90(%rax) + vpermd $D2,$T0,$D2 + vmovdqa $D0,0x60-0x90(%rax) + vpermd $D3,$T0,$D3 + vmovdqa $D1,0x80-0x90(%rax) + vpermd $D4,$T0,$D4 + vmovdqa $D2,0xa0-0x90(%rax) + vpermd $MASK,$T0,$MASK + vmovdqa $D3,0xc0-0x90(%rax) + vmovdqa $D4,0xe0-0x90(%rax) + vmovdqa $MASK,0x100-0x90(%rax) + vmovdqa 64(%rcx),$MASK # .Lmask26 + + ################################################################ + # load input + vmovdqu 16*0($inp),%x#$T0 + vmovdqu 16*1($inp),%x#$T1 + vinserti128 \$1,16*2($inp),$T0,$T0 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpunpcklqdq $T3,$T2,$T2 # 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$64,$len + jz .Ltail_avx2_512 + jmp .Loop_avx2_512 + +.align 32 +.Loop_avx2_512: + ################################################################ + # ((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4 + # ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3 + # ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2 + # ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1 + # \________/\__________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqa `32*0`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqa `32*1`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqa `32*3`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqa `32*6-0x90`(%rax),$T3 # s3^4 + vmovdqa `32*8-0x90`(%rax),$S4 # s4^4 + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + 
h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d4 = h2*r2 + h4*r0 + h3*r1 + h1*r3 + h0*r4 + # d3 = h2*r1 + h3*r0 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h2*5*r4 + h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + # d0 = h2*5*r3 + h0*r0 + h4*5*r1 + h3*5*r2 + h1*5*r4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1, borrow $H2 as temp + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + vmovdqa `32*4-0x90`(%rax),$T1 # s2 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vmovdqu 16*0($inp),%x#$T0 # load input + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + vinserti128 \$1,16*2($inp),$T0,$T0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vmovdqu 16*1($inp),%x#$T1 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqa `32*5-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + vinserti128 \$1,16*3($inp),$T1,$T1 + lea 16*4($inp),$inp + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpsrldq \$6,$T0,$T2 # splat input + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpsrldq \$6,$T1,$T3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + vpunpckhqdq $T1,$T0,$T4 # 4 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpunpcklqdq $T3,$T2,$T3 # 2:3 + vpmuludq `32*7-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # lazy reduction (interleaved with tail of input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$4,$T3,$T2 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpand $MASK,$T2,$T2 # 2 + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$30,$T3,$T3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpand $MASK,$T0,$T0 # 0 + vpand $MASK,$T1,$T1 # 1 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + + 
sub \$64,$len + jnz .Loop_avx2_512 + + .byte 0x66,0x90 +.Ltail_avx2_512: + ################################################################ + # while above multiplications were by r^4 in all lanes, in last + # iteration we multiply least significant lane by r^4 and most + # significant one by r, so copy of above except that references + # to the precomputed table are displaced by 4... + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + vmovdqu `32*0+4`(%rsp),$T0 # r0^4 + vpaddq $H1,$T1,$H1 + vmovdqu `32*1+4`(%rsp),$T1 # r1^4 + vpaddq $H3,$T3,$H3 + vmovdqu `32*3+4`(%rsp),$T2 # r2^4 + vpaddq $H4,$T4,$H4 + vmovdqu `32*6+4-0x90`(%rax),$T3 # s3^4 + vmovdqu `32*8+4-0x90`(%rax),$S4 # s4^4 + + vpmuludq $H2,$T0,$D2 # d2 = h2*r0 + vpmuludq $H2,$T1,$D3 # d3 = h2*r1 + vpmuludq $H2,$T2,$D4 # d4 = h2*r2 + vpmuludq $H2,$T3,$D0 # d0 = h2*s3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + + vpmuludq $H0,$T1,$T4 # h0*r1 + vpmuludq $H1,$T1,$H2 # h1*r1 + vpaddq $T4,$D1,$D1 # d1 += h0*r1 + vpaddq $H2,$D2,$D2 # d2 += h1*r1 + vpmuludq $H3,$T1,$T4 # h3*r1 + vpmuludq `32*2+4`(%rsp),$H4,$H2 # h4*s1 + vpaddq $T4,$D4,$D4 # d4 += h3*r1 + vpaddq $H2,$D0,$D0 # d0 += h4*s1 + + vpmuludq $H0,$T0,$T4 # h0*r0 + vpmuludq $H1,$T0,$H2 # h1*r0 + vpaddq $T4,$D0,$D0 # d0 += h0*r0 + vmovdqu `32*4+4-0x90`(%rax),$T1 # s2 + vpaddq $H2,$D1,$D1 # d1 += h1*r0 + vpmuludq $H3,$T0,$T4 # h3*r0 + vpmuludq $H4,$T0,$H2 # h4*r0 + vpaddq $T4,$D3,$D3 # d3 += h3*r0 + vpaddq $H2,$D4,$D4 # d4 += h4*r0 + + vpmuludq $H3,$T1,$T4 # h3*s2 + vpmuludq $H4,$T1,$H2 # h4*s2 + vpaddq $T4,$D0,$D0 # d0 += h3*s2 + vpaddq $H2,$D1,$D1 # d1 += h4*s2 + vmovdqu `32*5+4-0x90`(%rax),$H2 # r3 + vpmuludq $H1,$T2,$T4 # h1*r2 + vpmuludq $H0,$T2,$T2 # h0*r2 + vpaddq $T4,$D3,$D3 # d3 += h1*r2 + vpaddq $T2,$D2,$D2 # d2 += h0*r2 + + vpmuludq $H1,$H2,$T4 # h1*r3 + vpmuludq $H0,$H2,$H2 # h0*r3 + vpaddq $T4,$D4,$D4 # d4 += h1*r3 + vpaddq $H2,$D3,$D3 # d3 += h0*r3 + vpmuludq $H3,$T3,$T4 # h3*s3 + vpmuludq $H4,$T3,$H2 # h4*s3 + vpaddq $T4,$D1,$D1 # d1 += h3*s3 + vpaddq $H2,$D2,$D2 # d2 += h4*s3 + + vpmuludq $H3,$S4,$H3 # h3*s4 + vpmuludq $H4,$S4,$H4 # h4*s4 + vpaddq $H3,$D2,$H2 # h2 = d2 + h3*r4 + vpaddq $H4,$D3,$H3 # h3 = d3 + h4*r4 + vpmuludq `32*7+4-0x90`(%rax),$H0,$H4 # h0*r4 + vpmuludq $H1,$S4,$H0 # h1*s4 + vmovdqa 64(%rcx),$MASK # .Lmask26 + vpaddq $H4,$D4,$H4 # h4 = d4 + h0*r4 + vpaddq $H0,$D0,$H0 # h0 = d0 + h1*s4 + + ################################################################ + # horizontal addition + + vpsrldq \$8,$D1,$T1 + vpsrldq \$8,$H2,$T2 + vpsrldq \$8,$H3,$T3 + vpsrldq \$8,$H4,$T4 + vpsrldq \$8,$H0,$T0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + + vpermq \$0x2,$H3,$T3 + vpermq \$0x2,$H4,$T4 + vpermq \$0x2,$H0,$T0 + vpermq \$0x2,$D1,$T1 + vpermq \$0x2,$H2,$T2 + vpaddq $T3,$H3,$H3 + vpaddq $T4,$H4,$H4 + vpaddq $T0,$H0,$H0 + vpaddq $T1,$D1,$D1 + vpaddq $T2,$H2,$H2 + + ################################################################ + # lazy reduction + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$D1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + 
vpaddq $D3,$H4,$H4 # h3 -> h4 + + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd %x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) +___ +$code.=<<___ if ($win64); + vmovdqa 0x50(%r11),%xmm6 + vmovdqa 0x60(%r11),%xmm7 + vmovdqa 0x70(%r11),%xmm8 + vmovdqa 0x80(%r11),%xmm9 + vmovdqa 0x90(%r11),%xmm10 + vmovdqa 0xa0(%r11),%xmm11 + vmovdqa 0xb0(%r11),%xmm12 + vmovdqa 0xc0(%r11),%xmm13 + vmovdqa 0xd0(%r11),%xmm14 + vmovdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx2_epilogue_512: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +___ + + +my ($R0,$R1,$R2,$R3,$R4, $S1,$S2,$S3,$S4) = map("%zmm$_",(16..24)); +my ($M0,$M1,$M2,$M3,$M4) = map("%zmm$_",(25..29)); +my $PADBIT="%zmm30"; + +map(s/%y/%z/,($T4,$T0,$T1,$T2,$T3)); # switch to %zmm domain +map(s/%y/%z/,($D0,$D1,$D2,$D3,$D4)); +map(s/%y/%z/,($H0,$H1,$H2,$H3,$H4)); +map(s/%y/%z/,($MASK)); + + +$code.=<<___; +.cfi_startproc +.Lblocks_avx512: + mov \$15,%eax + kmovw %eax,%k2 +___ +$code.=<<___ if (!$win64); + lea -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + sub \$0x128,%rsp +___ +$code.=<<___ if ($win64); + lea -0xf8(%rsp),%r11 + sub \$0x1c8,%rsp + vmovdqa %xmm6,0x50(%r11) + vmovdqa %xmm7,0x60(%r11) + vmovdqa %xmm8,0x70(%r11) + vmovdqa %xmm9,0x80(%r11) + vmovdqa %xmm10,0x90(%r11) + vmovdqa %xmm11,0xa0(%r11) + vmovdqa %xmm12,0xb0(%r11) + vmovdqa %xmm13,0xc0(%r11) + vmovdqa %xmm14,0xd0(%r11) + vmovdqa %xmm15,0xe0(%r11) +.Ldo_avx512_body: +___ +$code.=<<___; + lea .Lconst(%rip),%rcx + lea 48+64($ctx),$ctx # size optimization + vmovdqa 96(%rcx),%y#$T2 # .Lpermd_avx2 + + # expand pre-calculated table + vmovdqu `16*0-64`($ctx),%x#$D0 # will become expanded ${R0} + and \$-512,%rsp + vmovdqu `16*1-64`($ctx),%x#$D1 # will become ... ${R1} + mov \$0x20,%rax + vmovdqu `16*2-64`($ctx),%x#$T0 # ... ${S1} + vmovdqu `16*3-64`($ctx),%x#$D2 # ... ${R2} + vmovdqu `16*4-64`($ctx),%x#$T1 # ... ${S2} + vmovdqu `16*5-64`($ctx),%x#$D3 # ... ${R3} + vmovdqu `16*6-64`($ctx),%x#$T3 # ... ${S3} + vmovdqu `16*7-64`($ctx),%x#$D4 # ... ${R4} + vmovdqu `16*8-64`($ctx),%x#$T4 # ... 
${S4} + vpermd $D0,$T2,$R0 # 00003412 -> 14243444 + vpbroadcastq 64(%rcx),$MASK # .Lmask26 + vpermd $D1,$T2,$R1 + vpermd $T0,$T2,$S1 + vpermd $D2,$T2,$R2 + vmovdqa64 $R0,0x00(%rsp){%k2} # save in case $len%128 != 0 + vpsrlq \$32,$R0,$T0 # 14243444 -> 01020304 + vpermd $T1,$T2,$S2 + vmovdqu64 $R1,0x00(%rsp,%rax){%k2} + vpsrlq \$32,$R1,$T1 + vpermd $D3,$T2,$R3 + vmovdqa64 $S1,0x40(%rsp){%k2} + vpermd $T3,$T2,$S3 + vpermd $D4,$T2,$R4 + vmovdqu64 $R2,0x40(%rsp,%rax){%k2} + vpermd $T4,$T2,$S4 + vmovdqa64 $S2,0x80(%rsp){%k2} + vmovdqu64 $R3,0x80(%rsp,%rax){%k2} + vmovdqa64 $S3,0xc0(%rsp){%k2} + vmovdqu64 $R4,0xc0(%rsp,%rax){%k2} + vmovdqa64 $S4,0x100(%rsp){%k2} + + ################################################################ + # calculate 5th through 8th powers of the key + # + # d0 = r0'*r0 + r1'*5*r4 + r2'*5*r3 + r3'*5*r2 + r4'*5*r1 + # d1 = r0'*r1 + r1'*r0 + r2'*5*r4 + r3'*5*r3 + r4'*5*r2 + # d2 = r0'*r2 + r1'*r1 + r2'*r0 + r3'*5*r4 + r4'*5*r3 + # d3 = r0'*r3 + r1'*r2 + r2'*r1 + r3'*r0 + r4'*5*r4 + # d4 = r0'*r4 + r1'*r3 + r2'*r2 + r3'*r1 + r4'*r0 + + vpmuludq $T0,$R0,$D0 # d0 = r0'*r0 + vpmuludq $T0,$R1,$D1 # d1 = r0'*r1 + vpmuludq $T0,$R2,$D2 # d2 = r0'*r2 + vpmuludq $T0,$R3,$D3 # d3 = r0'*r3 + vpmuludq $T0,$R4,$D4 # d4 = r0'*r4 + vpsrlq \$32,$R2,$T2 + + vpmuludq $T1,$S4,$M0 + vpmuludq $T1,$R0,$M1 + vpmuludq $T1,$R1,$M2 + vpmuludq $T1,$R2,$M3 + vpmuludq $T1,$R3,$M4 + vpsrlq \$32,$R3,$T3 + vpaddq $M0,$D0,$D0 # d0 += r1'*5*r4 + vpaddq $M1,$D1,$D1 # d1 += r1'*r0 + vpaddq $M2,$D2,$D2 # d2 += r1'*r1 + vpaddq $M3,$D3,$D3 # d3 += r1'*r2 + vpaddq $M4,$D4,$D4 # d4 += r1'*r3 + + vpmuludq $T2,$S3,$M0 + vpmuludq $T2,$S4,$M1 + vpmuludq $T2,$R1,$M3 + vpmuludq $T2,$R2,$M4 + vpmuludq $T2,$R0,$M2 + vpsrlq \$32,$R4,$T4 + vpaddq $M0,$D0,$D0 # d0 += r2'*5*r3 + vpaddq $M1,$D1,$D1 # d1 += r2'*5*r4 + vpaddq $M3,$D3,$D3 # d3 += r2'*r1 + vpaddq $M4,$D4,$D4 # d4 += r2'*r2 + vpaddq $M2,$D2,$D2 # d2 += r2'*r0 + + vpmuludq $T3,$S2,$M0 + vpmuludq $T3,$R0,$M3 + vpmuludq $T3,$R1,$M4 + vpmuludq $T3,$S3,$M1 + vpmuludq $T3,$S4,$M2 + vpaddq $M0,$D0,$D0 # d0 += r3'*5*r2 + vpaddq $M3,$D3,$D3 # d3 += r3'*r0 + vpaddq $M4,$D4,$D4 # d4 += r3'*r1 + vpaddq $M1,$D1,$D1 # d1 += r3'*5*r3 + vpaddq $M2,$D2,$D2 # d2 += r3'*5*r4 + + vpmuludq $T4,$S4,$M3 + vpmuludq $T4,$R0,$M4 + vpmuludq $T4,$S1,$M0 + vpmuludq $T4,$S2,$M1 + vpmuludq $T4,$S3,$M2 + vpaddq $M3,$D3,$D3 # d3 += r2'*5*r4 + vpaddq $M4,$D4,$D4 # d4 += r2'*r0 + vpaddq $M0,$D0,$D0 # d0 += r2'*5*r1 + vpaddq $M1,$D1,$D1 # d1 += r2'*5*r2 + vpaddq $M2,$D2,$D2 # d2 += r2'*5*r3 + + ################################################################ + # load input + vmovdqu64 16*0($inp),%z#$T3 + vmovdqu64 16*4($inp),%z#$T4 + lea 16*8($inp),$inp + + ################################################################ + # lazy reduction + + vpsrlq \$26,$D3,$M3 + vpandq $MASK,$D3,$D3 + vpaddq $M3,$D4,$D4 # d3 -> d4 + + vpsrlq \$26,$D0,$M0 + vpandq $MASK,$D0,$D0 + vpaddq $M0,$D1,$D1 # d0 -> d1 + + vpsrlq \$26,$D4,$M4 + vpandq $MASK,$D4,$D4 + + vpsrlq \$26,$D1,$M1 + vpandq $MASK,$D1,$D1 + vpaddq $M1,$D2,$D2 # d1 -> d2 + + vpaddq $M4,$D0,$D0 + vpsllq \$2,$M4,$M4 + vpaddq $M4,$D0,$D0 # d4 -> d0 + + vpsrlq \$26,$D2,$M2 + vpandq $MASK,$D2,$D2 + vpaddq $M2,$D3,$D3 # d2 -> d3 + + vpsrlq \$26,$D0,$M0 + vpandq $MASK,$D0,$D0 + vpaddq $M0,$D1,$D1 # d0 -> d1 + + vpsrlq \$26,$D3,$M3 + vpandq $MASK,$D3,$D3 + vpaddq $M3,$D4,$D4 # d3 -> d4 + + ################################################################ + # at this point we have 14243444 in $R0-$S4 and 05060708 in + # $D0-$D4, ... 
+ + vpunpcklqdq $T4,$T3,$T0 # transpose input + vpunpckhqdq $T4,$T3,$T4 + + # ... since input 64-bit lanes are ordered as 73625140, we could + # "vperm" it to 76543210 (here and in each loop iteration), *or* + # we could just flow along, hence the goal for $R0-$S4 is + # 1858286838784888 ... + + vmovdqa32 128(%rcx),$M0 # .Lpermd_avx512: + mov \$0x7777,%eax + kmovw %eax,%k1 + + vpermd $R0,$M0,$R0 # 14243444 -> 1---2---3---4--- + vpermd $R1,$M0,$R1 + vpermd $R2,$M0,$R2 + vpermd $R3,$M0,$R3 + vpermd $R4,$M0,$R4 + + vpermd $D0,$M0,${R0}{%k1} # 05060708 -> 1858286838784888 + vpermd $D1,$M0,${R1}{%k1} + vpermd $D2,$M0,${R2}{%k1} + vpermd $D3,$M0,${R3}{%k1} + vpermd $D4,$M0,${R4}{%k1} + + vpslld \$2,$R1,$S1 # *5 + vpslld \$2,$R2,$S2 + vpslld \$2,$R3,$S3 + vpslld \$2,$R4,$S4 + vpaddd $R1,$S1,$S1 + vpaddd $R2,$S2,$S2 + vpaddd $R3,$S3,$S3 + vpaddd $R4,$S4,$S4 + + vpbroadcastq 32(%rcx),$PADBIT # .L129 + + vpsrlq \$52,$T0,$T2 # splat input + vpsllq \$12,$T4,$T3 + vporq $T3,$T2,$T2 + vpsrlq \$26,$T0,$T1 + vpsrlq \$14,$T4,$T3 + vpsrlq \$40,$T4,$T4 # 4 + vpandq $MASK,$T2,$T2 # 2 + vpandq $MASK,$T0,$T0 # 0 + #vpandq $MASK,$T1,$T1 # 1 + #vpandq $MASK,$T3,$T3 # 3 + #vporq $PADBIT,$T4,$T4 # padbit, yes, always + + vpaddq $H2,$T2,$H2 # accumulate input + sub \$192,$len + jbe .Ltail_avx512 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + ################################################################ + # ((inp[0]*r^8+inp[ 8])*r^8+inp[16])*r^8 + # ((inp[1]*r^8+inp[ 9])*r^8+inp[17])*r^7 + # ((inp[2]*r^8+inp[10])*r^8+inp[18])*r^6 + # ((inp[3]*r^8+inp[11])*r^8+inp[19])*r^5 + # ((inp[4]*r^8+inp[12])*r^8+inp[20])*r^4 + # ((inp[5]*r^8+inp[13])*r^8+inp[21])*r^3 + # ((inp[6]*r^8+inp[14])*r^8+inp[22])*r^2 + # ((inp[7]*r^8+inp[15])*r^8+inp[23])*r^1 + # \________/\___________/ + ################################################################ + #vpaddq $H2,$T2,$H2 # accumulate input + + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + # + # however, as h2 is "chronologically" first one available pull + # corresponding operations up, so it's + # + # d3 = h2*r1 + h0*r3 + h1*r2 + h3*r0 + h4*5*r4 + # d4 = h2*r2 + h0*r4 + h1*r3 + h3*r1 + h4*r0 + # d0 = h2*5*r3 + h0*r0 + h1*5*r4 + h3*5*r2 + h4*5*r1 + # d1 = h2*5*r4 + h0*r1 + h1*r0 + h3*5*r3 + h4*5*r2 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + + vpmuludq $H2,$R1,$D3 # d3 = h2*r1 + vpaddq $H0,$T0,$H0 + vpmuludq $H2,$R2,$D4 # d4 = h2*r2 + vpandq $MASK,$T1,$T1 # 1 + vpmuludq $H2,$S3,$D0 # d0 = h2*s3 + vpandq $MASK,$T3,$T3 # 3 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + vporq $PADBIT,$T4,$T4 # padbit, yes, always + vpmuludq $H2,$R0,$D2 # d2 = h2*r0 + vpaddq $H1,$T1,$H1 # accumulate input + vpaddq $H3,$T3,$H3 + vpaddq $H4,$T4,$H4 + + vmovdqu64 16*0($inp),$T3 # load input + vmovdqu64 16*4($inp),$T4 + lea 16*8($inp),$inp + vpmuludq $H0,$R3,$M3 + vpmuludq $H0,$R4,$M4 + vpmuludq $H0,$R0,$M0 + vpmuludq $H0,$R1,$M1 + vpaddq $M3,$D3,$D3 # d3 += h0*r3 + vpaddq $M4,$D4,$D4 # d4 += h0*r4 + vpaddq $M0,$D0,$D0 # d0 += h0*r0 + vpaddq $M1,$D1,$D1 # d1 += h0*r1 + + vpmuludq $H1,$R2,$M3 + vpmuludq $H1,$R3,$M4 + vpmuludq $H1,$S4,$M0 + vpmuludq $H0,$R2,$M2 + vpaddq $M3,$D3,$D3 # d3 += h1*r2 + vpaddq $M4,$D4,$D4 # d4 += h1*r3 + vpaddq $M0,$D0,$D0 # d0 += h1*s4 + vpaddq $M2,$D2,$D2 # d2 += h0*r2 + + vpunpcklqdq $T4,$T3,$T0 # transpose input + vpunpckhqdq $T4,$T3,$T4 + + vpmuludq $H3,$R0,$M3 + 
vpmuludq $H3,$R1,$M4 + vpmuludq $H1,$R0,$M1 + vpmuludq $H1,$R1,$M2 + vpaddq $M3,$D3,$D3 # d3 += h3*r0 + vpaddq $M4,$D4,$D4 # d4 += h3*r1 + vpaddq $M1,$D1,$D1 # d1 += h1*r0 + vpaddq $M2,$D2,$D2 # d2 += h1*r1 + + vpmuludq $H4,$S4,$M3 + vpmuludq $H4,$R0,$M4 + vpmuludq $H3,$S2,$M0 + vpmuludq $H3,$S3,$M1 + vpaddq $M3,$D3,$D3 # d3 += h4*s4 + vpmuludq $H3,$S4,$M2 + vpaddq $M4,$D4,$D4 # d4 += h4*r0 + vpaddq $M0,$D0,$D0 # d0 += h3*s2 + vpaddq $M1,$D1,$D1 # d1 += h3*s3 + vpaddq $M2,$D2,$D2 # d2 += h3*s4 + + vpmuludq $H4,$S1,$M0 + vpmuludq $H4,$S2,$M1 + vpmuludq $H4,$S3,$M2 + vpaddq $M0,$D0,$H0 # h0 = d0 + h4*s1 + vpaddq $M1,$D1,$H1 # h1 = d2 + h4*s2 + vpaddq $M2,$D2,$H2 # h2 = d3 + h4*s3 + + ################################################################ + # lazy reduction (interleaved with input splat) + + vpsrlq \$52,$T0,$T2 # splat input + vpsllq \$12,$T4,$T3 + + vpsrlq \$26,$D3,$H3 + vpandq $MASK,$D3,$D3 + vpaddq $H3,$D4,$H4 # h3 -> h4 + + vporq $T3,$T2,$T2 + + vpsrlq \$26,$H0,$D0 + vpandq $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpandq $MASK,$T2,$T2 # 2 + + vpsrlq \$26,$H4,$D4 + vpandq $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpandq $MASK,$H1,$H1 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpaddq $T2,$H2,$H2 # modulo-scheduled + vpsrlq \$26,$T0,$T1 + + vpsrlq \$26,$H2,$D2 + vpandq $MASK,$H2,$H2 + vpaddq $D2,$D3,$H3 # h2 -> h3 + + vpsrlq \$14,$T4,$T3 + + vpsrlq \$26,$H0,$D0 + vpandq $MASK,$H0,$H0 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$40,$T4,$T4 # 4 + + vpsrlq \$26,$H3,$D3 + vpandq $MASK,$H3,$H3 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpandq $MASK,$T0,$T0 # 0 + #vpandq $MASK,$T1,$T1 # 1 + #vpandq $MASK,$T3,$T3 # 3 + #vporq $PADBIT,$T4,$T4 # padbit, yes, always + + sub \$128,$len + ja .Loop_avx512 + +.Ltail_avx512: + ################################################################ + # while above multiplications were by r^8 in all lanes, in last + # iteration we multiply least significant lane by r^8 and most + # significant one by r, that's why table gets shifted... 
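The tail comment above is the core of the vector schedule, and a small model may help. Bulk iterations fold eight blocks at a time into eight independent lanes, multiplying every lane by r^8; the last iteration instead multiplies lane j by r^(8-j), which is why the precomputed table is shifted below. A hedged pure-Python sketch of that schedule, checked against serial Horner evaluation (function names are invented; this is a model of the math, not of the register layout):

    import os

    P = (1 << 130) - 5

    def poly1305_serial(blocks, r):
        h = 0
        for m in blocks:
            h = (h + m) * r % P
        return h

    def poly1305_8way(blocks, r):
        # len(blocks) must be a multiple of 8 for this sketch
        lanes = [0] * 8
        r8 = pow(r, 8, P)
        for i in range(0, len(blocks) - 8, 8):
            for j in range(8):                  # bulk: r^8 in all lanes
                lanes[j] = (lanes[j] + blocks[i + j]) * r8 % P
        for j, m in enumerate(blocks[-8:]):     # tail: lane j gets r^(8-j)
            lanes[j] = (lanes[j] + m) * pow(r, 8 - j, P) % P
        return sum(lanes) % P

    r = int.from_bytes(os.urandom(16), "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    blocks = [int.from_bytes(os.urandom(16), "little") | (1 << 128) for _ in range(24)]
    assert poly1305_serial(blocks, r) == poly1305_8way(blocks, r)

Summing the lanes at the end corresponds to the "horizontal addition" blocks in the assembly.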
+ + vpsrlq \$32,$R0,$R0 # 0105020603070408 + vpsrlq \$32,$R1,$R1 + vpsrlq \$32,$R2,$R2 + vpsrlq \$32,$S3,$S3 + vpsrlq \$32,$S4,$S4 + vpsrlq \$32,$R3,$R3 + vpsrlq \$32,$R4,$R4 + vpsrlq \$32,$S1,$S1 + vpsrlq \$32,$S2,$S2 + + ################################################################ + # load either next or last 64 byte of input + lea ($inp,$len),$inp + + #vpaddq $H2,$T2,$H2 # accumulate input + vpaddq $H0,$T0,$H0 + + vpmuludq $H2,$R1,$D3 # d3 = h2*r1 + vpmuludq $H2,$R2,$D4 # d4 = h2*r2 + vpmuludq $H2,$S3,$D0 # d0 = h2*s3 + vpandq $MASK,$T1,$T1 # 1 + vpmuludq $H2,$S4,$D1 # d1 = h2*s4 + vpandq $MASK,$T3,$T3 # 3 + vpmuludq $H2,$R0,$D2 # d2 = h2*r0 + vporq $PADBIT,$T4,$T4 # padbit, yes, always + vpaddq $H1,$T1,$H1 # accumulate input + vpaddq $H3,$T3,$H3 + vpaddq $H4,$T4,$H4 + + vmovdqu 16*0($inp),%x#$T0 + vpmuludq $H0,$R3,$M3 + vpmuludq $H0,$R4,$M4 + vpmuludq $H0,$R0,$M0 + vpmuludq $H0,$R1,$M1 + vpaddq $M3,$D3,$D3 # d3 += h0*r3 + vpaddq $M4,$D4,$D4 # d4 += h0*r4 + vpaddq $M0,$D0,$D0 # d0 += h0*r0 + vpaddq $M1,$D1,$D1 # d1 += h0*r1 + + vmovdqu 16*1($inp),%x#$T1 + vpmuludq $H1,$R2,$M3 + vpmuludq $H1,$R3,$M4 + vpmuludq $H1,$S4,$M0 + vpmuludq $H0,$R2,$M2 + vpaddq $M3,$D3,$D3 # d3 += h1*r2 + vpaddq $M4,$D4,$D4 # d4 += h1*r3 + vpaddq $M0,$D0,$D0 # d0 += h1*s4 + vpaddq $M2,$D2,$D2 # d2 += h0*r2 + + vinserti128 \$1,16*2($inp),%y#$T0,%y#$T0 + vpmuludq $H3,$R0,$M3 + vpmuludq $H3,$R1,$M4 + vpmuludq $H1,$R0,$M1 + vpmuludq $H1,$R1,$M2 + vpaddq $M3,$D3,$D3 # d3 += h3*r0 + vpaddq $M4,$D4,$D4 # d4 += h3*r1 + vpaddq $M1,$D1,$D1 # d1 += h1*r0 + vpaddq $M2,$D2,$D2 # d2 += h1*r1 + + vinserti128 \$1,16*3($inp),%y#$T1,%y#$T1 + vpmuludq $H4,$S4,$M3 + vpmuludq $H4,$R0,$M4 + vpmuludq $H3,$S2,$M0 + vpmuludq $H3,$S3,$M1 + vpmuludq $H3,$S4,$M2 + vpaddq $M3,$D3,$H3 # h3 = d3 + h4*s4 + vpaddq $M4,$D4,$D4 # d4 += h4*r0 + vpaddq $M0,$D0,$D0 # d0 += h3*s2 + vpaddq $M1,$D1,$D1 # d1 += h3*s3 + vpaddq $M2,$D2,$D2 # d2 += h3*s4 + + vpmuludq $H4,$S1,$M0 + vpmuludq $H4,$S2,$M1 + vpmuludq $H4,$S3,$M2 + vpaddq $M0,$D0,$H0 # h0 = d0 + h4*s1 + vpaddq $M1,$D1,$H1 # h1 = d2 + h4*s2 + vpaddq $M2,$D2,$H2 # h2 = d3 + h4*s3 + + ################################################################ + # horizontal addition + + mov \$1,%eax + vpermq \$0xb1,$H3,$D3 + vpermq \$0xb1,$D4,$H4 + vpermq \$0xb1,$H0,$D0 + vpermq \$0xb1,$H1,$D1 + vpermq \$0xb1,$H2,$D2 + vpaddq $D3,$H3,$H3 + vpaddq $D4,$H4,$H4 + vpaddq $D0,$H0,$H0 + vpaddq $D1,$H1,$H1 + vpaddq $D2,$H2,$H2 + + kmovw %eax,%k3 + vpermq \$0x2,$H3,$D3 + vpermq \$0x2,$H4,$D4 + vpermq \$0x2,$H0,$D0 + vpermq \$0x2,$H1,$D1 + vpermq \$0x2,$H2,$D2 + vpaddq $D3,$H3,$H3 + vpaddq $D4,$H4,$H4 + vpaddq $D0,$H0,$H0 + vpaddq $D1,$H1,$H1 + vpaddq $D2,$H2,$H2 + + vextracti64x4 \$0x1,$H3,%y#$D3 + vextracti64x4 \$0x1,$H4,%y#$D4 + vextracti64x4 \$0x1,$H0,%y#$D0 + vextracti64x4 \$0x1,$H1,%y#$D1 + vextracti64x4 \$0x1,$H2,%y#$D2 + vpaddq $D3,$H3,${H3}{%k3}{z} # keep single qword in case + vpaddq $D4,$H4,${H4}{%k3}{z} # it's passed to .Ltail_avx2 + vpaddq $D0,$H0,${H0}{%k3}{z} + vpaddq $D1,$H1,${H1}{%k3}{z} + vpaddq $D2,$H2,${H2}{%k3}{z} +___ +map(s/%z/%y/,($T0,$T1,$T2,$T3,$T4, $PADBIT)); +map(s/%z/%y/,($H0,$H1,$H2,$H3,$H4, $D0,$D1,$D2,$D3,$D4, $MASK)); +$code.=<<___; + ################################################################ + # lazy reduction (interleaved with input splat) + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpsrldq \$6,$T0,$T2 # splat input + vpsrldq \$6,$T1,$T3 + vpunpckhqdq $T1,$T0,$T4 # 4 + vpaddq $D3,$H4,$H4 # h3 -> h4 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpunpcklqdq $T3,$T2,$T2 
# 2:3 + vpunpcklqdq $T1,$T0,$T0 # 0:1 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H4,$D4 + vpand $MASK,$H4,$H4 + + vpsrlq \$26,$H1,$D1 + vpand $MASK,$H1,$H1 + vpsrlq \$30,$T2,$T3 + vpsrlq \$4,$T2,$T2 + vpaddq $D1,$H2,$H2 # h1 -> h2 + + vpaddq $D4,$H0,$H0 + vpsllq \$2,$D4,$D4 + vpsrlq \$26,$T0,$T1 + vpsrlq \$40,$T4,$T4 # 4 + vpaddq $D4,$H0,$H0 # h4 -> h0 + + vpsrlq \$26,$H2,$D2 + vpand $MASK,$H2,$H2 + vpand $MASK,$T2,$T2 # 2 + vpand $MASK,$T0,$T0 # 0 + vpaddq $D2,$H3,$H3 # h2 -> h3 + + vpsrlq \$26,$H0,$D0 + vpand $MASK,$H0,$H0 + vpaddq $H2,$T2,$H2 # accumulate input for .Ltail_avx2 + vpand $MASK,$T1,$T1 # 1 + vpaddq $D0,$H1,$H1 # h0 -> h1 + + vpsrlq \$26,$H3,$D3 + vpand $MASK,$H3,$H3 + vpand $MASK,$T3,$T3 # 3 + vpor 32(%rcx),$T4,$T4 # padbit, yes, always + vpaddq $D3,$H4,$H4 # h3 -> h4 + + lea 0x90(%rsp),%rax # size optimization for .Ltail_avx2 + add \$64,$len + jnz .Ltail_avx2_512 + + vpsubq $T2,$H2,$H2 # undo input accumulation + vmovd %x#$H0,`4*0-48-64`($ctx)# save partially reduced + vmovd %x#$H1,`4*1-48-64`($ctx) + vmovd %x#$H2,`4*2-48-64`($ctx) + vmovd %x#$H3,`4*3-48-64`($ctx) + vmovd %x#$H4,`4*4-48-64`($ctx) + vzeroall +___ +$code.=<<___ if ($win64); + movdqa 0x50(%r11),%xmm6 + movdqa 0x60(%r11),%xmm7 + movdqa 0x70(%r11),%xmm8 + movdqa 0x80(%r11),%xmm9 + movdqa 0x90(%r11),%xmm10 + movdqa 0xa0(%r11),%xmm11 + movdqa 0xb0(%r11),%xmm12 + movdqa 0xc0(%r11),%xmm13 + movdqa 0xd0(%r11),%xmm14 + movdqa 0xe0(%r11),%xmm15 + lea 0xf8(%r11),%rsp +.Ldo_avx512_epilogue: +___ +$code.=<<___ if (!$win64); + lea 8(%r11),%rsp +.cfi_def_cfa %rsp,8 +___ +$code.=<<___; + ret +.cfi_endproc +.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 +___ +if ($avx>3 && 0) { +######################################################################## +# VPMADD52 version using 2^44 radix. +# +# One can argue that base 2^52 would be more natural. Well, even though +# some operations would be more natural, one has to recognize couple of +# things. Base 2^52 doesn't provide advantage over base 2^44 if you look +# at amount of multiply-n-accumulate operations. Secondly, it makes it +# impossible to pre-compute multiples of 5 [referred to as s[]/sN in +# reference implementations], which means that more such operations +# would have to be performed in inner loop, which in turn makes critical +# path longer. In other words, even though base 2^44 reduction might +# look less elegant, overall critical path is actually shorter... + +######################################################################## +# Layout of opaque area is following. 
+# +# unsigned __int64 h[3]; # current hash value base 2^44 +# unsigned __int64 s[2]; # key value*20 base 2^44 +# unsigned __int64 r[3]; # key value base 2^44 +# struct { unsigned __int64 r^1, r^3, r^2, r^4; } R[4]; +# # r^n positions reflect +# # placement in register, not +# # memory, R[3] is R[1]*20 + +$code.=<<___; +.type poly1305_init_base2_44,\@function,3 +.align 32 +poly1305_init_base2_44: + xor %rax,%rax + mov %rax,0($ctx) # initialize hash value + mov %rax,8($ctx) + mov %rax,16($ctx) + +.Linit_base2_44: + lea poly1305_blocks_vpmadd52(%rip),%r10 + lea poly1305_emit_base2_44(%rip),%r11 + + mov \$0x0ffffffc0fffffff,%rax + mov \$0x0ffffffc0ffffffc,%rcx + and 0($inp),%rax + mov \$0x00000fffffffffff,%r8 + and 8($inp),%rcx + mov \$0x00000fffffffffff,%r9 + and %rax,%r8 + shrd \$44,%rcx,%rax + mov %r8,40($ctx) # r0 + and %r9,%rax + shr \$24,%rcx + mov %rax,48($ctx) # r1 + lea (%rax,%rax,4),%rax # *5 + mov %rcx,56($ctx) # r2 + shl \$2,%rax # magic <<2 + lea (%rcx,%rcx,4),%rcx # *5 + shl \$2,%rcx # magic <<2 + mov %rax,24($ctx) # s1 + mov %rcx,32($ctx) # s2 + movq \$-1,64($ctx) # write impossible value +___ +$code.=<<___ if ($flavour !~ /elf32/); + mov %r10,0(%rdx) + mov %r11,8(%rdx) +___ +$code.=<<___ if ($flavour =~ /elf32/); + mov %r10d,0(%rdx) + mov %r11d,4(%rdx) +___ +$code.=<<___; + mov \$1,%eax + ret +.size poly1305_init_base2_44,.-poly1305_init_base2_44 +___ +{ +my ($H0,$H1,$H2,$r2r1r0,$r1r0s2,$r0s2s1,$Dlo,$Dhi) = map("%ymm$_",(0..5,16,17)); +my ($T0,$inp_permd,$inp_shift,$PAD) = map("%ymm$_",(18..21)); +my ($reduc_mask,$reduc_rght,$reduc_left) = map("%ymm$_",(22..25)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52,\@function,4 +.align 32 +poly1305_blocks_vpmadd52: + shr \$4,$len + jz .Lno_data_vpmadd52 # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + + # if powers of the key are not calculated yet, process up to 3 + # blocks with this single-block subroutine, otherwise ensure that + # length is divisible by 2 blocks and pass the rest down to next + # subroutine... + + mov \$3,%rax + mov \$1,%r10 + cmp \$4,$len # is input long + cmovae %r10,%rax + test %r8,%r8 # is power value impossible? + cmovns %r10,%rax + + and $len,%rax # is input of favourable length? 
+ jz .Lblocks_vpmadd52_4x + + sub %rax,$len + mov \$7,%r10d + mov \$1,%r11d + kmovw %r10d,%k7 + lea .L2_44_inp_permd(%rip),%r10 + kmovw %r11d,%k1 + + vmovq $padbit,%x#$PAD + vmovdqa64 0(%r10),$inp_permd # .L2_44_inp_permd + vmovdqa64 32(%r10),$inp_shift # .L2_44_inp_shift + vpermq \$0xcf,$PAD,$PAD + vmovdqa64 64(%r10),$reduc_mask # .L2_44_mask + + vmovdqu64 0($ctx),${Dlo}{%k7}{z} # load hash value + vmovdqu64 40($ctx),${r2r1r0}{%k7}{z} # load keys + vmovdqu64 32($ctx),${r1r0s2}{%k7}{z} + vmovdqu64 24($ctx),${r0s2s1}{%k7}{z} + + vmovdqa64 96(%r10),$reduc_rght # .L2_44_shift_rgt + vmovdqa64 128(%r10),$reduc_left # .L2_44_shift_lft + + jmp .Loop_vpmadd52 + +.align 32 +.Loop_vpmadd52: + vmovdqu32 0($inp),%x#$T0 # load input as ----3210 + lea 16($inp),$inp + + vpermd $T0,$inp_permd,$T0 # ----3210 -> --322110 + vpsrlvq $inp_shift,$T0,$T0 + vpandq $reduc_mask,$T0,$T0 + vporq $PAD,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo # accumulate input + + vpermq \$0,$Dlo,${H0}{%k7}{z} # smash hash value + vpermq \$0b01010101,$Dlo,${H1}{%k7}{z} + vpermq \$0b10101010,$Dlo,${H2}{%k7}{z} + + vpxord $Dlo,$Dlo,$Dlo + vpxord $Dhi,$Dhi,$Dhi + + vpmadd52luq $r2r1r0,$H0,$Dlo + vpmadd52huq $r2r1r0,$H0,$Dhi + + vpmadd52luq $r1r0s2,$H1,$Dlo + vpmadd52huq $r1r0s2,$H1,$Dhi + + vpmadd52luq $r0s2s1,$H2,$Dlo + vpmadd52huq $r0s2s1,$H2,$Dhi + + vpsrlvq $reduc_rght,$Dlo,$T0 # 0 in topmost qword + vpsllvq $reduc_left,$Dhi,$Dhi # 0 in topmost qword + vpandq $reduc_mask,$Dlo,$Dlo + + vpaddq $T0,$Dhi,$Dhi + + vpermq \$0b10010011,$Dhi,$Dhi # 0 in lowest qword + + vpaddq $Dhi,$Dlo,$Dlo # note topmost qword :-) + + vpsrlvq $reduc_rght,$Dlo,$T0 # 0 in topmost word + vpandq $reduc_mask,$Dlo,$Dlo + + vpermq \$0b10010011,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo + + vpermq \$0b10010011,$Dlo,${T0}{%k1}{z} + + vpaddq $T0,$Dlo,$Dlo + vpsllq \$2,$T0,$T0 + + vpaddq $T0,$Dlo,$Dlo + + dec %rax # len-=16 + jnz .Loop_vpmadd52 + + vmovdqu64 $Dlo,0($ctx){%k7} # store hash value + + test $len,$len + jnz .Lblocks_vpmadd52_4x + +.Lno_data_vpmadd52: + ret +.size poly1305_blocks_vpmadd52,.-poly1305_blocks_vpmadd52 +___ +} +{ +######################################################################## +# As implied by its name 4x subroutine processes 4 blocks in parallel +# (but handles even 4*n+2 blocks lengths). It takes up to 4th key power +# and is handled in 256-bit %ymm registers. + +my ($H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2) = map("%ymm$_",(0..5,16,17)); +my ($D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi) = map("%ymm$_",(18..23)); +my ($T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD) = map("%ymm$_",(24..31)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52_4x,\@function,4 +.align 32 +poly1305_blocks_vpmadd52_4x: + shr \$4,$len + jz .Lno_data_vpmadd52_4x # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + +.Lblocks_vpmadd52_4x: + vpbroadcastq $padbit,$PAD + + vmovdqa64 .Lx_mask44(%rip),$mask44 + mov \$5,%eax + vmovdqa64 .Lx_mask42(%rip),$mask42 + kmovw %eax,%k1 # used in 2x path + + test %r8,%r8 # is power value impossible? + js .Linit_vpmadd52 # if it is, then init R[4] + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + + test \$3,$len # is length 4*n+2? + jnz .Lblocks_vpmadd52_2x_do + +.Lblocks_vpmadd52_4x_do: + vpbroadcastq 64($ctx),$R0 # load 4th power of the key + vpbroadcastq 96($ctx),$R1 + vpbroadcastq 128($ctx),$R2 + vpbroadcastq 160($ctx),$S1 + +.Lblocks_vpmadd52_4x_key_loaded: + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + test \$7,$len # is len 8*n? 
+ jz .Lblocks_vpmadd52_8x + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*2($inp),$T3 + lea 16*4($inp),$inp + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as 3-1-2-0 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + sub \$4,$len + jz .Ltail_vpmadd52_4x + jmp .Loop_vpmadd52_4x + ud2 + +.align 32 +.Linit_vpmadd52: + vmovq 24($ctx),%x#$S1 # load key + vmovq 56($ctx),%x#$H2 + vmovq 32($ctx),%x#$S2 + vmovq 40($ctx),%x#$R0 + vmovq 48($ctx),%x#$R1 + + vmovdqa $R0,$H0 + vmovdqa $R1,$H1 + vmovdqa $H2,$R2 + + mov \$2,%eax + +.Lmul_init_vpmadd52: + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + dec %eax + jz .Ldone_init_vpmadd52 + + vpunpcklqdq $R1,$H1,$R1 # 1,2 + vpbroadcastq %x#$H1,%x#$H1 # 2,2 + vpunpcklqdq $R2,$H2,$R2 + vpbroadcastq %x#$H2,%x#$H2 + vpunpcklqdq $R0,$H0,$R0 + vpbroadcastq %x#$H0,%x#$H0 + + vpsllq \$2,$R1,$S1 # S1 = R1*5*4 + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R1,$S1,$S1 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S1,$S1 + vpsllq \$2,$S2,$S2 + + jmp .Lmul_init_vpmadd52 + ud2 + +.align 32 +.Ldone_init_vpmadd52: + vinserti128 \$1,%x#$R1,$H1,$R1 # 1,2,3,4 + vinserti128 \$1,%x#$R2,$H2,$R2 + vinserti128 \$1,%x#$R0,$H0,$R0 + + vpermq \$0b11011000,$R1,$R1 # 1,3,2,4 + vpermq \$0b11011000,$R2,$R2 + vpermq \$0b11011000,$R0,$R0 + + vpsllq \$2,$R1,$S1 # S1 = R1*5*4 + vpaddq $R1,$S1,$S1 + vpsllq \$2,$S1,$S1 + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + + test \$3,$len # is length 4*n+2? 
+ jnz .Ldone_init_vpmadd52_2x + + vmovdqu64 $R0,64($ctx) # save key powers + vpbroadcastq %x#$R0,$R0 # broadcast 4th power + vmovdqu64 $R1,96($ctx) + vpbroadcastq %x#$R1,$R1 + vmovdqu64 $R2,128($ctx) + vpbroadcastq %x#$R2,$R2 + vmovdqu64 $S1,160($ctx) + vpbroadcastq %x#$S1,$S1 + + jmp .Lblocks_vpmadd52_4x_key_loaded + ud2 + +.align 32 +.Ldone_init_vpmadd52_2x: + vmovdqu64 $R0,64($ctx) # save key powers + vpsrldq \$8,$R0,$R0 # 0-1-0-2 + vmovdqu64 $R1,96($ctx) + vpsrldq \$8,$R1,$R1 + vmovdqu64 $R2,128($ctx) + vpsrldq \$8,$R2,$R2 + vmovdqu64 $S1,160($ctx) + vpsrldq \$8,$S1,$S1 + jmp .Lblocks_vpmadd52_2x_key_loaded + ud2 + +.align 32 +.Lblocks_vpmadd52_2x_do: + vmovdqu64 128+8($ctx),${R2}{%k1}{z}# load 2nd and 1st key powers + vmovdqu64 160+8($ctx),${S1}{%k1}{z} + vmovdqu64 64+8($ctx),${R0}{%k1}{z} + vmovdqu64 96+8($ctx),${R1}{%k1}{z} + +.Lblocks_vpmadd52_2x_key_loaded: + vmovdqu64 16*0($inp),$T2 # load data + vpxorq $T3,$T3,$T3 + lea 16*2($inp),$inp + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as x-1-x-0 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + jmp .Ltail_vpmadd52_2x + ud2 + +.align 32 +.Loop_vpmadd52_4x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*2($inp),$T3 + lea 16*4($inp),$inp + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction (interleaved with data splat) + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpsrlq \$24,$T3,$T2 + vporq $PAD,$T2,$T2 + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + sub \$4,$len # len-=64 + jnz .Loop_vpmadd52_4x + +.Ltail_vpmadd52_4x: + vmovdqu64 128($ctx),$R2 # load all key powers + vmovdqu64 160($ctx),$S1 + vmovdqu64 64($ctx),$R0 + vmovdqu64 96($ctx),$R1 + +.Ltail_vpmadd52_2x: + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + 
vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # horizontal addition + + mov \$1,%eax + kmovw %eax,%k1 + vpsrldq \$8,$D0lo,$T0 + vpsrldq \$8,$D0hi,$H0 + vpsrldq \$8,$D1lo,$T1 + vpsrldq \$8,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpsrldq \$8,$D2lo,$T2 + vpsrldq \$8,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vpermq \$0x2,$D0lo,$T0 + vpermq \$0x2,$D0hi,$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vpermq \$0x2,$D1lo,$T1 + vpermq \$0x2,$D1hi,$H1 + vpaddq $T0,$D0lo,${D0lo}{%k1}{z} + vpaddq $H0,$D0hi,${D0hi}{%k1}{z} + vpermq \$0x2,$D2lo,$T2 + vpermq \$0x2,$D2hi,$H2 + vpaddq $T1,$D1lo,${D1lo}{%k1}{z} + vpaddq $H1,$D1hi,${D1hi}{%k1}{z} + vpaddq $T2,$D2lo,${D2lo}{%k1}{z} + vpaddq $H2,$D2hi,${D2hi}{%k1}{z} + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + # at this point $len is + # either 4*n+2 or 0... + sub \$2,$len # len-=32 + ja .Lblocks_vpmadd52_4x_do + + vmovq %x#$H0,0($ctx) + vmovq %x#$H1,8($ctx) + vmovq %x#$H2,16($ctx) + vzeroall + +.Lno_data_vpmadd52_4x: + ret +.size poly1305_blocks_vpmadd52_4x,.-poly1305_blocks_vpmadd52_4x +___ +} +{ +######################################################################## +# As implied by its name 8x subroutine processes 8 blocks in parallel... +# This is intermediate version, as it's used only in cases when input +# length is either 8*n, 8*n+1 or 8*n+2... + +my ($H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2) = map("%ymm$_",(0..5,16,17)); +my ($D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi) = map("%ymm$_",(18..23)); +my ($T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD) = map("%ymm$_",(24..31)); +my ($RR0,$RR1,$RR2,$SS1,$SS2) = map("%ymm$_",(6..10)); + +$code.=<<___; +.type poly1305_blocks_vpmadd52_8x,\@function,4 +.align 32 +poly1305_blocks_vpmadd52_8x: + shr \$4,$len + jz .Lno_data_vpmadd52_8x # too short + + shl \$40,$padbit + mov 64($ctx),%r8 # peek on power of the key + + vmovdqa64 .Lx_mask44(%rip),$mask44 + vmovdqa64 .Lx_mask42(%rip),$mask42 + + test %r8,%r8 # is power value impossible? 
+ js .Linit_vpmadd52 # if it is, then init R[4] + + vmovq 0($ctx),%x#$H0 # load current hash value + vmovq 8($ctx),%x#$H1 + vmovq 16($ctx),%x#$H2 + +.Lblocks_vpmadd52_8x: + ################################################################ + # first we calculate more key powers + + vmovdqu64 128($ctx),$R2 # load 1-3-2-4 powers + vmovdqu64 160($ctx),$S1 + vmovdqu64 64($ctx),$R0 + vmovdqu64 96($ctx),$R1 + + vpsllq \$2,$R2,$S2 # S2 = R2*5*4 + vpaddq $R2,$S2,$S2 + vpsllq \$2,$S2,$S2 + + vpbroadcastq %x#$R2,$RR2 # broadcast 4th power + vpbroadcastq %x#$R0,$RR0 + vpbroadcastq %x#$R1,$RR1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $RR2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $RR2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $RR2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $RR2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $RR2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $RR2,$R0,$D2hi + + vpmadd52luq $RR0,$R0,$D0lo + vpmadd52huq $RR0,$R0,$D0hi + vpmadd52luq $RR0,$R1,$D1lo + vpmadd52huq $RR0,$R1,$D1hi + vpmadd52luq $RR0,$R2,$D2lo + vpmadd52huq $RR0,$R2,$D2hi + + vpmadd52luq $RR1,$S2,$D0lo + vpmadd52huq $RR1,$S2,$D0hi + vpmadd52luq $RR1,$R0,$D1lo + vpmadd52huq $RR1,$R0,$D1hi + vpmadd52luq $RR1,$R1,$D2lo + vpmadd52huq $RR1,$R1,$D2hi + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$RR0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$RR1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$RR2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$RR0,$RR0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$RR0,$RR0 + + vpsrlq \$44,$RR0,$tmp # additional step + vpandq $mask44,$RR0,$RR0 + + vpaddq $tmp,$RR1,$RR1 + + ################################################################ + # At this point Rx holds 1324 powers, RRx - 5768, and the goal + # is 15263748, which reflects how data is loaded...
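The multiply pattern above, together with the S2 = R2*5*4 precomputation, is a base-2^44 multiplication modulo 2^130-5: product terms that land at or beyond bit 130 wrap around times 5, and because the limbs sit at bits 0/44/88 those wrapped terms pick up an extra factor of 4, hence the *20 multipliers. A scalar C model under the same limb layout (a sketch; fe1305 and fe_mul are illustrative names, not this module's API):

#include <stdint.h>
typedef unsigned __int128 u128;
typedef struct { uint64_t l[3]; } fe1305;   /* base-2^44/44/42 element */

/* h = a*b mod 2^130-5, following the d0/d1/d2 term pattern of the
 * vpmadd52 code above: s1 = 20*b1, s2 = 20*b2 absorb the wrap-around. */
static fe1305 fe_mul(fe1305 a, fe1305 b)
{
    const uint64_t m44 = ((uint64_t)1 << 44) - 1;
    const uint64_t m42 = ((uint64_t)1 << 42) - 1;
    uint64_t s1 = b.l[1] * 20, s2 = b.l[2] * 20;
    u128 d0 = (u128)a.l[0]*b.l[0] + (u128)a.l[1]*s2     + (u128)a.l[2]*s1;
    u128 d1 = (u128)a.l[0]*b.l[1] + (u128)a.l[1]*b.l[0] + (u128)a.l[2]*s2;
    u128 d2 = (u128)a.l[0]*b.l[2] + (u128)a.l[1]*b.l[1] + (u128)a.l[2]*b.l[0];
    fe1305 h;
    uint64_t c;
    d1 += d0 >> 44;           h.l[0] = (uint64_t)d0 & m44;
    d2 += d1 >> 44;           h.l[1] = (uint64_t)d1 & m44;
    c = (uint64_t)(d2 >> 42); h.l[2] = (uint64_t)d2 & m42;
    h.l[0] += c * 5;          /* 2^130 = 5 mod p */
    h.l[1] += h.l[0] >> 44;   h.l[0] &= m44;
    return h;
}
/* e.g. the 8x path derives r^5..r^8 from the stored r^1..r^4 by one
 * more multiplication with r^4: pow[i+4] = fe_mul(pow[i], pow[3]). */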
+ + vpunpcklqdq $R2,$RR2,$T2 # 3748 + vpunpckhqdq $R2,$RR2,$R2 # 1526 + vpunpcklqdq $R0,$RR0,$T0 + vpunpckhqdq $R0,$RR0,$R0 + vpunpcklqdq $R1,$RR1,$T1 + vpunpckhqdq $R1,$RR1,$R1 +___ +######## switch to %zmm +map(s/%y/%z/, $H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2); +map(s/%y/%z/, $D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi); +map(s/%y/%z/, $T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD); +map(s/%y/%z/, $RR0,$RR1,$RR2,$SS1,$SS2); + +$code.=<<___; + vshufi64x2 \$0x44,$R2,$T2,$RR2 # 15263748 + vshufi64x2 \$0x44,$R0,$T0,$RR0 + vshufi64x2 \$0x44,$R1,$T1,$RR1 + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*4($inp),$T3 + lea 16*8($inp),$inp + + vpsllq \$2,$RR2,$SS2 # S2 = R2*5*4 + vpsllq \$2,$RR1,$SS1 # S1 = R1*5*4 + vpaddq $RR2,$SS2,$SS2 + vpaddq $RR1,$SS1,$SS1 + vpsllq \$2,$SS2,$SS2 + vpsllq \$2,$SS1,$SS1 + + vpbroadcastq $padbit,$PAD + vpbroadcastq %x#$mask44,$mask44 + vpbroadcastq %x#$mask42,$mask42 + + vpbroadcastq %x#$SS1,$S1 # broadcast 8th power + vpbroadcastq %x#$SS2,$S2 + vpbroadcastq %x#$RR0,$R0 + vpbroadcastq %x#$RR1,$R1 + vpbroadcastq %x#$RR2,$R2 + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + + # at this point 64-bit lanes are ordered as 73625140 + + vpsrlq \$24,$T3,$T2 # splat the data + vporq $PAD,$T2,$T2 + vpaddq $T2,$H2,$H2 # accumulate input + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + sub \$8,$len + jz .Ltail_vpmadd52_8x + jmp .Loop_vpmadd52_8x + +.align 32 +.Loop_vpmadd52_8x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$S1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$S1,$D0hi + vpxorq $D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$S2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$S2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$R0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$R0,$D2hi + + vmovdqu64 16*0($inp),$T2 # load data + vmovdqu64 16*4($inp),$T3 + lea 16*8($inp),$inp + vpmadd52luq $H0,$R0,$D0lo + vpmadd52huq $H0,$R0,$D0hi + vpmadd52luq $H0,$R1,$D1lo + vpmadd52huq $H0,$R1,$D1hi + vpmadd52luq $H0,$R2,$D2lo + vpmadd52huq $H0,$R2,$D2hi + + vpunpcklqdq $T3,$T2,$T1 # transpose data + vpunpckhqdq $T3,$T2,$T3 + vpmadd52luq $H1,$S2,$D0lo + vpmadd52huq $H1,$S2,$D0hi + vpmadd52luq $H1,$R0,$D1lo + vpmadd52huq $H1,$R0,$D1hi + vpmadd52luq $H1,$R1,$D2lo + vpmadd52huq $H1,$R1,$D2hi + + ################################################################ + # partial reduction (interleaved with data splat) + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpsrlq \$24,$T3,$T2 + vporq $PAD,$T2,$T2 + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpandq $mask44,$T1,$T0 + vpsrlq \$44,$T1,$T1 + vpsllq \$20,$T3,$T3 + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vporq $T3,$T1,$T1 + vpandq $mask44,$T1,$T1 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + sub \$8,$len # len-=128 + jnz .Loop_vpmadd52_8x + +.Ltail_vpmadd52_8x: + #vpaddq $T2,$H2,$H2 # accumulate input + vpaddq $T0,$H0,$H0 + vpaddq $T1,$H1,$H1 + + vpxorq $D0lo,$D0lo,$D0lo + vpmadd52luq $H2,$SS1,$D0lo + vpxorq $D0hi,$D0hi,$D0hi + vpmadd52huq $H2,$SS1,$D0hi + vpxorq 
$D1lo,$D1lo,$D1lo + vpmadd52luq $H2,$SS2,$D1lo + vpxorq $D1hi,$D1hi,$D1hi + vpmadd52huq $H2,$SS2,$D1hi + vpxorq $D2lo,$D2lo,$D2lo + vpmadd52luq $H2,$RR0,$D2lo + vpxorq $D2hi,$D2hi,$D2hi + vpmadd52huq $H2,$RR0,$D2hi + + vpmadd52luq $H0,$RR0,$D0lo + vpmadd52huq $H0,$RR0,$D0hi + vpmadd52luq $H0,$RR1,$D1lo + vpmadd52huq $H0,$RR1,$D1hi + vpmadd52luq $H0,$RR2,$D2lo + vpmadd52huq $H0,$RR2,$D2hi + + vpmadd52luq $H1,$SS2,$D0lo + vpmadd52huq $H1,$SS2,$D0hi + vpmadd52luq $H1,$RR0,$D1lo + vpmadd52huq $H1,$RR0,$D1hi + vpmadd52luq $H1,$RR1,$D2lo + vpmadd52huq $H1,$RR1,$D2hi + + ################################################################ + # horizontal addition + + mov \$1,%eax + kmovw %eax,%k1 + vpsrldq \$8,$D0lo,$T0 + vpsrldq \$8,$D0hi,$H0 + vpsrldq \$8,$D1lo,$T1 + vpsrldq \$8,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpsrldq \$8,$D2lo,$T2 + vpsrldq \$8,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vpermq \$0x2,$D0lo,$T0 + vpermq \$0x2,$D0hi,$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vpermq \$0x2,$D1lo,$T1 + vpermq \$0x2,$D1hi,$H1 + vpaddq $T0,$D0lo,$D0lo + vpaddq $H0,$D0hi,$D0hi + vpermq \$0x2,$D2lo,$T2 + vpermq \$0x2,$D2hi,$H2 + vpaddq $T1,$D1lo,$D1lo + vpaddq $H1,$D1hi,$D1hi + vextracti64x4 \$1,$D0lo,%y#$T0 + vextracti64x4 \$1,$D0hi,%y#$H0 + vpaddq $T2,$D2lo,$D2lo + vpaddq $H2,$D2hi,$D2hi + + vextracti64x4 \$1,$D1lo,%y#$T1 + vextracti64x4 \$1,$D1hi,%y#$H1 + vextracti64x4 \$1,$D2lo,%y#$T2 + vextracti64x4 \$1,$D2hi,%y#$H2 +___ +######## switch back to %ymm +map(s/%z/%y/, $H0,$H1,$H2,$R0,$R1,$R2,$S1,$S2); +map(s/%z/%y/, $D0lo,$D0hi,$D1lo,$D1hi,$D2lo,$D2hi); +map(s/%z/%y/, $T0,$T1,$T2,$T3,$mask44,$mask42,$tmp,$PAD); + +$code.=<<___; + vpaddq $T0,$D0lo,${D0lo}{%k1}{z} + vpaddq $H0,$D0hi,${D0hi}{%k1}{z} + vpaddq $T1,$D1lo,${D1lo}{%k1}{z} + vpaddq $H1,$D1hi,${D1hi}{%k1}{z} + vpaddq $T2,$D2lo,${D2lo}{%k1}{z} + vpaddq $H2,$D2hi,${D2hi}{%k1}{z} + + ################################################################ + # partial reduction + vpsrlq \$44,$D0lo,$tmp + vpsllq \$8,$D0hi,$D0hi + vpandq $mask44,$D0lo,$H0 + vpaddq $tmp,$D0hi,$D0hi + + vpaddq $D0hi,$D1lo,$D1lo + + vpsrlq \$44,$D1lo,$tmp + vpsllq \$8,$D1hi,$D1hi + vpandq $mask44,$D1lo,$H1 + vpaddq $tmp,$D1hi,$D1hi + + vpaddq $D1hi,$D2lo,$D2lo + + vpsrlq \$42,$D2lo,$tmp + vpsllq \$10,$D2hi,$D2hi + vpandq $mask42,$D2lo,$H2 + vpaddq $tmp,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + vpsllq \$2,$D2hi,$D2hi + + vpaddq $D2hi,$H0,$H0 + + vpsrlq \$44,$H0,$tmp # additional step + vpandq $mask44,$H0,$H0 + + vpaddq $tmp,$H1,$H1 + + ################################################################ + + vmovq %x#$H0,0($ctx) + vmovq %x#$H1,8($ctx) + vmovq %x#$H2,16($ctx) + vzeroall + +.Lno_data_vpmadd52_8x: + ret +.size poly1305_blocks_vpmadd52_8x,.-poly1305_blocks_vpmadd52_8x +___ +} +$code.=<<___; +.type poly1305_emit_base2_44,\@function,3 +.align 32 +poly1305_emit_base2_44: + mov 0($ctx),%r8 # load hash value + mov 8($ctx),%r9 + mov 16($ctx),%r10 + + mov %r9,%rax + shr \$20,%r9 + shl \$44,%rax + mov %r10,%rcx + shr \$40,%r10 + shl \$24,%rcx + + add %rax,%r8 + adc %rcx,%r9 + adc \$0,%r10 + + mov %r8,%rax + add \$5,%r8 # compare to modulus + mov %r9,%rcx + adc \$0,%r9 + adc \$0,%r10 + shr \$2,%r10 # did 130-bit value overflow? 
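poly1305_emit_base2_44 packs the three base-2^44 limbs into two 64-bit words plus two top bits, adds 5 to probe bit 130 (set exactly when h >= 2^130-5), selects h or h+5 with the cmovnz pair that follows, and finally adds the nonce modulo 2^128. The same steps in scalar C (a sketch, with a branch where the assembly stays branch-free; assumes a little-endian host; emit44 is a made-up name):

#include <stdint.h>
#include <string.h>
typedef unsigned __int128 u128;

static void emit44(const uint64_t h[3], const uint64_t nonce[2],
                   uint8_t mac[16])
{
    /* pack 44+44+42-bit limbs into 64+64 bits plus two top bits */
    u128 w0 = (u128)h[0] + ((u128)h[1] << 44);
    u128 w1 = (u128)(h[1] >> 20) + ((u128)h[2] << 24) + (uint64_t)(w0 >> 64);
    uint64_t lo = (uint64_t)w0, hi = (uint64_t)w1;
    uint64_t top = (uint64_t)(w1 >> 64);          /* bits 128..129 */

    u128 t = (u128)lo + 5;                        /* compare to 2^130-5 */
    uint64_t lo5 = (uint64_t)t;
    t = (u128)hi + (uint64_t)(t >> 64);
    uint64_t hi5 = (uint64_t)t;
    if ((top + (uint64_t)(t >> 64)) >> 2) {       /* did bit 130 appear? */
        lo = lo5;                                 /* h-p == h+5 mod 2^128 */
        hi = hi5;
    }
    t = (u128)lo + nonce[0];                      /* accumulate nonce */
    lo = (uint64_t)t;
    hi += nonce[1] + (uint64_t)(t >> 64);
    memcpy(mac + 0, &lo, 8);                      /* little-endian tag */
    memcpy(mac + 8, &hi, 8);
}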
+ cmovnz %r8,%rax + cmovnz %r9,%rcx + + add 0($nonce),%rax # accumulate nonce + adc 8($nonce),%rcx + mov %rax,0($mac) # write result + mov %rcx,8($mac) + + ret +.size poly1305_emit_base2_44,.-poly1305_emit_base2_44 +___ +} } } +} + + +# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame, +# CONTEXT *context,DISPATCHER_CONTEXT *disp) +if ($win64) { +$rec="%rcx"; +$frame="%rdx"; +$context="%r8"; +$disp="%r9"; + +$code.=<<___; +.extern __imp_RtlVirtualUnwind +.type se_handler,\@abi-omnipotent +.align 16 +se_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<.Lprologue + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=.Lepilogue + jae .Lcommon_seh_tail + + lea 48(%rax),%rax + + mov -8(%rax),%rbx + mov -16(%rax),%rbp + mov -24(%rax),%r12 + mov -32(%rax),%r13 + mov -40(%rax),%r14 + mov -48(%rax),%r15 + mov %rbx,144($context) # restore context->Rbx + mov %rbp,160($context) # restore context->Rbp + mov %r12,216($context) # restore context->R12 + mov %r13,224($context) # restore context->R13 + mov %r14,232($context) # restore context->R14 + mov %r15,240($context) # restore context->R15 + + jmp .Lcommon_seh_tail +.size se_handler,.-se_handler + +.type avx_handler,\@abi-omnipotent +.align 16 +avx_handler: + push %rsi + push %rdi + push %rbx + push %rbp + push %r12 + push %r13 + push %r14 + push %r15 + pushfq + sub \$64,%rsp + + mov 120($context),%rax # pull context->Rax + mov 248($context),%rbx # pull context->Rip + + mov 8($disp),%rsi # disp->ImageBase + mov 56($disp),%r11 # disp->HandlerData + + mov 0(%r11),%r10d # HandlerData[0] + lea (%rsi,%r10),%r10 # prologue label + cmp %r10,%rbx # context->Rip<prologue label + jb .Lcommon_seh_tail + + mov 152($context),%rax # pull context->Rsp + + mov 4(%r11),%r10d # HandlerData[1] + lea (%rsi,%r10),%r10 # epilogue label + cmp %r10,%rbx # context->Rip>=epilogue label + jae .Lcommon_seh_tail + + mov 208($context),%rax # pull context->R11 + + lea 0x50(%rax),%rsi + lea 0xf8(%rax),%rax + lea 512($context),%rdi # &context.Xmm6 + mov \$20,%ecx + .long 0xa548f3fc # cld; rep movsq + +.Lcommon_seh_tail: + mov 8(%rax),%rdi + mov 16(%rax),%rsi + mov %rax,152($context) # restore context->Rsp + mov %rsi,168($context) # restore context->Rsi + mov %rdi,176($context) # restore context->Rdi + + mov 40($disp),%rdi # disp->ContextRecord + mov $context,%rsi # context + mov \$154,%ecx # sizeof(CONTEXT) + .long 0xa548f3fc # cld; rep movsq + + mov $disp,%rsi + xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER + mov 8(%rsi),%rdx # arg2, disp->ImageBase + mov 0(%rsi),%r8 # arg3, disp->ControlPc + mov 16(%rsi),%r9 # arg4, disp->FunctionEntry + mov 40(%rsi),%r10 # disp->ContextRecord + lea 56(%rsi),%r11 # &disp->HandlerData + lea 24(%rsi),%r12 # &disp->EstablisherFrame + mov %r10,32(%rsp) # arg5 + mov %r11,40(%rsp) # arg6 + mov %r12,48(%rsp) # arg7 + mov %rcx,56(%rsp) # arg8, (NULL) + call *__imp_RtlVirtualUnwind(%rip) + + mov \$1,%eax # ExceptionContinueSearch + add \$64,%rsp + popfq + pop %r15 + pop %r14 + pop %r13 + pop %r12 + pop %rbp + pop %rbx + pop %rdi + pop %rsi + ret +.size avx_handler,.-avx_handler + +.section .pdata +.align 4 + .rva .LSEH_begin_poly1305_init_x86_64
+ .rva .LSEH_end_poly1305_init_x86_64 + .rva .LSEH_info_poly1305_init + + .rva .LSEH_begin_poly1305_blocks_x86_64 + .rva .LSEH_end_poly1305_blocks_x86_64 + .rva .LSEH_info_poly1305_blocks + + .rva .LSEH_begin_poly1305_emit_x86_64 + .rva .LSEH_end_poly1305_emit_x86_64 + .rva .LSEH_info_poly1305_emit +___ +$code.=<<___ if ($avx); + .rva .LSEH_begin_poly1305_blocks_avx + .rva .Lbase2_64_avx + .rva .LSEH_info_poly1305_blocks_avx_1 + + .rva .Lbase2_64_avx + .rva .Leven_avx + .rva .LSEH_info_poly1305_blocks_avx_2 + + .rva .Leven_avx + .rva .LSEH_end_poly1305_blocks_avx + .rva .LSEH_info_poly1305_blocks_avx_3 + + .rva .LSEH_begin_poly1305_emit_avx + .rva .LSEH_end_poly1305_emit_avx + .rva .LSEH_info_poly1305_emit_avx +___ +$code.=<<___ if ($avx>1); + .rva .LSEH_begin_poly1305_blocks_avx2 + .rva .Lbase2_64_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_1 + + .rva .Lbase2_64_avx2 + .rva .Leven_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_2 + + .rva .Leven_avx2 + .rva .LSEH_end_poly1305_blocks_avx2 + .rva .LSEH_info_poly1305_blocks_avx2_3 +___ +$code.=<<___ if ($avx>2); + .rva .LSEH_begin_poly1305_blocks_avx512 + .rva .LSEH_end_poly1305_blocks_avx512 + .rva .LSEH_info_poly1305_blocks_avx512 +___ +$code.=<<___; +.section .xdata +.align 8 +.LSEH_info_poly1305_init: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_init_x86_64,.LSEH_begin_poly1305_init_x86_64 + +.LSEH_info_poly1305_blocks: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_body,.Lblocks_epilogue + +.LSEH_info_poly1305_emit: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_emit_x86_64,.LSEH_begin_poly1305_emit_x86_64 +___ +$code.=<<___ if ($avx); +.LSEH_info_poly1305_blocks_avx_1: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_avx_body,.Lblocks_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx_2: + .byte 9,0,0,0 + .rva se_handler + .rva .Lbase2_64_avx_body,.Lbase2_64_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx_3: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx_body,.Ldo_avx_epilogue # HandlerData[] + +.LSEH_info_poly1305_emit_avx: + .byte 9,0,0,0 + .rva se_handler + .rva .LSEH_begin_poly1305_emit_avx,.LSEH_begin_poly1305_emit_avx +___ +$code.=<<___ if ($avx>1); +.LSEH_info_poly1305_blocks_avx2_1: + .byte 9,0,0,0 + .rva se_handler + .rva .Lblocks_avx2_body,.Lblocks_avx2_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx2_2: + .byte 9,0,0,0 + .rva se_handler + .rva .Lbase2_64_avx2_body,.Lbase2_64_avx2_epilogue # HandlerData[] + +.LSEH_info_poly1305_blocks_avx2_3: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx2_body,.Ldo_avx2_epilogue # HandlerData[] +___ +$code.=<<___ if ($avx>2); +.LSEH_info_poly1305_blocks_avx512: + .byte 9,0,0,0 + .rva avx_handler + .rva .Ldo_avx512_body,.Ldo_avx512_epilogue # HandlerData[] +___ +} + +foreach (split('\n',$code)) { + s/\`([^\`]*)\`/eval($1)/ge; + s/%r([a-z]+)#d/%e$1/g; + s/%r([0-9]+)#d/%r$1d/g; + s/%x#%[yz]/%x/g or s/%y#%z/%y/g or s/%z#%[yz]/%z/g; + + print $_,"\n"; +} +close STDOUT; diff --git a/crypto/make_poly1305_x86.pl b/crypto/make_poly1305_x86.pl new file mode 100644 index 0000000..ec1efd9 --- /dev/null +++ b/crypto/make_poly1305_x86.pl @@ -0,0 +1,1815 @@ +#! /usr/bin/env perl +# Copyright 2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. 
You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + +# +# ==================================================================== +# Written by Andy Polyakov for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== +# +# This module implements Poly1305 hash for x86. +# +# April 2015 +# +# Numbers are cycles per processed byte with poly1305_blocks alone, +# measured with rdtsc at fixed clock frequency. +# +# IALU/gcc-3.4(*) SSE2(**) AVX2 +# Pentium 15.7/+80% - +# PIII 6.21/+90% - +# P4 19.8/+40% 3.24 +# Core 2 4.85/+90% 1.80 +# Westmere 4.58/+100% 1.43 +# Sandy Bridge 3.90/+100% 1.36 +# Haswell 3.88/+70% 1.18 0.72 +# Skylake 3.10/+60% 1.14 0.62 +# Silvermont 11.0/+40% 4.80 +# Goldmont 4.10/+200% 2.10 +# VIA Nano 6.71/+90% 2.47 +# Sledgehammer 3.51/+180% 4.27 +# Bulldozer 4.53/+140% 1.31 +# +# (*) gcc 4.8 for some reason generated worse code; +# (**) besides SSE2 there are floating-point and AVX options; FP +# is deemed unnecessary, because pre-SSE2 processor are too +# old to care about, while it's not the fastest option on +# SSE2-capable ones; AVX is omitted, because it doesn't give +# a lot of improvement, 5-10% depending on processor; + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../perlasm"); +require "x86asm.pl"; + +$output=pop; +open STDOUT,">$output"; + +&asm_init($ARGV[0],$ARGV[$#ARGV] eq "386"); + +$sse2=$avx=0; +for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } + +if ($sse2) { + &static_label("const_sse2"); + &static_label("enter_blocks"); + &static_label("enter_emit"); + &external_label("OPENSSL_ia32cap_P"); + + if (`$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1` + =~ /GNU assembler version ([2-9]\.[0-9]+)/) { + $avx = ($1>=2.19) + ($1>=2.22); + } + + if (!$avx && $ARGV[0] eq "win32n" && + `nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/) { + $avx = ($1>=2.09) + ($1>=2.10); + } + + if (!$avx && `$ENV{CC} -v 2>&1` =~ /(^clang version|based on LLVM) ([3-9]\.[0-9]+)/) { + $avx = ($2>=3.0) + ($2>3.0); + } +} + +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^32 +# unsigned __int32 pad; # is_base2_26 in vector context +# unsigned __int32 r[4]; # key value base 2^32 + +&align(64); +&function_begin("poly1305_init"); + &mov ("edi",&wparam(0)); # context + &mov ("esi",&wparam(1)); # key + &mov ("ebp",&wparam(2)); # function table + + &xor ("eax","eax"); + &mov (&DWP(4*0,"edi"),"eax"); # zero hash value + &mov (&DWP(4*1,"edi"),"eax"); + &mov (&DWP(4*2,"edi"),"eax"); + &mov (&DWP(4*3,"edi"),"eax"); + &mov (&DWP(4*4,"edi"),"eax"); + &mov (&DWP(4*5,"edi"),"eax"); # is_base2_26 + + &cmp ("esi",0); + &je (&label("nokey")); + + if ($sse2) { + &call (&label("pic_point")); + &set_label("pic_point"); + &blindpop("ebx"); + + &lea ("eax",&DWP("poly1305_blocks-".&label("pic_point"),"ebx")); + &lea ("edx",&DWP("poly1305_emit-".&label("pic_point"),"ebx")); + + &picmeup("edi","OPENSSL_ia32cap_P","ebx",&label("pic_point")); + &mov ("ecx",&DWP(0,"edi")); + &and ("ecx",1<<26|1<<24); + &cmp ("ecx",1<<26|1<<24); # SSE2 and XMM? 
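For reference, the opaque area described above maps onto the following C view, and the and-masks a few lines below are the standard Poly1305 clamping of r. A sketch only: the assembly addresses these fields by byte offset, and poly1305_ctx32/load_key are illustrative names:

#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t h[5];    /* current hash value, base 2^32 */
    uint32_t pad;     /* doubles as is_base2_26 in vector context */
    uint32_t r[4];    /* clamped key, base 2^32 */
} poly1305_ctx32;

/* poly1305_init zeroes h[0..4] and the flag, then stores the key with
 * the top 4 bits of every word and the low 2 bits of r[1..3] cleared. */
static void load_key(poly1305_ctx32 *ctx, const uint8_t key[16])
{
    memset(ctx->h, 0, sizeof(ctx->h));
    ctx->pad = 0;
    memcpy(ctx->r, key, 16);          /* little-endian words */
    ctx->r[0] &= 0x0fffffff;
    ctx->r[1] &= 0x0ffffffc;
    ctx->r[2] &= 0x0ffffffc;
    ctx->r[3] &= 0x0ffffffc;
}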
+ &jne (&label("no_sse2")); + + &lea ("eax",&DWP("_poly1305_blocks_sse2-".&label("pic_point"),"ebx")); + &lea ("edx",&DWP("_poly1305_emit_sse2-".&label("pic_point"),"ebx")); + + if ($avx>1) { + &mov ("ecx",&DWP(8,"edi")); + &test ("ecx",1<<5); # AVX2? + &jz (&label("no_sse2")); + + &lea ("eax",&DWP("_poly1305_blocks_avx2-".&label("pic_point"),"ebx")); + } + &set_label("no_sse2"); + &mov ("edi",&wparam(0)); # reload context + &mov (&DWP(0,"ebp"),"eax"); # fill function table + &mov (&DWP(4,"ebp"),"edx"); + } + + &mov ("eax",&DWP(4*0,"esi")); # load input key + &mov ("ebx",&DWP(4*1,"esi")); + &mov ("ecx",&DWP(4*2,"esi")); + &mov ("edx",&DWP(4*3,"esi")); + &and ("eax",0x0fffffff); + &and ("ebx",0x0ffffffc); + &and ("ecx",0x0ffffffc); + &and ("edx",0x0ffffffc); + &mov (&DWP(4*6,"edi"),"eax"); + &mov (&DWP(4*7,"edi"),"ebx"); + &mov (&DWP(4*8,"edi"),"ecx"); + &mov (&DWP(4*9,"edi"),"edx"); + + &mov ("eax",$sse2); +&set_label("nokey"); +&function_end("poly1305_init"); + +($h0,$h1,$h2,$h3,$h4, + $d0,$d1,$d2,$d3, + $r0,$r1,$r2,$r3, + $s1,$s2,$s3)=map(4*$_,(0..15)); + +&function_begin("poly1305_blocks"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len +&set_label("enter_blocks"); + &and ("ecx",-15); + &jz (&label("nodata")); + + &stack_push(16); + &mov ("eax",&DWP(4*6,"edi")); # r0 + &mov ("ebx",&DWP(4*7,"edi")); # r1 + &lea ("ebp",&DWP(0,"esi","ecx")); # end of input + &mov ("ecx",&DWP(4*8,"edi")); # r2 + &mov ("edx",&DWP(4*9,"edi")); # r3 + + &mov (&wparam(2),"ebp"); + &mov ("ebp","esi"); + + &mov (&DWP($r0,"esp"),"eax"); # r0 + &mov ("eax","ebx"); + &shr ("eax",2); + &mov (&DWP($r1,"esp"),"ebx"); # r1 + &add ("eax","ebx"); # s1 + &mov ("ebx","ecx"); + &shr ("ebx",2); + &mov (&DWP($r2,"esp"),"ecx"); # r2 + &add ("ebx","ecx"); # s2 + &mov ("ecx","edx"); + &shr ("ecx",2); + &mov (&DWP($r3,"esp"),"edx"); # r3 + &add ("ecx","edx"); # s3 + &mov (&DWP($s1,"esp"),"eax"); # s1 + &mov (&DWP($s2,"esp"),"ebx"); # s2 + &mov (&DWP($s3,"esp"),"ecx"); # s3 + + &mov ("eax",&DWP(4*0,"edi")); # load hash value + &mov ("ebx",&DWP(4*1,"edi")); + &mov ("ecx",&DWP(4*2,"edi")); + &mov ("esi",&DWP(4*3,"edi")); + &mov ("edi",&DWP(4*4,"edi")); + &jmp (&label("loop")); + +&set_label("loop",32); + &add ("eax",&DWP(4*0,"ebp")); # accumulate input + &adc ("ebx",&DWP(4*1,"ebp")); + &adc ("ecx",&DWP(4*2,"ebp")); + &adc ("esi",&DWP(4*3,"ebp")); + &lea ("ebp",&DWP(4*4,"ebp")); + &adc ("edi",&wparam(3)); # padbit + + &mov (&DWP($h0,"esp"),"eax"); # put aside hash[+inp] + &mov (&DWP($h3,"esp"),"esi"); + + &mul (&DWP($r0,"esp")); # h0*r0 + &mov (&DWP($h4,"esp"),"edi"); + &mov ("edi","eax"); + &mov ("eax","ebx"); # h1 + &mov ("esi","edx"); + &mul (&DWP($s3,"esp")); # h1*s3 + &add ("edi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("esi","edx"); + &mul (&DWP($s2,"esp")); # h2*s2 + &add ("edi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("esi","edx"); + &mul (&DWP($s1,"esp")); # h3*s1 + &add ("edi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("esi","edx"); + + &mul (&DWP($r1,"esp")); # h0*r1 + &mov (&DWP($d0,"esp"),"edi"); + &xor ("edi","edi"); + &add ("esi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("edi","edx"); + &mul (&DWP($r0,"esp")); # h1*r0 + &add ("esi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("edi","edx"); + &mul (&DWP($s3,"esp")); # h2*s3 + &add ("esi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("edi","edx"); + &mul (&DWP($s2,"esp")); # h3*s2 + &add ("esi","eax"); + &mov ("eax",&DWP($h4,"esp")); + &adc ("edi","edx"); + &imul ("eax",&DWP($s1,"esp")); # h4*s1 + &add 
("esi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("edi",0); + + &mul (&DWP($r2,"esp")); # h0*r2 + &mov (&DWP($d1,"esp"),"esi"); + &xor ("esi","esi"); + &add ("edi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("esi","edx"); + &mul (&DWP($r1,"esp")); # h1*r1 + &add ("edi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("esi","edx"); + &mul (&DWP($r0,"esp")); # h2*r0 + &add ("edi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("esi","edx"); + &mul (&DWP($s3,"esp")); # h3*s3 + &add ("edi","eax"); + &mov ("eax",&DWP($h4,"esp")); + &adc ("esi","edx"); + &imul ("eax",&DWP($s2,"esp")); # h4*s2 + &add ("edi","eax"); + &mov ("eax",&DWP($h0,"esp")); + &adc ("esi",0); + + &mul (&DWP($r3,"esp")); # h0*r3 + &mov (&DWP($d2,"esp"),"edi"); + &xor ("edi","edi"); + &add ("esi","eax"); + &mov ("eax","ebx"); # h1 + &adc ("edi","edx"); + &mul (&DWP($r2,"esp")); # h1*r2 + &add ("esi","eax"); + &mov ("eax","ecx"); # h2 + &adc ("edi","edx"); + &mul (&DWP($r1,"esp")); # h2*r1 + &add ("esi","eax"); + &mov ("eax",&DWP($h3,"esp")); + &adc ("edi","edx"); + &mul (&DWP($r0,"esp")); # h3*r0 + &add ("esi","eax"); + &mov ("ecx",&DWP($h4,"esp")); + &adc ("edi","edx"); + + &mov ("edx","ecx"); + &imul ("ecx",&DWP($s3,"esp")); # h4*s3 + &add ("esi","ecx"); + &mov ("eax",&DWP($d0,"esp")); + &adc ("edi",0); + + &imul ("edx",&DWP($r0,"esp")); # h4*r0 + &add ("edx","edi"); + + &mov ("ebx",&DWP($d1,"esp")); + &mov ("ecx",&DWP($d2,"esp")); + + &mov ("edi","edx"); # last reduction step + &shr ("edx",2); + &and ("edi",3); + &lea ("edx",&DWP(0,"edx","edx",4)); # *5 + &add ("eax","edx"); + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("esi",0); + &adc ("edi",0); + + &cmp ("ebp",&wparam(2)); # done yet? + &jne (&label("loop")); + + &mov ("edx",&wparam(0)); # ctx + &stack_pop(16); + &mov (&DWP(4*0,"edx"),"eax"); # store hash value + &mov (&DWP(4*1,"edx"),"ebx"); + &mov (&DWP(4*2,"edx"),"ecx"); + &mov (&DWP(4*3,"edx"),"esi"); + &mov (&DWP(4*4,"edx"),"edi"); +&set_label("nodata"); +&function_end("poly1305_blocks"); + +&function_begin("poly1305_emit"); + &mov ("ebp",&wparam(0)); # context +&set_label("enter_emit"); + &mov ("edi",&wparam(1)); # output + &mov ("eax",&DWP(4*0,"ebp")); # load hash value + &mov ("ebx",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("esi",&DWP(4*4,"ebp")); + + &add ("eax",5); # compare to modulus + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("edx",0); + &adc ("esi",0); + &shr ("esi",2); # did it carry/borrow? + &neg ("esi"); # do we choose hash-modulus? + + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ebx"); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); + + ¬ ("esi"); # or original hash value? 
+ &mov ("eax",&DWP(4*0,"ebp")); + &mov ("ebx",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("ebp",&wparam(2)); + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &or ("eax",&DWP(4*0,"edi")); + &or ("ebx",&DWP(4*1,"edi")); + &or ("ecx",&DWP(4*2,"edi")); + &or ("edx",&DWP(4*3,"edi")); + + &add ("eax",&DWP(4*0,"ebp")); # accumulate key + &adc ("ebx",&DWP(4*1,"ebp")); + &adc ("ecx",&DWP(4*2,"ebp")); + &adc ("edx",&DWP(4*3,"ebp")); + + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ebx"); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); +&function_end("poly1305_emit"); + +if ($sse2) { +######################################################################## +# Layout of opaque area is following. +# +# unsigned __int32 h[5]; # current hash value base 2^26 +# unsigned __int32 is_base2_26; +# unsigned __int32 r[4]; # key value base 2^32 +# unsigned __int32 pad[2]; +# struct { unsigned __int32 r^4, r^3, r^2, r^1; } r[9]; +# +# where r^n are base 2^26 digits of degrees of multiplier key. There are +# 5 digits, but last four are interleaved with multiples of 5, totalling +# in 9 elements: r0, r1, 5*r1, r2, 5*r2, r3, 5*r3, r4, 5*r4. + +my ($D0,$D1,$D2,$D3,$D4,$T0,$T1,$T2)=map("xmm$_",(0..7)); +my $MASK=$T2; # borrow and keep in mind + +&align (32); +&function_begin_B("_poly1305_init_sse2"); + &movdqu ($D4,&QWP(4*6,"edi")); # key base 2^32 + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &mov ("ebp","esp"); + &sub ("esp",16*(9+5)); + &and ("esp",-16); + + #&pand ($D4,&QWP(96,"ebx")); # magic mask + &movq ($MASK,&QWP(64,"ebx")); + + &movdqa ($D0,$D4); + &movdqa ($D1,$D4); + &movdqa ($D2,$D4); + + &pand ($D0,$MASK); # -> base 2^26 + &psrlq ($D1,26); + &psrldq ($D2,6); + &pand ($D1,$MASK); + &movdqa ($D3,$D2); + &psrlq ($D2,4) + &psrlq ($D3,30); + &pand ($D2,$MASK); + &pand ($D3,$MASK); + &psrldq ($D4,13); + + &lea ("edx",&DWP(16*9,"esp")); # size optimization + &mov ("ecx",2); +&set_label("square"); + &movdqa (&QWP(16*0,"esp"),$D0); + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*2,"esp"),$D2); + &movdqa (&QWP(16*3,"esp"),$D3); + &movdqa (&QWP(16*4,"esp"),$D4); + + &movdqa ($T1,$D1); + &movdqa ($T0,$D2); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D1); # *5 + &paddd ($T0,$D2); # *5 + &movdqa (&QWP(16*5,"esp"),$T1); + &movdqa (&QWP(16*6,"esp"),$T0); + &movdqa ($T1,$D3); + &movdqa ($T0,$D4); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D3); # *5 + &paddd ($T0,$D4); # *5 + &movdqa (&QWP(16*7,"esp"),$T1); + &movdqa (&QWP(16*8,"esp"),$T0); + + &pshufd ($T1,$D0,0b01000100); + &movdqa ($T0,$D1); + &pshufd ($D1,$D1,0b01000100); + &pshufd ($D2,$D2,0b01000100); + &pshufd ($D3,$D3,0b01000100); + &pshufd ($D4,$D4,0b01000100); + &movdqa (&QWP(16*0,"edx"),$T1); + &movdqa (&QWP(16*1,"edx"),$D1); + &movdqa (&QWP(16*2,"edx"),$D2); + &movdqa (&QWP(16*3,"edx"),$D3); + &movdqa (&QWP(16*4,"edx"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($D4,$D0); # h4*r0 + &pmuludq ($D3,$D0); # h3*r0 + &pmuludq ($D2,$D0); # h2*r0 + &pmuludq ($D1,$D0); # h1*r0 + &pmuludq ($D0,$T1); # h0*r0 + +sub pmuladd { +my $load = shift; +my $base = shift; $base = "esp" if (!defined($base)); + + 
################################################################ + # As for choice to "rotate" $T0-$T2 in order to move paddq + # past next multiplication. While it makes code harder to read + # and doesn't have significant effect on most processors, it + # makes a lot of difference on Atom, up to 30% improvement. + + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*3,$base)); # r1*h3 + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*2,$base)); # r1*h2 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*1,$base)); # r1*h1 + &paddq ($D3,$T1); + &$load ($T1,5); # s1 + &pmuludq ($T0,&QWP(16*0,$base)); # r1*h0 + &paddq ($D2,$T2); + &pmuludq ($T1,&QWP(16*4,$base)); # s1*h4 + &$load ($T2,2); # r2^n + &paddq ($D1,$T0); + + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*2,$base)); # r2*h2 + &paddq ($D0,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*1,$base)); # r2*h1 + &paddq ($D4,$T2); + &$load ($T2,6); # s2^n + &pmuludq ($T1,&QWP(16*0,$base)); # r2*h0 + &paddq ($D3,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*4,$base)); # s2*h4 + &paddq ($D2,$T1); + &pmuludq ($T0,&QWP(16*3,$base)); # s2*h3 + &$load ($T1,3); # r3^n + &paddq ($D1,$T2); + + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*1,$base)); # r3*h1 + &paddq ($D0,$T0); + &$load ($T0,7); # s3^n + &pmuludq ($T2,&QWP(16*0,$base)); # r3*h0 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*4,$base)); # s3*h4 + &paddq ($D3,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*3,$base)); # s3*h3 + &paddq ($D2,$T0); + &pmuludq ($T2,&QWP(16*2,$base)); # s3*h2 + &$load ($T0,4); # r4^n + &paddq ($D1,$T1); + + &$load ($T1,8); # s4^n + &pmuludq ($T0,&QWP(16*0,$base)); # r4*h0 + &paddq ($D0,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&QWP(16*4,$base)); # s4*h4 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&QWP(16*1,$base)); # s4*h1 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&QWP(16*2,$base)); # s4*h2 + &paddq ($D0,$T2); + &pmuludq ($T1,&QWP(16*3,$base)); # s4*h3 + &movdqa ($MASK,&QWP(64,"ebx")); + &paddq ($D1,$T0); + &paddq ($D2,$T1); +} + &pmuladd (sub { my ($reg,$i)=@_; + &movdqa ($reg,&QWP(16*$i,"esp")); + },"edx"); + +sub lazy_reduction { +my $extra = shift; + + ################################################################ + # lazy reduction as discussed in "NEON crypto" by D.J. Bernstein + # and P. 
Schwabe + # + # [(*) see discussion in poly1305-armv4 module] + + &movdqa ($T0,$D3); + &pand ($D3,$MASK); + &psrlq ($T0,26); + &$extra () if (defined($extra)); + &paddq ($T0,$D4); # h3 -> h4 + &movdqa ($T1,$D0); + &pand ($D0,$MASK); + &psrlq ($T1,26); + &movdqa ($D4,$T0); + &paddq ($T1,$D1); # h0 -> h1 + &psrlq ($T0,26); + &pand ($D4,$MASK); + &movdqa ($D1,$T1); + &psrlq ($T1,26); + &paddd ($D0,$T0); # favour paddd when + # possible, because + # paddq is "broken" + # on Atom + &psllq ($T0,2); + &paddq ($T1,$D2); # h1 -> h2 + &paddq ($T0,$D0); # h4 -> h0 (*) + &pand ($D1,$MASK); + &movdqa ($D2,$T1); + &psrlq ($T1,26); + &pand ($D2,$MASK); + &paddd ($T1,$D3); # h2 -> h3 + &movdqa ($D0,$T0); + &psrlq ($T0,26); + &movdqa ($D3,$T1); + &psrlq ($T1,26); + &pand ($D0,$MASK); + &paddd ($D1,$T0); # h0 -> h1 + &pand ($D3,$MASK); + &paddd ($D4,$T1); # h3 -> h4 +} + &lazy_reduction (); + + &dec ("ecx"); + &jz (&label("square_break")); + + &punpcklqdq ($D0,&QWP(16*0,"esp")); # 0:r^1:0:r^2 + &punpcklqdq ($D1,&QWP(16*1,"esp")); + &punpcklqdq ($D2,&QWP(16*2,"esp")); + &punpcklqdq ($D3,&QWP(16*3,"esp")); + &punpcklqdq ($D4,&QWP(16*4,"esp")); + &jmp (&label("square")); + +&set_label("square_break"); + &psllq ($D0,32); # -> r^3:0:r^4:0 + &psllq ($D1,32); + &psllq ($D2,32); + &psllq ($D3,32); + &psllq ($D4,32); + &por ($D0,&QWP(16*0,"esp")); # r^3:r^1:r^4:r^2 + &por ($D1,&QWP(16*1,"esp")); + &por ($D2,&QWP(16*2,"esp")); + &por ($D3,&QWP(16*3,"esp")); + &por ($D4,&QWP(16*4,"esp")); + + &pshufd ($D0,$D0,0b10001101); # -> r^1:r^2:r^3:r^4 + &pshufd ($D1,$D1,0b10001101); + &pshufd ($D2,$D2,0b10001101); + &pshufd ($D3,$D3,0b10001101); + &pshufd ($D4,$D4,0b10001101); + + &movdqu (&QWP(16*0,"edi"),$D0); # save the table + &movdqu (&QWP(16*1,"edi"),$D1); + &movdqu (&QWP(16*2,"edi"),$D2); + &movdqu (&QWP(16*3,"edi"),$D3); + &movdqu (&QWP(16*4,"edi"),$D4); + + &movdqa ($T1,$D1); + &movdqa ($T0,$D2); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D1); # *5 + &paddd ($T0,$D2); # *5 + &movdqu (&QWP(16*5,"edi"),$T1); + &movdqu (&QWP(16*6,"edi"),$T0); + &movdqa ($T1,$D3); + &movdqa ($T0,$D4); + &pslld ($T1,2); + &pslld ($T0,2); + &paddd ($T1,$D3); # *5 + &paddd ($T0,$D4); # *5 + &movdqu (&QWP(16*7,"edi"),$T1); + &movdqu (&QWP(16*8,"edi"),$T0); + + &mov ("esp","ebp"); + &lea ("edi",&DWP(-16*3,"edi")); # size de-optimization + &ret (); +&function_end_B("_poly1305_init_sse2"); + +&align (32); +&function_begin("_poly1305_blocks_sse2"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len + + &mov ("eax",&DWP(4*5,"edi")); # is_base2_26 + &and ("ecx",-16); + &jz (&label("nodata")); + &cmp ("ecx",64); + &jae (&label("enter_sse2")); + &test ("eax","eax"); # is_base2_26? + &jz (&label("enter_blocks")); + +&set_label("enter_sse2",16); + &call (&label("pic_point")); +&set_label("pic_point"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("const_sse2")."-".&label("pic_point"),"ebx")); + + &test ("eax","eax"); # is_base2_26? 
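The lazy_reduction sub implements the carry schedule from the cited paper: limbs are allowed to grow a few bits past 2^26 between multiplications, and one staggered pass of carries, with the h4 overflow folded back into h0 times 5, brings them close to canonical. A scalar C rendition of the exact chain order used above (a sketch; the vector code runs several such chains in parallel and favours paddd over paddq where it can):

#include <stdint.h>

static void lazy_reduce26(uint64_t h[5])
{
    const uint64_t M = ((uint64_t)1 << 26) - 1;
    uint64_t c;
    c = h[3] >> 26; h[3] &= M; h[4] += c;             /* h3 -> h4 */
    c = h[0] >> 26; h[0] &= M; h[1] += c;             /* h0 -> h1 */
    c = h[4] >> 26; h[4] &= M; h[0] += c + (c << 2);  /* h4 -> h0 (*5) */
    c = h[1] >> 26; h[1] &= M; h[2] += c;             /* h1 -> h2 */
    c = h[2] >> 26; h[2] &= M; h[3] += c;             /* h2 -> h3 */
    c = h[0] >> 26; h[0] &= M; h[1] += c;             /* h0 -> h1 */
    c = h[3] >> 26; h[3] &= M; h[4] += c;             /* h3 -> h4 */
}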
+ &jnz (&label("base2_26")); + + &call ("_poly1305_init_sse2"); + + ################################################# base 2^32 -> base 2^26 + &mov ("eax",&DWP(0,"edi")); + &mov ("ecx",&DWP(3,"edi")); + &mov ("edx",&DWP(6,"edi")); + &mov ("esi",&DWP(9,"edi")); + &mov ("ebp",&DWP(13,"edi")); + &mov (&DWP(4*5,"edi"),1); # is_base2_26 + + &shr ("ecx",2); + &and ("eax",0x3ffffff); + &shr ("edx",4); + &and ("ecx",0x3ffffff); + &shr ("esi",6); + &and ("edx",0x3ffffff); + + &movd ($D0,"eax"); + &movd ($D1,"ecx"); + &movd ($D2,"edx"); + &movd ($D3,"esi"); + &movd ($D4,"ebp"); + + &mov ("esi",&wparam(1)); # [reload] inp + &mov ("ecx",&wparam(2)); # [reload] len + &jmp (&label("base2_32")); + +&set_label("base2_26",16); + &movd ($D0,&DWP(4*0,"edi")); # load hash value + &movd ($D1,&DWP(4*1,"edi")); + &movd ($D2,&DWP(4*2,"edi")); + &movd ($D3,&DWP(4*3,"edi")); + &movd ($D4,&DWP(4*4,"edi")); + &movdqa ($MASK,&QWP(64,"ebx")); + +&set_label("base2_32"); + &mov ("eax",&wparam(3)); # padbit + &mov ("ebp","esp"); + + &sub ("esp",16*(5+5+5+9+9)); + &and ("esp",-16); + + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &shl ("eax",24); # padbit + + &test ("ecx",31); + &jz (&label("even")); + + ################################################################ + # process single block, with SSE2, because it's still faster + # even though half of result is discarded + + &movdqu ($T1,&QWP(0,"esi")); # input + &lea ("esi",&DWP(16,"esi")); + + &movdqa ($T0,$T1); # -> base 2^26 ... + &pand ($T1,$MASK); + &paddd ($D0,$T1); # ... and accumulate + + &movdqa ($T1,$T0); + &psrlq ($T0,26); + &psrldq ($T1,6); + &pand ($T0,$MASK); + &paddd ($D1,$T0); + + &movdqa ($T0,$T1); + &psrlq ($T1,4); + &pand ($T1,$MASK); + &paddd ($D2,$T1); + + &movdqa ($T1,$T0); + &psrlq ($T0,30); + &pand ($T0,$MASK); + &psrldq ($T1,7); + &paddd ($D3,$T0); + + &movd ($T0,"eax"); # padbit + &paddd ($D4,$T1); + &movd ($T1,&DWP(16*0+12,"edi")); # r0 + &paddd ($D4,$T0); + + &movdqa (&QWP(16*0,"esp"),$D0); + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*2,"esp"),$D2); + &movdqa (&QWP(16*3,"esp"),$D3); + &movdqa (&QWP(16*4,"esp"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($D0,$T1); # h4*r0 + &pmuludq ($D1,$T1); # h3*r0 + &pmuludq ($D2,$T1); # h2*r0 + &movd ($T0,&DWP(16*1+12,"edi")); # r1 + &pmuludq ($D3,$T1); # h1*r0 + &pmuludq ($D4,$T1); # h0*r0 + + &pmuladd (sub { my ($reg,$i)=@_; + &movd ($reg,&DWP(16*$i+12,"edi")); + }); + + &lazy_reduction (); + + &sub ("ecx",16); + &jz (&label("done")); + +&set_label("even"); + &lea ("edx",&DWP(16*(5+5+5+9),"esp"));# size optimization + &lea ("eax",&DWP(-16*2,"esi")); + &sub ("ecx",64); + + ################################################################ + # expand and copy pre-calculated table to stack + + &movdqu ($T0,&QWP(16*0,"edi")); # r^1:r^2:r^3:r^4 + &pshufd ($T1,$T0,0b01000100); # duplicate r^3:r^4 + &cmovb ("esi","eax"); + &pshufd ($T0,$T0,0b11101110); # duplicate r^1:r^2 + &movdqa (&QWP(16*0,"edx"),$T1); + &lea ("eax",&DWP(16*10,"esp")); + &movdqu ($T1,&QWP(16*1,"edi")); + &movdqa (&QWP(16*(0-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*1,"edx"),$T0); + &movdqu ($T0,&QWP(16*2,"edi")); + &movdqa (&QWP(16*(1-9),"edx"),$T1); + &pshufd 
($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*2,"edx"),$T1); + &movdqu ($T1,&QWP(16*3,"edi")); + &movdqa (&QWP(16*(2-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*3,"edx"),$T0); + &movdqu ($T0,&QWP(16*4,"edi")); + &movdqa (&QWP(16*(3-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*4,"edx"),$T1); + &movdqu ($T1,&QWP(16*5,"edi")); + &movdqa (&QWP(16*(4-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*5,"edx"),$T0); + &movdqu ($T0,&QWP(16*6,"edi")); + &movdqa (&QWP(16*(5-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*6,"edx"),$T1); + &movdqu ($T1,&QWP(16*7,"edi")); + &movdqa (&QWP(16*(6-9),"edx"),$T0); + &pshufd ($T0,$T1,0b01000100); + &pshufd ($T1,$T1,0b11101110); + &movdqa (&QWP(16*7,"edx"),$T0); + &movdqu ($T0,&QWP(16*8,"edi")); + &movdqa (&QWP(16*(7-9),"edx"),$T1); + &pshufd ($T1,$T0,0b01000100); + &pshufd ($T0,$T0,0b11101110); + &movdqa (&QWP(16*8,"edx"),$T1); + &movdqa (&QWP(16*(8-9),"edx"),$T0); + +sub load_input { +my ($inpbase,$offbase)=@_; + + &movdqu ($T0,&QWP($inpbase+0,"esi")); # load input + &movdqu ($T1,&QWP($inpbase+16,"esi")); + &lea ("esi",&DWP(16*2,"esi")); + + &movdqa (&QWP($offbase+16*2,"esp"),$D2); + &movdqa (&QWP($offbase+16*3,"esp"),$D3); + &movdqa (&QWP($offbase+16*4,"esp"),$D4); + + &movdqa ($D2,$T0); # splat input + &movdqa ($D3,$T1); + &psrldq ($D2,6); + &psrldq ($D3,6); + &movdqa ($D4,$T0); + &punpcklqdq ($D2,$D3); # 2:3 + &punpckhqdq ($D4,$T1); # 4 + &punpcklqdq ($T0,$T1); # 0:1 + + &movdqa ($D3,$D2); + &psrlq ($D2,4); + &psrlq ($D3,30); + &movdqa ($T1,$T0); + &psrlq ($D4,40); # 4 + &psrlq ($T1,26); + &pand ($T0,$MASK); # 0 + &pand ($T1,$MASK); # 1 + &pand ($D2,$MASK); # 2 + &pand ($D3,$MASK); # 3 + &por ($D4,&QWP(0,"ebx")); # padbit, yes, always + + &movdqa (&QWP($offbase+16*0,"esp"),$D0) if ($offbase); + &movdqa (&QWP($offbase+16*1,"esp"),$D1) if ($offbase); +} + &load_input (16*2,16*5); + + &jbe (&label("skip_loop")); + &jmp (&label("loop")); + +&set_label("loop",32); + ################################################################ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r + # \___________________/ + # ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2 + # ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r + # \___________________/ \____________________/ + ################################################################ + + &movdqa ($T2,&QWP(16*(0-9),"edx")); # r0^2 + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 = h4*r0 + h0*r4 + h1*r3 + h2*r2 + h3*r1 + # d3 = h3*r0 + h0*r3 + h1*r2 + h2*r1 + h4*5*r4 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + # d1 = h1*r0 + h0*r1 + h2*5*r4 + h3*5*r3 + h4*5*r2 + # d0 = h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1 + + &movdqa ($D1,$T0); + &pmuludq ($T0,$T2); # h0*r0 + &movdqa ($D0,$T1); + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &pmuludq ($D3,$T2); # h3*r0 + &pmuludq ($D4,$T2); # h4*r0 + +sub pmuladd_alt { +my $addr = shift; + + &pmuludq ($D0,&$addr(8)); # h1*s4 + &movdqa ($T2,$D1); + &pmuludq ($D1,&$addr(1)); # h0*r1 + &paddq ($D0,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(2)); # h0*r2 + &paddq ($D1,$T1); + &movdqa ($T1,$T0); + 
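The load_input sub above splits two 16-byte blocks into five radix-2^26 limb vectors; per block the limb boundaries are bits 0/26/52/78/104, and the pad bit lands 24 bits above the top limb, which is why the scalar paths shift padbit left by 24. A one-block C sketch of the same splat (illustrative; assumes little-endian loads; splat_block is a made-up name):

#include <stdint.h>
#include <string.h>

/* One 16-byte block -> five base-2^26 limbs plus the 2^128 pad bit. */
static void splat_block(const uint8_t in[16], uint32_t padbit, uint32_t t[5])
{
    uint64_t lo, hi;
    memcpy(&lo, in + 0, 8);    /* little-endian host assumed */
    memcpy(&hi, in + 8, 8);
    t[0] = (uint32_t)lo                        & 0x3ffffff;
    t[1] = (uint32_t)(lo >> 26)                & 0x3ffffff;
    t[2] = (uint32_t)((lo >> 52) | (hi << 12)) & 0x3ffffff;
    t[3] = (uint32_t)(hi >> 14)                & 0x3ffffff;
    t[4] = (uint32_t)(hi >> 40) | (padbit << 24);
}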
&pmuludq ($T0,&$addr(3)); # h0*r3 + &paddq ($D2,$T2); + &movdqa ($T2,&QWP(16*1,"eax")); # pull h1 + &pmuludq ($T1,&$addr(4)); # h0*r4 + &paddq ($D3,$T0); + + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(1)); # h1*r1 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(2)); # h1*r2 + &paddq ($D2,$T2); + &movdqa ($T2,&QWP(16*2,"eax")); # pull h2 + &pmuludq ($T1,&$addr(3)); # h1*r3 + &paddq ($D3,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(7)); # h2*s3 + &paddq ($D4,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(8)); # h2*s4 + &paddq ($D0,$T2); + + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(1)); # h2*r1 + &paddq ($D1,$T0); + &movdqa ($T0,&QWP(16*3,"eax")); # pull h3 + &pmuludq ($T2,&$addr(2)); # h2*r2 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(6)); # h3*s2 + &paddq ($D4,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(7)); # h3*s3 + &paddq ($D0,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(8)); # h3*s4 + &paddq ($D1,$T1); + + &movdqa ($T1,&QWP(16*4,"eax")); # pull h4 + &pmuludq ($T0,&$addr(1)); # h3*r1 + &paddq ($D2,$T2); + &movdqa ($T2,$T1); + &pmuludq ($T1,&$addr(8)); # h4*s4 + &paddq ($D4,$T0); + &movdqa ($T0,$T2); + &pmuludq ($T2,&$addr(5)); # h4*s1 + &paddq ($D3,$T1); + &movdqa ($T1,$T0); + &pmuludq ($T0,&$addr(6)); # h4*s2 + &paddq ($D0,$T2); + &movdqa ($MASK,&QWP(64,"ebx")); + &pmuludq ($T1,&$addr(7)); # h4*s3 + &paddq ($D1,$T0); + &paddq ($D2,$T1); +} + &pmuladd_alt (sub { my $i=shift; &QWP(16*($i-9),"edx"); }); + + &load_input (-16*2,0); + &lea ("eax",&DWP(-16*2,"esi")); + &sub ("ecx",64); + + &paddd ($T0,&QWP(16*(5+0),"esp")); # add hash value + &paddd ($T1,&QWP(16*(5+1),"esp")); + &paddd ($D2,&QWP(16*(5+2),"esp")); + &paddd ($D3,&QWP(16*(5+3),"esp")); + &paddd ($D4,&QWP(16*(5+4),"esp")); + + &cmovb ("esi","eax"); + &lea ("eax",&DWP(16*10,"esp")); + + &movdqa ($T2,&QWP(16*0,"edx")); # r0^4 + &movdqa (&QWP(16*1,"esp"),$D1); + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 += h4*r0 + h0*r4 + h1*r3 + h2*r2 + h3*r1 + # d3 += h3*r0 + h0*r3 + h1*r2 + h2*r1 + h4*5*r4 + # d2 += h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + # d1 += h1*r0 + h0*r1 + h2*5*r4 + h3*5*r3 + h4*5*r2 + # d0 += h0*r0 + h1*5*r4 + h2*5*r3 + h3*5*r2 + h4*5*r1 + + &movdqa ($D1,$T0); + &pmuludq ($T0,$T2); # h0*r0 + &paddq ($T0,$D0); + &movdqa ($D0,$T1); + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &pmuludq ($D3,$T2); # h3*r0 + &pmuludq ($D4,$T2); # h4*r0 + + &paddq ($T1,&QWP(16*1,"esp")); + &paddq ($D2,&QWP(16*2,"esp")); + &paddq ($D3,&QWP(16*3,"esp")); + &paddq ($D4,&QWP(16*4,"esp")); + + &pmuladd_alt (sub { my $i=shift; &QWP(16*$i,"edx"); }); + + &lazy_reduction (); + + &load_input (16*2,16*5); + + &ja (&label("loop")); + +&set_label("skip_loop"); + ################################################################ + # multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1 + + &pshufd ($T2,&QWP(16*(0-9),"edx"),0x10);# r0^n + &add ("ecx",32); + &jnz (&label("long_tail")); + + &paddd ($T0,$D0); # add hash value + &paddd ($T1,$D1); + &paddd ($D2,&QWP(16*7,"esp")); + &paddd ($D3,&QWP(16*8,"esp")); + &paddd ($D4,&QWP(16*9,"esp")); + +&set_label("long_tail"); + + &movdqa (&QWP(16*0,"eax"),$T0); + &movdqa (&QWP(16*1,"eax"),$T1); + &movdqa (&QWP(16*2,"eax"),$D2); + &movdqa (&QWP(16*3,"eax"),$D3); + &movdqa (&QWP(16*4,"eax"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + 
h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &pmuludq ($T0,$T2); # h0*r0 + &pmuludq ($T1,$T2); # h1*r0 + &pmuludq ($D2,$T2); # h2*r0 + &movdqa ($D0,$T0); + &pshufd ($T0,&QWP(16*(1-9),"edx"),0x10);# r1^n + &pmuludq ($D3,$T2); # h3*r0 + &movdqa ($D1,$T1); + &pmuludq ($D4,$T2); # h4*r0 + + &pmuladd (sub { my ($reg,$i)=@_; + &pshufd ($reg,&QWP(16*($i-9),"edx"),0x10); + },"eax"); + + &jz (&label("short_tail")); + + &load_input (-16*2,0); + + &pshufd ($T2,&QWP(16*0,"edx"),0x10); # r0^n + &paddd ($T0,&QWP(16*5,"esp")); # add hash value + &paddd ($T1,&QWP(16*6,"esp")); + &paddd ($D2,&QWP(16*7,"esp")); + &paddd ($D3,&QWP(16*8,"esp")); + &paddd ($D4,&QWP(16*9,"esp")); + + ################################################################ + # multiply inp[0:1] by r^4:r^3 and accumulate + + &movdqa (&QWP(16*0,"esp"),$T0); + &pmuludq ($T0,$T2); # h0*r0 + &movdqa (&QWP(16*1,"esp"),$T1); + &pmuludq ($T1,$T2); # h1*r0 + &paddq ($D0,$T0); + &movdqa ($T0,$D2); + &pmuludq ($D2,$T2); # h2*r0 + &paddq ($D1,$T1); + &movdqa ($T1,$D3); + &pmuludq ($D3,$T2); # h3*r0 + &paddq ($D2,&QWP(16*2,"esp")); + &movdqa (&QWP(16*2,"esp"),$T0); + &pshufd ($T0,&QWP(16*1,"edx"),0x10); # r1^n + &paddq ($D3,&QWP(16*3,"esp")); + &movdqa (&QWP(16*3,"esp"),$T1); + &movdqa ($T1,$D4); + &pmuludq ($D4,$T2); # h4*r0 + &paddq ($D4,&QWP(16*4,"esp")); + &movdqa (&QWP(16*4,"esp"),$T1); + + &pmuladd (sub { my ($reg,$i)=@_; + &pshufd ($reg,&QWP(16*$i,"edx"),0x10); + }); + +&set_label("short_tail"); + + ################################################################ + # horizontal addition + + &pshufd ($T1,$D4,0b01001110); + &pshufd ($T0,$D3,0b01001110); + &paddq ($D4,$T1); + &paddq ($D3,$T0); + &pshufd ($T1,$D0,0b01001110); + &pshufd ($T0,$D1,0b01001110); + &paddq ($D0,$T1); + &paddq ($D1,$T0); + &pshufd ($T1,$D2,0b01001110); + #&paddq ($D2,$T1); + + &lazy_reduction (sub { &paddq ($D2,$T1) }); + +&set_label("done"); + &movd (&DWP(-16*3+4*0,"edi"),$D0); # store hash value + &movd (&DWP(-16*3+4*1,"edi"),$D1); + &movd (&DWP(-16*3+4*2,"edi"),$D2); + &movd (&DWP(-16*3+4*3,"edi"),$D3); + &movd (&DWP(-16*3+4*4,"edi"),$D4); + &mov ("esp","ebp"); +&set_label("nodata"); +&function_end("_poly1305_blocks_sse2"); + +&align (32); +&function_begin("_poly1305_emit_sse2"); + &mov ("ebp",&wparam(0)); # context + + &cmp (&DWP(4*5,"ebp"),0); # is_base2_26? 
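The tail above multiplies (inp[0:1]+hash) by r^2:r^1, which is the identity behind the whole interleaved scheme: each lane advances by r^2 (or r^4) per iteration and is weighted by a decreasing power of r at the end, after which the horizontal addition collapses the lanes. A self-contained toy check of that identity, using a small prime in place of 2^130-5 (illustrative only):

#include <assert.h>
#include <stdint.h>

#define P 2147483647u   /* 2^31-1, toy stand-in for 2^130-5 */
static uint64_t mulmod(uint64_t a, uint64_t b) { return a * b % P; }

int main(void)
{
    uint64_t m[6] = {11, 22, 33, 44, 55, 66}, r = 123456789;

    uint64_t serial = 0;                   /* h = (h + m_i) * r */
    for (int i = 0; i < 6; i++)
        serial = mulmod(serial + m[i], r);

    uint64_t r2 = mulmod(r, r);            /* two lanes, stride 2 */
    uint64_t h0 = 0, h1 = 0;
    for (int i = 0; i < 6; i += 2) {
        h0 = (mulmod(h0, r2) + m[i])     % P;
        h1 = (mulmod(h1, r2) + m[i + 1]) % P;
    }
    /* final per-lane weights r^2 and r^1, as in the code above */
    assert((mulmod(h0, r2) + mulmod(h1, r)) % P == serial);
    return 0;
}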
+ &je (&label("enter_emit")); + + &mov ("eax",&DWP(4*0,"ebp")); # load hash value + &mov ("edi",&DWP(4*1,"ebp")); + &mov ("ecx",&DWP(4*2,"ebp")); + &mov ("edx",&DWP(4*3,"ebp")); + &mov ("esi",&DWP(4*4,"ebp")); + + &mov ("ebx","edi"); # base 2^26 -> base 2^32 + &shl ("edi",26); + &shr ("ebx",6); + &add ("eax","edi"); + &mov ("edi","ecx"); + &adc ("ebx",0); + + &shl ("edi",20); + &shr ("ecx",12); + &add ("ebx","edi"); + &mov ("edi","edx"); + &adc ("ecx",0); + + &shl ("edi",14); + &shr ("edx",18); + &add ("ecx","edi"); + &mov ("edi","esi"); + &adc ("edx",0); + + &shl ("edi",8); + &shr ("esi",24); + &add ("edx","edi"); + &adc ("esi",0); # can be partially reduced + + &mov ("edi","esi"); # final reduction + &and ("esi",3); + &shr ("edi",2); + &lea ("ebp",&DWP(0,"edi","edi",4)); # *5 + &mov ("edi",&wparam(1)); # output + &add ("eax","ebp"); + &mov ("ebp",&wparam(2)); # key + &adc ("ebx",0); + &adc ("ecx",0); + &adc ("edx",0); + &adc ("esi",0); + + &movd ($D0,"eax"); # offload original hash value + &add ("eax",5); # compare to modulus + &movd ($D1,"ebx"); + &adc ("ebx",0); + &movd ($D2,"ecx"); + &adc ("ecx",0); + &movd ($D3,"edx"); + &adc ("edx",0); + &adc ("esi",0); + &shr ("esi",2); # did it carry/borrow? + + &neg ("esi"); # do we choose (hash-modulus) ... + &and ("eax","esi"); + &and ("ebx","esi"); + &and ("ecx","esi"); + &and ("edx","esi"); + &mov (&DWP(4*0,"edi"),"eax"); + &movd ("eax",$D0); + &mov (&DWP(4*1,"edi"),"ebx"); + &movd ("ebx",$D1); + &mov (&DWP(4*2,"edi"),"ecx"); + &movd ("ecx",$D2); + &mov (&DWP(4*3,"edi"),"edx"); + &movd ("edx",$D3); + + &not ("esi"); # ... or original hash value? + &and ("eax","esi"); + &and ("ebx","esi"); + &or ("eax",&DWP(4*0,"edi")); + &and ("ecx","esi"); + &or ("ebx",&DWP(4*1,"edi")); + &and ("edx","esi"); + &or ("ecx",&DWP(4*2,"edi")); + &or ("edx",&DWP(4*3,"edi")); + + &add ("eax",&DWP(4*0,"ebp")); # accumulate key + &adc ("ebx",&DWP(4*1,"ebp")); + &mov (&DWP(4*0,"edi"),"eax"); + &adc ("ecx",&DWP(4*2,"ebp")); + &mov (&DWP(4*1,"edi"),"ebx"); + &adc ("edx",&DWP(4*3,"ebp")); + &mov (&DWP(4*2,"edi"),"ecx"); + &mov (&DWP(4*3,"edi"),"edx"); +&function_end("_poly1305_emit_sse2"); + +if ($avx>1) { +######################################################################## +# Note that poly1305_init_avx2 operates on %xmm, I could have used +# poly1305_init_sse2...
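The base 2^26 -> base 2^32 conversion in _poly1305_emit_sse2 above uses shl/shr pairs (26/6, 20/12, 14/18, 8/24), accumulating carries as it goes; the result may still need the *5 folding step noted in the code. The same repacking in C (a sketch; repack_26_to_32 is a made-up name):

#include <stdint.h>

/* Five (possibly not fully carried) base-2^26 limbs -> four 32-bit
 * words plus a small top word, mirroring the shl/shr pairs above. */
static void repack_26_to_32(const uint32_t t[5], uint32_t h[5])
{
    uint64_t w;
    w = (uint64_t)t[0] + ((uint64_t)t[1] << 26);
    h[0] = (uint32_t)w;
    w = (w >> 32) + (t[1] >> 6) + ((uint64_t)t[2] << 20);
    h[1] = (uint32_t)w;
    w = (w >> 32) + (t[2] >> 12) + ((uint64_t)t[3] << 14);
    h[2] = (uint32_t)w;
    w = (w >> 32) + (t[3] >> 18) + ((uint64_t)t[4] << 8);
    h[3] = (uint32_t)w;
    h[4] = (uint32_t)(w >> 32) + (t[4] >> 24);
}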
+ +&align (32); +&function_begin_B("_poly1305_init_avx2"); + &vmovdqu ($D4,&QWP(4*6,"edi")); # key base 2^32 + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &mov ("ebp","esp"); + &sub ("esp",16*(9+5)); + &and ("esp",-16); + + #&vpand ($D4,$D4,&QWP(96,"ebx")); # magic mask + &vmovdqa ($MASK,&QWP(64,"ebx")); + + &vpand ($D0,$D4,$MASK); # -> base 2^26 + &vpsrlq ($D1,$D4,26); + &vpsrldq ($D3,$D4,6); + &vpand ($D1,$D1,$MASK); + &vpsrlq ($D2,$D3,4) + &vpsrlq ($D3,$D3,30); + &vpand ($D2,$D2,$MASK); + &vpand ($D3,$D3,$MASK); + &vpsrldq ($D4,$D4,13); + + &lea ("edx",&DWP(16*9,"esp")); # size optimization + &mov ("ecx",2); +&set_label("square"); + &vmovdqa (&QWP(16*0,"esp"),$D0); + &vmovdqa (&QWP(16*1,"esp"),$D1); + &vmovdqa (&QWP(16*2,"esp"),$D2); + &vmovdqa (&QWP(16*3,"esp"),$D3); + &vmovdqa (&QWP(16*4,"esp"),$D4); + + &vpslld ($T1,$D1,2); + &vpslld ($T0,$D2,2); + &vpaddd ($T1,$T1,$D1); # *5 + &vpaddd ($T0,$T0,$D2); # *5 + &vmovdqa (&QWP(16*5,"esp"),$T1); + &vmovdqa (&QWP(16*6,"esp"),$T0); + &vpslld ($T1,$D3,2); + &vpslld ($T0,$D4,2); + &vpaddd ($T1,$T1,$D3); # *5 + &vpaddd ($T0,$T0,$D4); # *5 + &vmovdqa (&QWP(16*7,"esp"),$T1); + &vmovdqa (&QWP(16*8,"esp"),$T0); + + &vpshufd ($T0,$D0,0b01000100); + &vmovdqa ($T1,$D1); + &vpshufd ($D1,$D1,0b01000100); + &vpshufd ($D2,$D2,0b01000100); + &vpshufd ($D3,$D3,0b01000100); + &vpshufd ($D4,$D4,0b01000100); + &vmovdqa (&QWP(16*0,"edx"),$T0); + &vmovdqa (&QWP(16*1,"edx"),$D1); + &vmovdqa (&QWP(16*2,"edx"),$D2); + &vmovdqa (&QWP(16*3,"edx"),$D3); + &vmovdqa (&QWP(16*4,"edx"),$D4); + + ################################################################ + # d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4 + # d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4 + # d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4 + # d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4 + # d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4 + + &vpmuludq ($D4,$D4,$D0); # h4*r0 + &vpmuludq ($D3,$D3,$D0); # h3*r0 + &vpmuludq ($D2,$D2,$D0); # h2*r0 + &vpmuludq ($D1,$D1,$D0); # h1*r0 + &vpmuludq ($D0,$T0,$D0); # h0*r0 + + &vpmuludq ($T0,$T1,&QWP(16*3,"edx")); # r1*h3 + &vpaddq ($D4,$D4,$T0); + &vpmuludq ($T2,$T1,&QWP(16*2,"edx")); # r1*h2 + &vpaddq ($D3,$D3,$T2); + &vpmuludq ($T0,$T1,&QWP(16*1,"edx")); # r1*h1 + &vpaddq ($D2,$D2,$T0); + &vmovdqa ($T2,&QWP(16*5,"esp")); # s1 + &vpmuludq ($T1,$T1,&QWP(16*0,"edx")); # r1*h0 + &vpaddq ($D1,$D1,$T1); + &vmovdqa ($T0,&QWP(16*2,"esp")); # r2 + &vpmuludq ($T2,$T2,&QWP(16*4,"edx")); # s1*h4 + &vpaddq ($D0,$D0,$T2); + + &vpmuludq ($T1,$T0,&QWP(16*2,"edx")); # r2*h2 + &vpaddq ($D4,$D4,$T1); + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # r2*h1 + &vpaddq ($D3,$D3,$T2); + &vmovdqa ($T1,&QWP(16*6,"esp")); # s2 + &vpmuludq ($T0,$T0,&QWP(16*0,"edx")); # r2*h0 + &vpaddq ($D2,$D2,$T0); + &vpmuludq ($T2,$T1,&QWP(16*4,"edx")); # s2*h4 + &vpaddq ($D1,$D1,$T2); + &vmovdqa ($T0,&QWP(16*3,"esp")); # r3 + &vpmuludq ($T1,$T1,&QWP(16*3,"edx")); # s2*h3 + &vpaddq ($D0,$D0,$T1); + + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # r3*h1 + &vpaddq ($D4,$D4,$T2); + &vmovdqa ($T1,&QWP(16*7,"esp")); # s3 + &vpmuludq ($T0,$T0,&QWP(16*0,"edx")); # r3*h0 + &vpaddq ($D3,$D3,$T0); + &vpmuludq ($T2,$T1,&QWP(16*4,"edx")); # s3*h4 + &vpaddq ($D2,$D2,$T2); + &vpmuludq ($T0,$T1,&QWP(16*3,"edx")); # s3*h3 + &vpaddq ($D1,$D1,$T0); + &vmovdqa ($T2,&QWP(16*4,"esp")); # r4 + &vpmuludq ($T1,$T1,&QWP(16*2,"edx")); # s3*h2 + &vpaddq ($D0,$D0,$T1); + + &vmovdqa ($T0,&QWP(16*8,"esp")); # s4 + &vpmuludq ($T2,$T2,&QWP(16*0,"edx")); # r4*h0 + &vpaddq ($D4,$D4,$T2); + &vpmuludq ($T1,$T0,&QWP(16*4,"edx")); # s4*h4 + 
&vpaddq ($D3,$D3,$T1); + &vpmuludq ($T2,$T0,&QWP(16*1,"edx")); # s4*h1 + &vpaddq ($D0,$D0,$T2); + &vpmuludq ($T1,$T0,&QWP(16*2,"edx")); # s4*h2 + &vpaddq ($D1,$D1,$T1); + &vmovdqa ($MASK,&QWP(64,"ebx")); + &vpmuludq ($T0,$T0,&QWP(16*3,"edx")); # s4*h3 + &vpaddq ($D2,$D2,$T0); + + ################################################################ + # lazy reduction + &vpsrlq ($T0,$D3,26); + &vpand ($D3,$D3,$MASK); + &vpsrlq ($T1,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpaddq ($D4,$D4,$T0); # h3 -> h4 + &vpaddq ($D1,$D1,$T1); # h0 -> h1 + &vpsrlq ($T0,$D4,26); + &vpand ($D4,$D4,$MASK); + &vpsrlq ($T1,$D1,26); + &vpand ($D1,$D1,$MASK); + &vpaddq ($D2,$D2,$T1); # h1 -> h2 + &vpaddd ($D0,$D0,$T0); + &vpsllq ($T0,$T0,2); + &vpsrlq ($T1,$D2,26); + &vpand ($D2,$D2,$MASK); + &vpaddd ($D0,$D0,$T0); # h4 -> h0 + &vpaddd ($D3,$D3,$T1); # h2 -> h3 + &vpsrlq ($T1,$D3,26); + &vpsrlq ($T0,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpand ($D3,$D3,$MASK); + &vpaddd ($D1,$D1,$T0); # h0 -> h1 + &vpaddd ($D4,$D4,$T1); # h3 -> h4 + + &dec ("ecx"); + &jz (&label("square_break")); + + &vpunpcklqdq ($D0,$D0,&QWP(16*0,"esp")); # 0:r^1:0:r^2 + &vpunpcklqdq ($D1,$D1,&QWP(16*1,"esp")); + &vpunpcklqdq ($D2,$D2,&QWP(16*2,"esp")); + &vpunpcklqdq ($D3,$D3,&QWP(16*3,"esp")); + &vpunpcklqdq ($D4,$D4,&QWP(16*4,"esp")); + &jmp (&label("square")); + +&set_label("square_break"); + &vpsllq ($D0,$D0,32); # -> r^3:0:r^4:0 + &vpsllq ($D1,$D1,32); + &vpsllq ($D2,$D2,32); + &vpsllq ($D3,$D3,32); + &vpsllq ($D4,$D4,32); + &vpor ($D0,$D0,&QWP(16*0,"esp")); # r^3:r^1:r^4:r^2 + &vpor ($D1,$D1,&QWP(16*1,"esp")); + &vpor ($D2,$D2,&QWP(16*2,"esp")); + &vpor ($D3,$D3,&QWP(16*3,"esp")); + &vpor ($D4,$D4,&QWP(16*4,"esp")); + + &vpshufd ($D0,$D0,0b10001101); # -> r^1:r^2:r^3:r^4 + &vpshufd ($D1,$D1,0b10001101); + &vpshufd ($D2,$D2,0b10001101); + &vpshufd ($D3,$D3,0b10001101); + &vpshufd ($D4,$D4,0b10001101); + + &vmovdqu (&QWP(16*0,"edi"),$D0); # save the table + &vmovdqu (&QWP(16*1,"edi"),$D1); + &vmovdqu (&QWP(16*2,"edi"),$D2); + &vmovdqu (&QWP(16*3,"edi"),$D3); + &vmovdqu (&QWP(16*4,"edi"),$D4); + + &vpslld ($T1,$D1,2); + &vpslld ($T0,$D2,2); + &vpaddd ($T1,$T1,$D1); # *5 + &vpaddd ($T0,$T0,$D2); # *5 + &vmovdqu (&QWP(16*5,"edi"),$T1); + &vmovdqu (&QWP(16*6,"edi"),$T0); + &vpslld ($T1,$D3,2); + &vpslld ($T0,$D4,2); + &vpaddd ($T1,$T1,$D3); # *5 + &vpaddd ($T0,$T0,$D4); # *5 + &vmovdqu (&QWP(16*7,"edi"),$T1); + &vmovdqu (&QWP(16*8,"edi"),$T0); + + &mov ("esp","ebp"); + &lea ("edi",&DWP(-16*3,"edi")); # size de-optimization + &ret (); +&function_end_B("_poly1305_init_avx2"); + +######################################################################## +# now it's time to switch to %ymm + +my ($D0,$D1,$D2,$D3,$D4,$T0,$T1,$T2)=map("ymm$_",(0..7)); +my $MASK=$T2; + +sub X { my $reg=shift; $reg=~s/^ymm/xmm/; $reg; } + +&align (32); +&function_begin("_poly1305_blocks_avx2"); + &mov ("edi",&wparam(0)); # ctx + &mov ("esi",&wparam(1)); # inp + &mov ("ecx",&wparam(2)); # len + + &mov ("eax",&DWP(4*5,"edi")); # is_base2_26 + &and ("ecx",-16); + &jz (&label("nodata")); + &cmp ("ecx",64); + &jae (&label("enter_avx2")); + &test ("eax","eax"); # is_base2_26? + &jz (&label("enter_blocks")); + +&set_label("enter_avx2"); + &vzeroupper (); + + &call (&label("pic_point")); +&set_label("pic_point"); + &blindpop("ebx"); + &lea ("ebx",&DWP(&label("const_sse2")."-".&label("pic_point"),"ebx")); + + &test ("eax","eax"); # is_base2_26? 
+ &jnz (&label("base2_26")); + + &call ("_poly1305_init_avx2"); + + ################################################# base 2^32 -> base 2^26 + &mov ("eax",&DWP(0,"edi")); + &mov ("ecx",&DWP(3,"edi")); + &mov ("edx",&DWP(6,"edi")); + &mov ("esi",&DWP(9,"edi")); + &mov ("ebp",&DWP(13,"edi")); + + &shr ("ecx",2); + &and ("eax",0x3ffffff); + &shr ("edx",4); + &and ("ecx",0x3ffffff); + &shr ("esi",6); + &and ("edx",0x3ffffff); + + &mov (&DWP(4*0,"edi"),"eax"); + &mov (&DWP(4*1,"edi"),"ecx"); + &mov (&DWP(4*2,"edi"),"edx"); + &mov (&DWP(4*3,"edi"),"esi"); + &mov (&DWP(4*4,"edi"),"ebp"); + &mov (&DWP(4*5,"edi"),1); # is_base2_26 + + &mov ("esi",&wparam(1)); # [reload] inp + &mov ("ecx",&wparam(2)); # [reload] len + +&set_label("base2_26"); + &mov ("eax",&wparam(3)); # padbit + &mov ("ebp","esp"); + + &sub ("esp",32*(5+9)); + &and ("esp",-512); # ensure that frame + # doesn't cross page + # boundary, which is + # essential for + # misaligned 32-byte + # loads + + ################################################################ + # expand and copy pre-calculated table to stack + + &vmovdqu (&X($D0),&QWP(16*(3+0),"edi")); + &lea ("edx",&DWP(32*5+128,"esp")); # +128 size optimization + &vmovdqu (&X($D1),&QWP(16*(3+1),"edi")); + &vmovdqu (&X($D2),&QWP(16*(3+2),"edi")); + &vmovdqu (&X($D3),&QWP(16*(3+3),"edi")); + &vmovdqu (&X($D4),&QWP(16*(3+4),"edi")); + &lea ("edi",&DWP(16*3,"edi")); # size optimization + &vpermq ($D0,$D0,0b01000000); # 00001234 -> 12343434 + &vpermq ($D1,$D1,0b01000000); + &vpermq ($D2,$D2,0b01000000); + &vpermq ($D3,$D3,0b01000000); + &vpermq ($D4,$D4,0b01000000); + &vpshufd ($D0,$D0,0b11001000); # 12343434 -> 14243444 + &vpshufd ($D1,$D1,0b11001000); + &vpshufd ($D2,$D2,0b11001000); + &vpshufd ($D3,$D3,0b11001000); + &vpshufd ($D4,$D4,0b11001000); + &vmovdqa (&QWP(32*0-128,"edx"),$D0); + &vmovdqu (&X($D0),&QWP(16*5,"edi")); + &vmovdqa (&QWP(32*1-128,"edx"),$D1); + &vmovdqu (&X($D1),&QWP(16*6,"edi")); + &vmovdqa (&QWP(32*2-128,"edx"),$D2); + &vmovdqu (&X($D2),&QWP(16*7,"edi")); + &vmovdqa (&QWP(32*3-128,"edx"),$D3); + &vmovdqu (&X($D3),&QWP(16*8,"edi")); + &vmovdqa (&QWP(32*4-128,"edx"),$D4); + &vpermq ($D0,$D0,0b01000000); + &vpermq ($D1,$D1,0b01000000); + &vpermq ($D2,$D2,0b01000000); + &vpermq ($D3,$D3,0b01000000); + &vpshufd ($D0,$D0,0b11001000); + &vpshufd ($D1,$D1,0b11001000); + &vpshufd ($D2,$D2,0b11001000); + &vpshufd ($D3,$D3,0b11001000); + &vmovdqa (&QWP(32*5-128,"edx"),$D0); + &vmovd (&X($D0),&DWP(-16*3+4*0,"edi"));# load hash value + &vmovdqa (&QWP(32*6-128,"edx"),$D1); + &vmovd (&X($D1),&DWP(-16*3+4*1,"edi")); + &vmovdqa (&QWP(32*7-128,"edx"),$D2); + &vmovd (&X($D2),&DWP(-16*3+4*2,"edi")); + &vmovdqa (&QWP(32*8-128,"edx"),$D3); + &vmovd (&X($D3),&DWP(-16*3+4*3,"edi")); + &vmovd (&X($D4),&DWP(-16*3+4*4,"edi")); + &vmovdqa ($MASK,&QWP(64,"ebx")); + &neg ("eax"); # padbit + + &test ("ecx",63); + &jz (&label("even")); + + &mov ("edx","ecx"); + &and ("ecx",-64); + &and ("edx",63); + + &vmovdqu (&X($T0),&QWP(16*0,"esi")); + &cmp ("edx",32); + &jb (&label("one")); + + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &je (&label("two")); + + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &lea ("esi",&DWP(16*3,"esi")); + &lea ("ebx",&DWP(8,"ebx")); # three padbits + &lea ("edx",&DWP(32*5+128+8,"esp")); # --:r^1:r^2:r^3 (*) + &jmp (&label("tail")); + +&set_label("two"); + &lea ("esi",&DWP(16*2,"esi")); + &lea ("ebx",&DWP(16,"ebx")); # two padbits + &lea ("edx",&DWP(32*5+128+16,"esp"));# --:--:r^1:r^2 (*) + &jmp (&label("tail")); + +&set_label("one"); + &lea ("esi",&DWP(16*1,"esi")); + 
&vpxor ($T1,$T1,$T1); + &lea ("ebx",&DWP(32,"ebx","eax",8)); # one or no padbits + &lea ("edx",&DWP(32*5+128+24,"esp"));# --:--:--:r^1 (*) + &jmp (&label("tail")); + +# (*) spots marked with '--' are data from next table entry, but they +# are multiplied by 0 and therefore rendered insignificant + +&set_label("even",32); + &vmovdqu (&X($T0),&QWP(16*0,"esi")); # load input + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &vinserti128 ($T1,$T1,&QWP(16*3,"esi"),1); + &lea ("esi",&DWP(16*4,"esi")); + &sub ("ecx",64); + &jz (&label("tail")); + +&set_label("loop"); + ################################################################ + # ((inp[0]*r^4+r[4])*r^4+r[8])*r^4 + # ((inp[1]*r^4+r[5])*r^4+r[9])*r^3 + # ((inp[2]*r^4+r[6])*r^4+r[10])*r^2 + # ((inp[3]*r^4+r[7])*r^4+r[11])*r^1 + # \________/ \_______/ + ################################################################ + +sub vsplat_input { + &vmovdqa (&QWP(32*2,"esp"),$D2); + &vpsrldq ($D2,$T0,6); # splat input + &vmovdqa (&QWP(32*0,"esp"),$D0); + &vpsrldq ($D0,$T1,6); + &vmovdqa (&QWP(32*1,"esp"),$D1); + &vpunpckhqdq ($D1,$T0,$T1); # 4 + &vpunpcklqdq ($T0,$T0,$T1); # 0:1 + &vpunpcklqdq ($D2,$D2,$D0); # 2:3 + + &vpsrlq ($D0,$D2,30); + &vpsrlq ($D2,$D2,4); + &vpsrlq ($T1,$T0,26); + &vpsrlq ($D1,$D1,40); # 4 + &vpand ($D2,$D2,$MASK); # 2 + &vpand ($T0,$T0,$MASK); # 0 + &vpand ($T1,$T1,$MASK); # 1 + &vpand ($D0,$D0,$MASK); # 3 (*) + &vpor ($D1,$D1,&QWP(0,"ebx")); # padbit, yes, always + + # (*) note that output is counterintuitive, inp[3:4] is + # returned in $D1-2, while $D3-4 are preserved; +} + &vsplat_input (); + +sub vpmuladd { +my $addr = shift; + + &vpaddq ($D2,$D2,&QWP(32*2,"esp")); # add hash value + &vpaddq ($T0,$T0,&QWP(32*0,"esp")); + &vpaddq ($T1,$T1,&QWP(32*1,"esp")); + &vpaddq ($D0,$D0,$D3); + &vpaddq ($D1,$D1,$D4); + + ################################################################ + # d3 = h2*r1 + h0*r3 + h1*r2 + h3*r0 + h4*5*r4 + # d4 = h2*r2 + h0*r4 + h1*r3 + h3*r1 + h4*r0 + # d0 = h2*5*r3 + h0*r0 + h1*5*r4 + h3*5*r2 + h4*5*r1 + # d1 = h2*5*r4 + h0*r1 + h1*r0 + h3*5*r3 + h4*5*r2 + # d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3 + + &vpmuludq ($D3,$D2,&$addr(1)); # d3 = h2*r1 + &vmovdqa (QWP(32*1,"esp"),$T1); + &vpmuludq ($D4,$D2,&$addr(2)); # d4 = h2*r2 + &vmovdqa (QWP(32*3,"esp"),$D0); + &vpmuludq ($D0,$D2,&$addr(7)); # d0 = h2*s3 + &vmovdqa (QWP(32*4,"esp"),$D1); + &vpmuludq ($D1,$D2,&$addr(8)); # d1 = h2*s4 + &vpmuludq ($D2,$D2,&$addr(0)); # d2 = h2*r0 + + &vpmuludq ($T2,$T0,&$addr(3)); # h0*r3 + &vpaddq ($D3,$D3,$T2); # d3 += h0*r3 + &vpmuludq ($T1,$T0,&$addr(4)); # h0*r4 + &vpaddq ($D4,$D4,$T1); # d4 + h0*r4 + &vpmuludq ($T2,$T0,&$addr(0)); # h0*r0 + &vpaddq ($D0,$D0,$T2); # d0 + h0*r0 + &vmovdqa ($T2,&QWP(32*1,"esp")); # h1 + &vpmuludq ($T1,$T0,&$addr(1)); # h0*r1 + &vpaddq ($D1,$D1,$T1); # d1 += h0*r1 + &vpmuludq ($T0,$T0,&$addr(2)); # h0*r2 + &vpaddq ($D2,$D2,$T0); # d2 += h0*r2 + + &vpmuludq ($T1,$T2,&$addr(2)); # h1*r2 + &vpaddq ($D3,$D3,$T1); # d3 += h1*r2 + &vpmuludq ($T0,$T2,&$addr(3)); # h1*r3 + &vpaddq ($D4,$D4,$T0); # d4 += h1*r3 + &vpmuludq ($T1,$T2,&$addr(8)); # h1*s4 + &vpaddq ($D0,$D0,$T1); # d0 += h1*s4 + &vmovdqa ($T1,&QWP(32*3,"esp")); # h3 + &vpmuludq ($T0,$T2,&$addr(0)); # h1*r0 + &vpaddq ($D1,$D1,$T0); # d1 += h1*r0 + &vpmuludq ($T2,$T2,&$addr(1)); # h1*r1 + &vpaddq ($D2,$D2,$T2); # d2 += h1*r1 + + &vpmuludq ($T0,$T1,&$addr(0)); # h3*r0 + &vpaddq ($D3,$D3,$T0); # d3 += h3*r0 + &vpmuludq ($T2,$T1,&$addr(1)); # h3*r1 + &vpaddq ($D4,$D4,$T2); # d4 += h3*r1 + &vpmuludq 
($T0,$T1,&$addr(6)); # h3*s2 + &vpaddq ($D0,$D0,$T0); # d0 += h3*s2 + &vmovdqa ($T0,&QWP(32*4,"esp")); # h4 + &vpmuludq ($T2,$T1,&$addr(7)); # h3*s3 + &vpaddq ($D1,$D1,$T2); # d1+= h3*s3 + &vpmuludq ($T1,$T1,&$addr(8)); # h3*s4 + &vpaddq ($D2,$D2,$T1); # d2 += h3*s4 + + &vpmuludq ($T2,$T0,&$addr(8)); # h4*s4 + &vpaddq ($D3,$D3,$T2); # d3 += h4*s4 + &vpmuludq ($T1,$T0,&$addr(5)); # h4*s1 + &vpaddq ($D0,$D0,$T1); # d0 += h4*s1 + &vpmuludq ($T2,$T0,&$addr(0)); # h4*r0 + &vpaddq ($D4,$D4,$T2); # d4 += h4*r0 + &vmovdqa ($MASK,&QWP(64,"ebx")); + &vpmuludq ($T1,$T0,&$addr(6)); # h4*s2 + &vpaddq ($D1,$D1,$T1); # d1 += h4*s2 + &vpmuludq ($T0,$T0,&$addr(7)); # h4*s3 + &vpaddq ($D2,$D2,$T0); # d2 += h4*s3 +} + &vpmuladd (sub { my $i=shift; &QWP(32*$i-128,"edx"); }); + +sub vlazy_reduction { + ################################################################ + # lazy reduction + + &vpsrlq ($T0,$D3,26); + &vpand ($D3,$D3,$MASK); + &vpsrlq ($T1,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpaddq ($D4,$D4,$T0); # h3 -> h4 + &vpaddq ($D1,$D1,$T1); # h0 -> h1 + &vpsrlq ($T0,$D4,26); + &vpand ($D4,$D4,$MASK); + &vpsrlq ($T1,$D1,26); + &vpand ($D1,$D1,$MASK); + &vpaddq ($D2,$D2,$T1); # h1 -> h2 + &vpaddq ($D0,$D0,$T0); + &vpsllq ($T0,$T0,2); + &vpsrlq ($T1,$D2,26); + &vpand ($D2,$D2,$MASK); + &vpaddq ($D0,$D0,$T0); # h4 -> h0 + &vpaddq ($D3,$D3,$T1); # h2 -> h3 + &vpsrlq ($T1,$D3,26); + &vpsrlq ($T0,$D0,26); + &vpand ($D0,$D0,$MASK); + &vpand ($D3,$D3,$MASK); + &vpaddq ($D1,$D1,$T0); # h0 -> h1 + &vpaddq ($D4,$D4,$T1); # h3 -> h4 +} + &vlazy_reduction(); + + &vmovdqu (&X($T0),&QWP(16*0,"esi")); # load input + &vmovdqu (&X($T1),&QWP(16*1,"esi")); + &vinserti128 ($T0,$T0,&QWP(16*2,"esi"),1); + &vinserti128 ($T1,$T1,&QWP(16*3,"esi"),1); + &lea ("esi",&DWP(16*4,"esi")); + &sub ("ecx",64); + &jnz (&label("loop")); + +&set_label("tail"); + &vsplat_input (); + &and ("ebx",-64); # restore pointer + + &vpmuladd (sub { my $i=shift; &QWP(4+32*$i-128,"edx"); }); + + ################################################################ + # horizontal addition + + &vpsrldq ($T0,$D4,8); + &vpsrldq ($T1,$D3,8); + &vpaddq ($D4,$D4,$T0); + &vpsrldq ($T0,$D0,8); + &vpaddq ($D3,$D3,$T1); + &vpsrldq ($T1,$D1,8); + &vpaddq ($D0,$D0,$T0); + &vpsrldq ($T0,$D2,8); + &vpaddq ($D1,$D1,$T1); + &vpermq ($T1,$D4,2); # keep folding + &vpaddq ($D2,$D2,$T0); + &vpermq ($T0,$D3,2); + &vpaddq ($D4,$D4,$T1); + &vpermq ($T1,$D0,2); + &vpaddq ($D3,$D3,$T0); + &vpermq ($T0,$D1,2); + &vpaddq ($D0,$D0,$T1); + &vpermq ($T1,$D2,2); + &vpaddq ($D1,$D1,$T0); + &vpaddq ($D2,$D2,$T1); + + &vlazy_reduction(); + + &cmp ("ecx",0); + &je (&label("done")); + + ################################################################ + # clear all but single word + + &vpshufd (&X($D0),&X($D0),0b11111100); + &lea ("edx",&DWP(32*5+128,"esp")); # restore pointer + &vpshufd (&X($D1),&X($D1),0b11111100); + &vpshufd (&X($D2),&X($D2),0b11111100); + &vpshufd (&X($D3),&X($D3),0b11111100); + &vpshufd (&X($D4),&X($D4),0b11111100); + &jmp (&label("even")); + +&set_label("done",16); + &vmovd (&DWP(-16*3+4*0,"edi"),&X($D0));# store hash value + &vmovd (&DWP(-16*3+4*1,"edi"),&X($D1)); + &vmovd (&DWP(-16*3+4*2,"edi"),&X($D2)); + &vmovd (&DWP(-16*3+4*3,"edi"),&X($D3)); + &vmovd (&DWP(-16*3+4*4,"edi"),&X($D4)); + &vzeroupper (); + &mov ("esp","ebp"); +&set_label("nodata"); +&function_end("_poly1305_blocks_avx2"); +} +&set_label("const_sse2",64); + &data_word(1<<24,0, 1<<24,0, 1<<24,0, 1<<24,0); + &data_word(0,0, 0,0, 0,0, 0,0); + &data_word(0x03ffffff,0,0x03ffffff,0, 0x03ffffff,0, 0x03ffffff,0); + 
&data_word(0x0fffffff,0x0ffffffc,0x0ffffffc,0x0ffffffc); +} +&asciz ("Poly1305 for x86, CRYPTOGAMS by "); +&align (4); + +&asm_finish(); + +close STDOUT; diff --git a/crypto/nasm.props b/crypto/nasm.props new file mode 100644 index 0000000..c3c5610 --- /dev/null +++ b/crypto/nasm.props @@ -0,0 +1,18 @@ + + + + Midl + CustomBuild + + + + $(IntDir)%(FileName).obj + 0 + c:\dev\nasm\nasm.exe -f win32 [AllOptions] [AdditionalOptions] %(FullPath) + c:\dev\nasm\nasm.exe -f win64 [AllOptions] [AdditionalOptions] %(FullPath) + echo NASM not supported on this platform + Assembling [Inputs]... + + + \ No newline at end of file diff --git a/crypto/nasm.targets b/crypto/nasm.targets new file mode 100644 index 0000000..8dd1989 --- /dev/null +++ b/crypto/nasm.targets @@ -0,0 +1,82 @@ + + + + + + _NASM + + + + + $(ComputeLinkInputsTargets); + ComputeNASMOutput; + + + $(ComputeLibInputsTargets); + ComputeNASMOutput; + + + + $(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml + + + + + + + + @(NASM, '|') + + + + + + + + + + + + + diff --git a/crypto/nasm.xml b/crypto/nasm.xml new file mode 100644 index 0000000..c2bc89b --- /dev/null +++ b/crypto/nasm.xml @@ -0,0 +1,308 @@ + + + + + + + + + + General + + + + + Preprocessing Options + + + + + Assembler Options + + + + + Advanced + + + + + Command Line + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Execute Before + + + Specifies the targets for the build customization to run before. + + + + + + + + + + + Execute After + + + Specifies the targets for the build customization to run after. + + + + + + + + + + + + + + + Additional Options + + + Additional Options + + + + + + + + diff --git a/crypto/poly1305_x64_gas.s b/crypto/poly1305_x64_gas.s new file mode 100644 index 0000000..709ca13 --- /dev/null +++ b/crypto/poly1305_x64_gas.s @@ -0,0 +1,3132 @@ +.align 64 +.Lconst: +.Lmask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +.L129: +.long 16777216,0,16777216,0,16777216,0,16777216,0 +.Lmask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +.Lpermd_avx2: +.long 2,2,2,3,2,0,2,1 +.Lpermd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +.L2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +.L2_44_inp_shift: +.quad 0,12,24,64 +.L2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +.L2_44_shift_rgt: +.quad 44,44,42,64 +.L2_44_shift_lft: +.quad 8,8,10,64 + +.align 64 +.Lx_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.Lx_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global poly1305_init_x86_64 +.global poly1305_blocks_x86_64 +.global poly1305_emit_x86_64 +.global poly1305_emit_avx +.global poly1305_blocks_avx +.global poly1305_blocks_avx2 +.global poly1305_blocks_avx512 + + +.type poly1305_init_x86_64,@function +.align 32 +poly1305_init_x86_64: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + movq %rax,16(%rdi) + + cmpq $0,%rsi + je .Lno_key + + + + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + andq 8(%rsi),%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movl $1,%eax +.Lno_key: + ret +.size poly1305_init_x86_64,.-poly1305_init_x86_64 + +.type poly1305_blocks_x86_64,@function +.align 32 +poly1305_blocks_x86_64: +.cfi_startproc +.Lblocks: + shrq $4,%rdx + jz .Lno_data + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + 
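Two details of the flat assembly that just started are worth spelling out. The constant pool at .Lconst defines the 26-bit limb masks used everywhere below, and poly1305_init_x86_64 applies the standard Poly1305 clamp to r: its two andq masks clear the top four bits of key bytes 3, 7, 11 and 15 and the low two bits of bytes 4, 8 and 12 (RFC 7539, section 2.5). A small C sketch of that clamp, with poly1305_clamp as an illustrative name:

    #include <stdint.h>
    #include <string.h>

    /* key points at the 16-byte r-half of the one-time key (little-endian). */
    static void poly1305_clamp(uint64_t r[2], const uint8_t key[16]) {
        memcpy(&r[0], key + 0, 8);
        memcpy(&r[1], key + 8, 8);
        r[0] &= 0x0ffffffc0fffffffULL;  /* same masks as the movq/andq pairs above */
        r[1] &= 0x0ffffffc0ffffffcULL;
    }

The pushes that follow finish saving the callee-saved registers in the prologue of poly1305_blocks_x86_64.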
pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movq 16(%rdi),%rbp + + movq %r13,%r12 + shrq $2,%r13 + movq %r12,%rax + addq %r12,%r13 + jmp .Loop + +.align 32 +.Loop: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + movq %r12,%rax + decq %r15 + jnz .Loop + + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data: +.Lblocks_epilogue: + ret +.cfi_endproc +.size poly1305_blocks_x86_64,.-poly1305_blocks_x86_64 + +.type poly1305_emit_x86_64,@function +.align 32 +poly1305_emit_x86_64: +.Lemit: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret +.size poly1305_emit_x86_64,.-poly1305_emit_x86_64 +.type __poly1305_block,@function +.align 32 +__poly1305_block: + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + ret +.size __poly1305_block,.-__poly1305_block + +.type __poly1305_init_avx,@function +.align 32 +__poly1305_init_avx: + movq %r11,%r14 + movq %r12,%rbx + xorq %rbp,%rbp + + leaq 48+64(%rdi),%rdi + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + movq %r14,%r8 + andl %r14d,%eax + movq %r11,%r9 + andl %r11d,%edx + movl %eax,-64(%rdi) + shrq $26,%r8 + movl %edx,-60(%rdi) + shrq $26,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + andl %r8d,%eax + andl %r9d,%edx + movl %eax,-48(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-44(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,-32(%rdi) + shrq $26,%r8 + movl %edx,-28(%rdi) + shrq $26,%r9 + + movq %rbx,%rax + movq %r12,%rdx + shlq $12,%rax + shlq $12,%rdx + orq %r8,%rax + orq %r9,%rdx + andl $0x3ffffff,%eax + andl $0x3ffffff,%edx + movl %eax,-16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-12(%rdi) + leal 
(%rdx,%rdx,4),%edx + movl %eax,0(%rdi) + movq %rbx,%r8 + movl %edx,4(%rdi) + movq %r12,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + shrq $14,%r8 + shrq $14,%r9 + andl %r8d,%eax + andl %r9d,%edx + movl %eax,16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,20(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,32(%rdi) + shrq $26,%r8 + movl %edx,36(%rdi) + shrq $26,%r9 + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,48(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r9d,52(%rdi) + leaq (%r9,%r9,4),%r9 + movl %r8d,64(%rdi) + movl %r9d,68(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-52(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-36(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-20(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-4(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,12(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,28(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,44(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,60(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,76(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-56(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-40(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-24(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-8(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,8(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,24(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,40(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,56(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,72(%rdi) + + leaq -48-64(%rdi),%rdi + ret +.size __poly1305_init_avx,.-__poly1305_init_avx + +.type poly1305_blocks_avx,@function +.align 32 +poly1305_blocks_avx: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx: + andq $-16,%rdx + jz .Lno_data_avx + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx + + testq $31,%rdx + jz .Leven_avx + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + + call __poly1305_block + + testq %rcx,%rcx + jz .Lstore_base2_64_avx + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq 
%rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + subq $16,%r15 + jz .Lstore_base2_26_avx + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx + +.align 32 +.Lstore_base2_64_avx: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx + +.align 16 +.Lstore_base2_26_avx: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx: +.Lblocks_avx_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $31,%rdx + jz .Linit_avx + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + +.Linit_avx: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx: + movq %r15,%rdx + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx_epilogue: + jmp .Ldo_avx +.cfi_endproc + +.align 32 +.Leven_avx: +.cfi_startproc + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx: + leaq -88(%rsp),%r11 +.cfi_def_cfa %r11,0x60 + subq $0x178,%rsp + subq $64,%rdx + leaq -32(%rsi),%rax + cmovcq %rax,%rsi + + vmovdqu 48(%rdi),%xmm14 + leaq 112(%rdi),%rdi + leaq .Lconst(%rip),%rcx + + + + vmovdqu 32(%rsi),%xmm5 + vmovdqu 48(%rsi),%xmm6 + vmovdqa 64(%rcx),%xmm15 + + vpsrldq $6,%xmm5,%xmm7 + vpsrldq $6,%xmm6,%xmm8 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + vpsrlq $40,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + jbe .Lskip_loop_avx + + + vmovdqu -48(%rdi),%xmm11 + vmovdqu -32(%rdi),%xmm12 + vpshufd 
$0xEE,%xmm14,%xmm13 + vpshufd $0x44,%xmm14,%xmm10 + vmovdqa %xmm13,-144(%r11) + vmovdqa %xmm10,0(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vmovdqu -16(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-128(%r11) + vmovdqa %xmm11,16(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqu 0(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-112(%r11) + vmovdqa %xmm12,32(%rsp) + vpshufd $0xEE,%xmm10,%xmm14 + vmovdqu 16(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm14,-96(%r11) + vmovdqa %xmm10,48(%rsp) + vpshufd $0xEE,%xmm11,%xmm13 + vmovdqu 32(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm13,-80(%r11) + vmovdqa %xmm11,64(%rsp) + vpshufd $0xEE,%xmm12,%xmm14 + vmovdqu 48(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm14,-64(%r11) + vmovdqa %xmm12,80(%rsp) + vpshufd $0xEE,%xmm10,%xmm13 + vmovdqu 64(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm13,-48(%r11) + vmovdqa %xmm10,96(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-32(%r11) + vmovdqa %xmm11,112(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqa 0(%rsp),%xmm14 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-16(%r11) + vmovdqa %xmm12,128(%rsp) + + jmp .Loop_avx + +.align 32 +.Loop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq %xmm5,%xmm14,%xmm10 + vpmuludq %xmm6,%xmm14,%xmm11 + vmovdqa %xmm2,32(%r11) + vpmuludq %xmm7,%xmm14,%xmm12 + vmovdqa 16(%rsp),%xmm2 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vmovdqa %xmm0,0(%r11) + vpmuludq 32(%rsp),%xmm9,%xmm0 + vmovdqa %xmm1,16(%r11) + vpmuludq %xmm8,%xmm2,%xmm1 + vpaddq %xmm0,%xmm10,%xmm10 + vpaddq %xmm1,%xmm14,%xmm14 + vmovdqa %xmm3,48(%r11) + vpmuludq %xmm7,%xmm2,%xmm0 + vpmuludq %xmm6,%xmm2,%xmm1 + vpaddq %xmm0,%xmm13,%xmm13 + vmovdqa 48(%rsp),%xmm3 + vpaddq %xmm1,%xmm12,%xmm12 + vmovdqa %xmm4,64(%r11) + vpmuludq %xmm5,%xmm2,%xmm2 + vpmuludq %xmm7,%xmm3,%xmm0 + vpaddq %xmm2,%xmm11,%xmm11 + + vmovdqa 64(%rsp),%xmm4 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm3,%xmm1 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm1,%xmm13,%xmm13 + vmovdqa 80(%rsp),%xmm2 + vpaddq %xmm3,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm4,%xmm0 + vpmuludq %xmm8,%xmm4,%xmm4 + vpaddq %xmm0,%xmm11,%xmm11 + vmovdqa 96(%rsp),%xmm3 + vpaddq %xmm4,%xmm10,%xmm10 + + vmovdqa 128(%rsp),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm1 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm1,%xmm14,%xmm14 + vpaddq %xmm2,%xmm13,%xmm13 + vpmuludq %xmm9,%xmm3,%xmm0 + vpmuludq %xmm8,%xmm3,%xmm1 + vpaddq %xmm0,%xmm12,%xmm12 + vmovdqu 0(%rsi),%xmm0 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm3,%xmm3 + vpmuludq %xmm7,%xmm4,%xmm7 + vpaddq %xmm3,%xmm10,%xmm10 + + vmovdqu 16(%rsi),%xmm1 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm8,%xmm4,%xmm8 + vpmuludq %xmm9,%xmm4,%xmm9 + vpsrldq $6,%xmm0,%xmm2 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm9,%xmm13,%xmm13 + vpsrldq $6,%xmm1,%xmm3 + vpmuludq 112(%rsp),%xmm5,%xmm9 + vpmuludq %xmm6,%xmm4,%xmm5 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpaddq %xmm9,%xmm14,%xmm14 + vmovdqa -144(%r11),%xmm9 + vpaddq %xmm5,%xmm10,%xmm10 + + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + + vpsrldq $5,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpand 0(%rcx),%xmm4,%xmm4 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + leaq 32(%rsi),%rax + leaq 
64(%rsi),%rsi + subq $64,%rdx + cmovcq %rax,%rsi + + + + + + + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vmovdqa -128(%r11),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm5 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm5,%xmm12,%xmm12 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpmuludq -112(%r11),%xmm4,%xmm5 + vpaddq %xmm9,%xmm14,%xmm14 + + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm2,%xmm7,%xmm6 + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -96(%r11),%xmm8 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm7,%xmm6 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm6,%xmm12,%xmm12 + vpaddq %xmm7,%xmm11,%xmm11 + + vmovdqa -80(%r11),%xmm9 + vpmuludq %xmm2,%xmm8,%xmm5 + vpmuludq %xmm1,%xmm8,%xmm6 + vpaddq %xmm5,%xmm14,%xmm14 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -64(%r11),%xmm7 + vpmuludq %xmm0,%xmm8,%xmm8 + vpmuludq %xmm4,%xmm9,%xmm5 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm5,%xmm11,%xmm11 + vmovdqa -48(%r11),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm9 + vpmuludq %xmm1,%xmm7,%xmm6 + vpaddq %xmm9,%xmm10,%xmm10 + + vmovdqa -16(%r11),%xmm9 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm7,%xmm7 + vpmuludq %xmm4,%xmm8,%xmm5 + vpaddq %xmm7,%xmm13,%xmm13 + vpaddq %xmm5,%xmm12,%xmm12 + vmovdqu 32(%rsi),%xmm5 + vpmuludq %xmm3,%xmm8,%xmm7 + vpmuludq %xmm2,%xmm8,%xmm8 + vpaddq %xmm7,%xmm11,%xmm11 + vmovdqu 48(%rsi),%xmm6 + vpaddq %xmm8,%xmm10,%xmm10 + + vpmuludq %xmm2,%xmm9,%xmm2 + vpmuludq %xmm3,%xmm9,%xmm3 + vpsrldq $6,%xmm5,%xmm7 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm9,%xmm4 + vpsrldq $6,%xmm6,%xmm8 + vpaddq %xmm3,%xmm12,%xmm2 + vpaddq %xmm4,%xmm13,%xmm3 + vpmuludq -32(%r11),%xmm0,%xmm4 + vpmuludq %xmm1,%xmm9,%xmm0 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpaddq %xmm4,%xmm14,%xmm4 + vpaddq %xmm0,%xmm10,%xmm0 + + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + + vpsrldq $5,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vmovdqa 0(%rsp),%xmm14 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpand 0(%rcx),%xmm9,%xmm9 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + + + + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm11,%xmm1 + + vpsrlq $26,%xmm4,%xmm10 + vpand %xmm15,%xmm4,%xmm4 + + vpsrlq $26,%xmm1,%xmm11 + vpand %xmm15,%xmm1,%xmm1 + vpaddq %xmm11,%xmm2,%xmm2 + + vpaddq %xmm10,%xmm0,%xmm0 + vpsllq $2,%xmm10,%xmm10 + vpaddq %xmm10,%xmm0,%xmm0 + + vpsrlq $26,%xmm2,%xmm12 + vpand %xmm15,%xmm2,%xmm2 + vpaddq %xmm12,%xmm3,%xmm3 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm1,%xmm1 + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + ja .Loop_avx + +.Lskip_loop_avx: + + + + vpshufd $0x10,%xmm14,%xmm14 + addq $32,%rdx + jnz .Long_tail_avx + + vpaddq %xmm2,%xmm7,%xmm7 + vpaddq %xmm0,%xmm5,%xmm5 + vpaddq %xmm1,%xmm6,%xmm6 + vpaddq %xmm3,%xmm8,%xmm8 + vpaddq %xmm4,%xmm9,%xmm9 + +.Long_tail_avx: + vmovdqa %xmm2,32(%r11) + vmovdqa %xmm0,0(%r11) + vmovdqa %xmm1,16(%r11) + vmovdqa %xmm3,48(%r11) + vmovdqa %xmm4,64(%r11) + + + + + + + + vpmuludq %xmm7,%xmm14,%xmm12 + vpmuludq %xmm5,%xmm14,%xmm10 + vpshufd $0x10,-48(%rdi),%xmm2 + vpmuludq %xmm6,%xmm14,%xmm11 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm8,%xmm2,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpshufd $0x10,-32(%rdi),%xmm3 + vpmuludq %xmm7,%xmm2,%xmm1 + vpaddq 
%xmm1,%xmm13,%xmm13 + vpshufd $0x10,-16(%rdi),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm9,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + vpshufd $0x10,0(%rdi),%xmm2 + vpmuludq %xmm7,%xmm4,%xmm1 + vpaddq %xmm1,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm4,%xmm0 + vpaddq %xmm0,%xmm13,%xmm13 + vpshufd $0x10,16(%rdi),%xmm3 + vpmuludq %xmm5,%xmm4,%xmm4 + vpaddq %xmm4,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm2,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpshufd $0x10,32(%rdi),%xmm4 + vpmuludq %xmm8,%xmm2,%xmm2 + vpaddq %xmm2,%xmm10,%xmm10 + + vpmuludq %xmm6,%xmm3,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm3,%xmm13,%xmm13 + vpshufd $0x10,48(%rdi),%xmm2 + vpmuludq %xmm9,%xmm4,%xmm1 + vpaddq %xmm1,%xmm12,%xmm12 + vpshufd $0x10,64(%rdi),%xmm3 + vpmuludq %xmm8,%xmm4,%xmm0 + vpaddq %xmm0,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm14,%xmm14 + vpmuludq %xmm9,%xmm3,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpmuludq %xmm8,%xmm3,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm7,%xmm3,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm6,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + jz .Lshort_tail_avx + + vmovdqu 0(%rsi),%xmm0 + vmovdqu 16(%rsi),%xmm1 + + vpsrldq $6,%xmm0,%xmm2 + vpsrldq $6,%xmm1,%xmm3 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + vpsrlq $40,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpshufd $0x32,-64(%rdi),%xmm9 + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpshufd $0x32,-48(%rdi),%xmm7 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpaddq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpshufd $0x32,-32(%rdi),%xmm8 + vpmuludq %xmm2,%xmm7,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpshufd $0x32,-16(%rdi),%xmm9 + vpmuludq %xmm1,%xmm7,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + + vpshufd $0x32,0(%rdi),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm6 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm9,%xmm5 + vpaddq %xmm5,%xmm13,%xmm13 + vpshufd $0x32,16(%rdi),%xmm8 + vpmuludq %xmm0,%xmm9,%xmm9 + vpaddq %xmm9,%xmm12,%xmm12 + vpmuludq %xmm4,%xmm7,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpshufd $0x32,32(%rdi),%xmm9 + vpmuludq %xmm3,%xmm7,%xmm7 + vpaddq %xmm7,%xmm10,%xmm10 + + vpmuludq %xmm1,%xmm8,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm8,%xmm8 + vpaddq %xmm8,%xmm13,%xmm13 + vpshufd $0x32,48(%rdi),%xmm7 + vpmuludq %xmm4,%xmm9,%xmm6 + vpaddq %xmm6,%xmm12,%xmm12 + vpshufd $0x32,64(%rdi),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm5 + vpaddq %xmm5,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm9 + vpaddq %xmm9,%xmm10,%xmm10 + + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm14,%xmm14 + vpmuludq %xmm4,%xmm8,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm3,%xmm8,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm2,%xmm8,%xmm6 + vpaddq 
%xmm6,%xmm11,%xmm11 + vpmuludq %xmm1,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + +.Lshort_tail_avx: + + + + vpsrldq $8,%xmm14,%xmm9 + vpsrldq $8,%xmm13,%xmm8 + vpsrldq $8,%xmm11,%xmm6 + vpsrldq $8,%xmm10,%xmm5 + vpsrldq $8,%xmm12,%xmm7 + vpaddq %xmm8,%xmm13,%xmm13 + vpaddq %xmm9,%xmm14,%xmm14 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vpaddq %xmm7,%xmm12,%xmm12 + + + + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm14,%xmm4 + vpand %xmm15,%xmm14,%xmm14 + + vpsrlq $26,%xmm11,%xmm1 + vpand %xmm15,%xmm11,%xmm11 + vpaddq %xmm1,%xmm12,%xmm12 + + vpaddq %xmm4,%xmm10,%xmm10 + vpsllq $2,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpsrlq $26,%xmm12,%xmm2 + vpand %xmm15,%xmm12,%xmm12 + vpaddq %xmm2,%xmm13,%xmm13 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vmovd %xmm10,-112(%rdi) + vmovd %xmm11,-108(%rdi) + vmovd %xmm12,-104(%rdi) + vmovd %xmm13,-100(%rdi) + vmovd %xmm14,-96(%rdi) + leaq 88(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx,.-poly1305_blocks_avx + +.type poly1305_emit_avx,@function +.align 32 +poly1305_emit_avx: + cmpl $0,20(%rdi) + je .Lemit + + movl 0(%rdi),%eax + movl 4(%rdi),%ecx + movl 8(%rdi),%r8d + movl 12(%rdi),%r11d + movl 16(%rdi),%r10d + + shlq $26,%rcx + movq %r8,%r9 + shlq $52,%r8 + addq %rcx,%rax + shrq $12,%r9 + addq %rax,%r8 + adcq $0,%r9 + + shlq $14,%r11 + movq %r10,%rax + shrq $24,%r10 + addq %r11,%r9 + shlq $40,%rax + addq %rax,%r9 + adcq $0,%r10 + + movq %r10,%rax + movq %r10,%rcx + andq $3,%r10 + shrq $2,%rax + andq $-4,%rcx + addq %rcx,%rax + addq %rax,%r8 + adcq $0,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret +.size poly1305_emit_avx,.-poly1305_emit_avx +.type poly1305_blocks_avx2,@function +.align 32 +poly1305_blocks_avx2: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2 + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2: + andq $-16,%rdx + jz .Lno_data_avx2 + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx2 + + testq $63,%rdx + jz .Leven_avx2 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx2_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 
16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz .Lstore_base2_26_avx2 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx2 + +.align 32 +.Lstore_base2_64_avx2: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx2 + +.align 16 +.Lstore_base2_26_avx2: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx2: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2: +.Lblocks_avx2_epilogue: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx2_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz .Linit_avx2 + +.Lbase2_64_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2 + +.Linit_avx2: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx2: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue: + jmp .Ldo_avx2 +.cfi_endproc + +.align 32 +.Leven_avx2: +.cfi_startproc + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2: + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd 
%ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2 + jmp .Loop_avx2 + +.align 32 +.Loop_avx2: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + 
vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2 + +.byte 0x66,0x90 +.Ltail_avx2: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + 
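Written out in scalar form, the tail multiply above (like the .Loop_avx2 body before it) is one step of the Poly1305 recurrence, h = (h + m) * r mod 2^130 - 5, over five 26-bit limbs, with s[i] = 5*r[i] standing in for the precomputed "*5" table entries; the d0..d4 comments in the Perl generator earlier in this commit spell out the same products. The AVX2 loop evaluates four such steps at once against r^1..r^4 and then folds the four lanes together, which is what the vpsrldq/vpermq addition pairs above do. A sketch under those assumptions, reusing the lazy_reduce helper sketched earlier (illustrative names throughout):

    #include <stdint.h>

    /* m[0..4] are the 26-bit limbs of one 16-byte block, with the pad bit
       already set in m[4]; r and s = 5*r are the clamped key limbs. */
    static void poly1305_block_sketch(uint32_t h[5], const uint32_t m[5],
                                      const uint32_t r[5], const uint32_t s[5]) {
        uint64_t h0 = h[0] + (uint64_t)m[0], h1 = h[1] + (uint64_t)m[1],
                 h2 = h[2] + (uint64_t)m[2], h3 = h[3] + (uint64_t)m[3],
                 h4 = h[4] + (uint64_t)m[4];
        uint64_t d[5];

        d[0] = h0*r[0] + h1*s[4] + h2*s[3] + h3*s[2] + h4*s[1];
        d[1] = h0*r[1] + h1*r[0] + h2*s[4] + h3*s[3] + h4*s[2];
        d[2] = h0*r[2] + h1*r[1] + h2*r[0] + h3*s[4] + h4*s[3];
        d[3] = h0*r[3] + h1*r[2] + h2*r[1] + h3*r[0] + h4*s[4];
        d[4] = h0*r[4] + h1*r[3] + h2*r[2] + h3*r[1] + h4*r[0];

        lazy_reduce(d);                      /* carry pass sketched earlier */
        for (int i = 0; i < 5; i++) h[i] = (uint32_t)d[i];
    }

The remaining vmovd stores below write limbs 3 and 4 of the folded result back into the context.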
vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +.type poly1305_blocks_avx512,@function +.align 32 +poly1305_blocks_avx512: +.cfi_startproc + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae .Lblocks_avx2_512 + testl %r8d,%r8d + jz .Lblocks + +.Lblocks_avx2_512: + andq $-16,%rdx + jz .Lno_data_avx2_512 + + vzeroupper + + testl %r8d,%r8d + jz .Lbase2_64_avx2_512 + + testq $63,%rdx + jz .Leven_avx2_512 + + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lblocks_avx2_body_512: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +.Lbase2_26_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_26_pre_avx2_512 + + testq %rcx,%rcx + jz .Lstore_base2_64_avx2_512 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz .Lstore_base2_26_avx2_512 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp .Lproceed_avx2_512 + +.align 32 +.Lstore_base2_64_avx2_512: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp .Ldone_avx2_512 + +.align 16 +.Lstore_base2_26_avx2_512: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.align 16 +.Ldone_avx2_512: + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lno_data_avx2_512: +.Lblocks_avx2_epilogue_512: + ret +.cfi_endproc + +.align 32 +.Lbase2_64_avx2_512: +.cfi_startproc + pushq %rbx +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbx,-16 + pushq %rbp +.cfi_adjust_cfa_offset 8 +.cfi_offset %rbp,-24 + pushq %r12 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r12,-32 + pushq %r13 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r13,-40 + pushq %r14 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r14,-48 + pushq %r15 +.cfi_adjust_cfa_offset 8 +.cfi_offset %r15,-56 +.Lbase2_64_avx2_body_512: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq 
%r12,%r13 + + testq $63,%rdx + jz .Linit_avx2_512 + +.Lbase2_64_pre_avx2_512: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz .Lbase2_64_pre_avx2_512 + +.Linit_avx2_512: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +.Lproceed_avx2_512: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 +.cfi_restore %r15 + movq 8(%rsp),%r14 +.cfi_restore %r14 + movq 16(%rsp),%r13 +.cfi_restore %r13 + movq 24(%rsp),%r12 +.cfi_restore %r12 + movq 32(%rsp),%rbp +.cfi_restore %rbp + movq 40(%rsp),%rbx +.cfi_restore %rbx + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp +.cfi_adjust_cfa_offset -48 +.Lbase2_64_avx2_epilogue_512: + jmp .Ldo_avx2_512 +.cfi_endproc + +.align 32 +.Leven_avx2_512: +.cfi_startproc + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +.Ldo_avx2_512: + cmpq $512,%rdx + jae .Lblocks_avx512 +.Lskip_avx512: + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz .Ltail_avx2_512 + jmp .Loop_avx2_512 + +.align 32 +.Loop_avx2_512: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq 
%ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz .Loop_avx2_512 + +.byte 0x66,0x90 +.Ltail_avx2_512: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq 
%ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + vzeroupper + ret +.cfi_endproc +.size poly1305_blocks_avx2,.-poly1305_blocks_avx2 +.cfi_startproc +.Lblocks_avx512: + movl $15,%eax + kmovw %eax,%k2 + leaq -8(%rsp),%r11 +.cfi_def_cfa %r11,16 + subq $0x128,%rsp + leaq .Lconst(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm9 + + + vmovdqu -64(%rdi),%xmm11 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm12 + movq $0x20,%rax + vmovdqu -32(%rdi),%xmm7 + vmovdqu -16(%rdi),%xmm13 + vmovdqu 0(%rdi),%xmm8 + vmovdqu 16(%rdi),%xmm14 + vmovdqu 32(%rdi),%xmm10 + vmovdqu 48(%rdi),%xmm15 + vmovdqu 64(%rdi),%xmm6 + vpermd %zmm11,%zmm9,%zmm16 + vpbroadcastq 64(%rcx),%zmm5 + vpermd %zmm12,%zmm9,%zmm17 + vpermd %zmm7,%zmm9,%zmm21 + vpermd %zmm13,%zmm9,%zmm18 + vmovdqa64 %zmm16,0(%rsp){%k2} + vpsrlq $32,%zmm16,%zmm7 + vpermd %zmm8,%zmm9,%zmm22 + vmovdqu64 %zmm17,0(%rsp,%rax,1){%k2} + vpsrlq $32,%zmm17,%zmm8 + vpermd %zmm14,%zmm9,%zmm19 + vmovdqa64 %zmm21,64(%rsp){%k2} + vpermd %zmm10,%zmm9,%zmm23 + vpermd %zmm15,%zmm9,%zmm20 + vmovdqu64 %zmm18,64(%rsp,%rax,1){%k2} + vpermd %zmm6,%zmm9,%zmm24 + vmovdqa64 %zmm22,128(%rsp){%k2} + vmovdqu64 %zmm19,128(%rsp,%rax,1){%k2} + vmovdqa64 %zmm23,192(%rsp){%k2} + vmovdqu64 %zmm20,192(%rsp,%rax,1){%k2} + vmovdqa64 %zmm24,256(%rsp){%k2} + + + + + + + + + + + vpmuludq %zmm7,%zmm16,%zmm11 + vpmuludq %zmm7,%zmm17,%zmm12 + vpmuludq %zmm7,%zmm18,%zmm13 + vpmuludq %zmm7,%zmm19,%zmm14 + vpmuludq %zmm7,%zmm20,%zmm15 + vpsrlq $32,%zmm18,%zmm9 + + vpmuludq %zmm8,%zmm24,%zmm25 + vpmuludq %zmm8,%zmm16,%zmm26 + vpmuludq %zmm8,%zmm17,%zmm27 + vpmuludq %zmm8,%zmm18,%zmm28 + vpmuludq %zmm8,%zmm19,%zmm29 + vpsrlq $32,%zmm19,%zmm10 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq 
%zmm27,%zmm13,%zmm13 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + + vpmuludq %zmm9,%zmm23,%zmm25 + vpmuludq %zmm9,%zmm24,%zmm26 + vpmuludq %zmm9,%zmm17,%zmm28 + vpmuludq %zmm9,%zmm18,%zmm29 + vpmuludq %zmm9,%zmm16,%zmm27 + vpsrlq $32,%zmm20,%zmm6 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm10,%zmm22,%zmm25 + vpmuludq %zmm10,%zmm16,%zmm28 + vpmuludq %zmm10,%zmm17,%zmm29 + vpmuludq %zmm10,%zmm23,%zmm26 + vpmuludq %zmm10,%zmm24,%zmm27 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm6,%zmm24,%zmm28 + vpmuludq %zmm6,%zmm16,%zmm29 + vpmuludq %zmm6,%zmm21,%zmm25 + vpmuludq %zmm6,%zmm22,%zmm26 + vpmuludq %zmm6,%zmm23,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + + + + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm15,%zmm29 + vpandq %zmm5,%zmm15,%zmm15 + + vpsrlq $26,%zmm12,%zmm26 + vpandq %zmm5,%zmm12,%zmm12 + vpaddq %zmm26,%zmm13,%zmm13 + + vpaddq %zmm29,%zmm11,%zmm11 + vpsllq $2,%zmm29,%zmm29 + vpaddq %zmm29,%zmm11,%zmm11 + + vpsrlq $26,%zmm13,%zmm27 + vpandq %zmm5,%zmm13,%zmm13 + vpaddq %zmm27,%zmm14,%zmm14 + + vpsrlq $26,%zmm11,%zmm25 + vpandq %zmm5,%zmm11,%zmm11 + vpaddq %zmm25,%zmm12,%zmm12 + + vpsrlq $26,%zmm14,%zmm28 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm28,%zmm15,%zmm15 + + + + + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + + + + + + vmovdqa32 128(%rcx),%zmm25 + movl $0x7777,%eax + kmovw %eax,%k1 + + vpermd %zmm16,%zmm25,%zmm16 + vpermd %zmm17,%zmm25,%zmm17 + vpermd %zmm18,%zmm25,%zmm18 + vpermd %zmm19,%zmm25,%zmm19 + vpermd %zmm20,%zmm25,%zmm20 + + vpermd %zmm11,%zmm25,%zmm16{%k1} + vpermd %zmm12,%zmm25,%zmm17{%k1} + vpermd %zmm13,%zmm25,%zmm18{%k1} + vpermd %zmm14,%zmm25,%zmm19{%k1} + vpermd %zmm15,%zmm25,%zmm20{%k1} + + vpslld $2,%zmm17,%zmm21 + vpslld $2,%zmm18,%zmm22 + vpslld $2,%zmm19,%zmm23 + vpslld $2,%zmm20,%zmm24 + vpaddd %zmm17,%zmm21,%zmm21 + vpaddd %zmm18,%zmm22,%zmm22 + vpaddd %zmm19,%zmm23,%zmm23 + vpaddd %zmm20,%zmm24,%zmm24 + + vpbroadcastq 32(%rcx),%zmm30 + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + vporq %zmm10,%zmm9,%zmm9 + vpsrlq $26,%zmm7,%zmm8 + vpsrlq $14,%zmm6,%zmm10 + vpsrlq $40,%zmm6,%zmm6 + vpandq %zmm5,%zmm9,%zmm9 + vpandq %zmm5,%zmm7,%zmm7 + + + + + vpaddq %zmm2,%zmm9,%zmm2 + subq $192,%rdx + jbe .Ltail_avx512 + jmp .Loop_avx512 + +.align 32 +.Loop_avx512: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + vpmuludq %zmm2,%zmm17,%zmm14 + vpaddq %zmm0,%zmm7,%zmm0 + vpmuludq %zmm2,%zmm18,%zmm15 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm24,%zmm12 + vporq %zmm30,%zmm6,%zmm6 + vpmuludq %zmm2,%zmm16,%zmm13 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu64 0(%rsi),%zmm10 + vmovdqu64 64(%rsi),%zmm6 + leaq 128(%rsi),%rsi + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq 
%zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vpunpcklqdq %zmm6,%zmm10,%zmm7 + vpunpckhqdq %zmm6,%zmm10,%zmm6 + + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + vpsrlq $52,%zmm7,%zmm9 + vpsllq $12,%zmm6,%zmm10 + + vpsrlq $26,%zmm14,%zmm3 + vpandq %zmm5,%zmm14,%zmm14 + vpaddq %zmm3,%zmm15,%zmm4 + + vporq %zmm10,%zmm9,%zmm9 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpandq %zmm5,%zmm9,%zmm9 + + vpsrlq $26,%zmm4,%zmm15 + vpandq %zmm5,%zmm4,%zmm4 + + vpsrlq $26,%zmm1,%zmm12 + vpandq %zmm5,%zmm1,%zmm1 + vpaddq %zmm12,%zmm2,%zmm2 + + vpaddq %zmm15,%zmm0,%zmm0 + vpsllq $2,%zmm15,%zmm15 + vpaddq %zmm15,%zmm0,%zmm0 + + vpaddq %zmm9,%zmm2,%zmm2 + vpsrlq $26,%zmm7,%zmm8 + + vpsrlq $26,%zmm2,%zmm13 + vpandq %zmm5,%zmm2,%zmm2 + vpaddq %zmm13,%zmm14,%zmm3 + + vpsrlq $14,%zmm6,%zmm10 + + vpsrlq $26,%zmm0,%zmm11 + vpandq %zmm5,%zmm0,%zmm0 + vpaddq %zmm11,%zmm1,%zmm1 + + vpsrlq $40,%zmm6,%zmm6 + + vpsrlq $26,%zmm3,%zmm14 + vpandq %zmm5,%zmm3,%zmm3 + vpaddq %zmm14,%zmm4,%zmm4 + + vpandq %zmm5,%zmm7,%zmm7 + + + + + subq $128,%rdx + ja .Loop_avx512 + +.Ltail_avx512: + + + + + + vpsrlq $32,%zmm16,%zmm16 + vpsrlq $32,%zmm17,%zmm17 + vpsrlq $32,%zmm18,%zmm18 + vpsrlq $32,%zmm23,%zmm23 + vpsrlq $32,%zmm24,%zmm24 + vpsrlq $32,%zmm19,%zmm19 + vpsrlq $32,%zmm20,%zmm20 + vpsrlq $32,%zmm21,%zmm21 + vpsrlq $32,%zmm22,%zmm22 + + + + leaq (%rsi,%rdx,1),%rsi + + + vpaddq %zmm0,%zmm7,%zmm0 + + vpmuludq %zmm2,%zmm17,%zmm14 + vpmuludq %zmm2,%zmm18,%zmm15 + vpmuludq %zmm2,%zmm23,%zmm11 + vpandq %zmm5,%zmm8,%zmm8 + vpmuludq %zmm2,%zmm24,%zmm12 + vpandq %zmm5,%zmm10,%zmm10 + vpmuludq %zmm2,%zmm16,%zmm13 + vporq %zmm30,%zmm6,%zmm6 + vpaddq %zmm1,%zmm8,%zmm1 + vpaddq %zmm3,%zmm10,%zmm3 + vpaddq %zmm4,%zmm6,%zmm4 + + vmovdqu 0(%rsi),%xmm7 + vpmuludq %zmm0,%zmm19,%zmm28 + vpmuludq %zmm0,%zmm20,%zmm29 + vpmuludq %zmm0,%zmm16,%zmm25 + vpmuludq %zmm0,%zmm17,%zmm26 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + + vmovdqu 16(%rsi),%xmm8 + vpmuludq %zmm1,%zmm18,%zmm28 + vpmuludq %zmm1,%zmm19,%zmm29 + vpmuludq %zmm1,%zmm24,%zmm25 + vpmuludq %zmm0,%zmm18,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vpmuludq %zmm3,%zmm16,%zmm28 + vpmuludq %zmm3,%zmm17,%zmm29 + vpmuludq %zmm1,%zmm16,%zmm26 + vpmuludq %zmm1,%zmm17,%zmm27 + vpaddq %zmm28,%zmm14,%zmm14 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vinserti128 
$1,48(%rsi),%ymm8,%ymm8 + vpmuludq %zmm4,%zmm24,%zmm28 + vpmuludq %zmm4,%zmm16,%zmm29 + vpmuludq %zmm3,%zmm22,%zmm25 + vpmuludq %zmm3,%zmm23,%zmm26 + vpmuludq %zmm3,%zmm24,%zmm27 + vpaddq %zmm28,%zmm14,%zmm3 + vpaddq %zmm29,%zmm15,%zmm15 + vpaddq %zmm25,%zmm11,%zmm11 + vpaddq %zmm26,%zmm12,%zmm12 + vpaddq %zmm27,%zmm13,%zmm13 + + vpmuludq %zmm4,%zmm21,%zmm25 + vpmuludq %zmm4,%zmm22,%zmm26 + vpmuludq %zmm4,%zmm23,%zmm27 + vpaddq %zmm25,%zmm11,%zmm0 + vpaddq %zmm26,%zmm12,%zmm1 + vpaddq %zmm27,%zmm13,%zmm2 + + + + + movl $1,%eax + vpermq $0xb1,%zmm3,%zmm14 + vpermq $0xb1,%zmm15,%zmm4 + vpermq $0xb1,%zmm0,%zmm11 + vpermq $0xb1,%zmm1,%zmm12 + vpermq $0xb1,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + kmovw %eax,%k3 + vpermq $0x2,%zmm3,%zmm14 + vpermq $0x2,%zmm4,%zmm15 + vpermq $0x2,%zmm0,%zmm11 + vpermq $0x2,%zmm1,%zmm12 + vpermq $0x2,%zmm2,%zmm13 + vpaddq %zmm14,%zmm3,%zmm3 + vpaddq %zmm15,%zmm4,%zmm4 + vpaddq %zmm11,%zmm0,%zmm0 + vpaddq %zmm12,%zmm1,%zmm1 + vpaddq %zmm13,%zmm2,%zmm2 + + vextracti64x4 $0x1,%zmm3,%ymm14 + vextracti64x4 $0x1,%zmm4,%ymm15 + vextracti64x4 $0x1,%zmm0,%ymm11 + vextracti64x4 $0x1,%zmm1,%ymm12 + vextracti64x4 $0x1,%zmm2,%ymm13 + vpaddq %zmm14,%zmm3,%zmm3{%k3}{z} + vpaddq %zmm15,%zmm4,%zmm4{%k3}{z} + vpaddq %zmm11,%zmm0,%zmm0{%k3}{z} + vpaddq %zmm12,%zmm1,%zmm1{%k3}{z} + vpaddq %zmm13,%zmm2,%zmm2{%k3}{z} + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm2,%ymm9,%ymm2 + vpand %ymm5,%ymm8,%ymm8 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + vpaddq %ymm14,%ymm4,%ymm4 + + leaq 144(%rsp),%rax + addq $64,%rdx + jnz .Ltail_avx2_512 + + vpsubq %ymm9,%ymm2,%ymm2 + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + vzeroall + leaq 8(%r11),%rsp +.cfi_def_cfa %rsp,8 + ret +.cfi_endproc +.size poly1305_blocks_avx512,.-poly1305_blocks_avx512 diff --git a/crypto/poly1305_x64_gas_macosx.s b/crypto/poly1305_x64_gas_macosx.s new file mode 100644 index 0000000..473b9f0 --- /dev/null +++ b/crypto/poly1305_x64_gas_macosx.s @@ -0,0 +1,1916 @@ +.p2align 6 +L$const: +L$mask24: +.long 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +L$129: +.long 16777216,0,16777216,0,16777216,0,16777216,0 +L$mask26: +.long 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +L$permd_avx2: +.long 2,2,2,3,2,0,2,1 +L$permd_avx512: +.long 0,0,0,1, 0,2,0,3, 0,4,0,5, 0,6,0,7 + +L$2_44_inp_permd: +.long 0,1,1,2,2,3,7,7 +L$2_44_inp_shift: +.quad 0,12,24,64 +L$2_44_mask: +.quad 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +L$2_44_shift_rgt: +.quad 44,44,42,64 
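+# Constant pool layout (a descriptive note; the values themselves are from
+# the code above/below): L$mask26 = 2^26-1 strips each limb in the 26-bit
+# vector paths, and L$129 = 2^24 is the per-block padding bit that gets
+# OR-ed into the top limb (bit 128 of the 130-bit block value lands at bit
+# 24 of limb 4, hence the "vpor 32(%rcx)" uses). The L$2_44_* entries hold
+# the masks and shift counts for the alternative base 2^44 representation
+# (44/44/42-bit limbs, masks 2^44-1 and 2^42-1).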
+L$2_44_shift_lft: +.quad 8,8,10,64 + +.p2align 6 +L$x_mask44: +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +.quad 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +L$x_mask42: +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff +.quad 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +.text + + +.global _poly1305_init_x86_64 +.global _poly1305_blocks_x86_64 +.global _poly1305_emit_x86_64 +.global _poly1305_emit_avx +.global _poly1305_blocks_avx +.global _poly1305_blocks_avx2 +.global _poly1305_blocks_avx512 + + + +.p2align 5 +_poly1305_init_x86_64: + xorq %rax,%rax + movq %rax,0(%rdi) + movq %rax,8(%rdi) + movq %rax,16(%rdi) + + cmpq $0,%rsi + je L$no_key + + + + movq $0x0ffffffc0fffffff,%rax + movq $0x0ffffffc0ffffffc,%rcx + andq 0(%rsi),%rax + andq 8(%rsi),%rcx + movq %rax,24(%rdi) + movq %rcx,32(%rdi) + movl $1,%eax +L$no_key: + ret + + + +.p2align 5 +_poly1305_blocks_x86_64: + +L$blocks: + shrq $4,%rdx + jz L$no_data + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movq 16(%rdi),%rbp + + movq %r13,%r12 + shrq $2,%r13 + movq %r12,%rax + addq %r12,%r13 + jmp L$oop + +.p2align 5 +L$oop: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + movq %r12,%rax + decq %r15 + jnz L$oop + + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rsp + +L$no_data: +L$blocks_epilogue: + ret + + + + +.p2align 5 +_poly1305_emit_x86_64: +L$emit: + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movq 16(%rdi),%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret + + +.p2align 5 +__poly1305_block: + mulq %r14 + movq %rax,%r9 + movq %r11,%rax + movq %rdx,%r10 + + mulq %r14 + movq %rax,%r14 + movq %r11,%rax + movq %rdx,%r8 + + mulq %rbx + addq %rax,%r9 + movq %r13,%rax + adcq %rdx,%r10 + + mulq %rbx + movq %rbp,%rbx + addq %rax,%r14 + adcq %rdx,%r8 + + imulq %r13,%rbx + addq %rbx,%r9 + movq %r8,%rbx + adcq $0,%r10 + + imulq %r11,%rbp + addq %r9,%rbx + movq $-4,%rax + adcq %rbp,%r10 + + andq %r10,%rax + movq %r10,%rbp + shrq $2,%r10 + andq $3,%rbp + addq %r10,%rax + addq %rax,%r14 + adcq $0,%rbx + adcq $0,%rbp + ret + + + +.p2align 5 +__poly1305_init_avx: + movq %r11,%r14 + movq %r12,%rbx + xorq %rbp,%rbp + + leaq 48+64(%rdi),%rdi + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + movq %r14,%r8 + andl %r14d,%eax + movq %r11,%r9 + andl %r11d,%edx + movl %eax,-64(%rdi) + shrq $26,%r8 + movl %edx,-60(%rdi) + shrq $26,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + andl %r8d,%eax + andl %r9d,%edx + movl 
%eax,-48(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-44(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,-32(%rdi) + shrq $26,%r8 + movl %edx,-28(%rdi) + shrq $26,%r9 + + movq %rbx,%rax + movq %r12,%rdx + shlq $12,%rax + shlq $12,%rdx + orq %r8,%rax + orq %r9,%rdx + andl $0x3ffffff,%eax + andl $0x3ffffff,%edx + movl %eax,-16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,-12(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,0(%rdi) + movq %rbx,%r8 + movl %edx,4(%rdi) + movq %r12,%r9 + + movl $0x3ffffff,%eax + movl $0x3ffffff,%edx + shrq $14,%r8 + shrq $14,%r9 + andl %r8d,%eax + andl %r9d,%edx + movl %eax,16(%rdi) + leal (%rax,%rax,4),%eax + movl %edx,20(%rdi) + leal (%rdx,%rdx,4),%edx + movl %eax,32(%rdi) + shrq $26,%r8 + movl %edx,36(%rdi) + shrq $26,%r9 + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,48(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r9d,52(%rdi) + leaq (%r9,%r9,4),%r9 + movl %r8d,64(%rdi) + movl %r9d,68(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-52(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-36(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-20(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-4(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,12(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,28(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,44(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,60(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,76(%rdi) + + movq %r12,%rax + call __poly1305_block + + movl $0x3ffffff,%eax + movq %r14,%r8 + andl %r14d,%eax + shrq $26,%r8 + movl %eax,-56(%rdi) + + movl $0x3ffffff,%edx + andl %r8d,%edx + movl %edx,-40(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,-24(%rdi) + + movq %rbx,%rax + shlq $12,%rax + orq %r8,%rax + andl $0x3ffffff,%eax + movl %eax,-8(%rdi) + leal (%rax,%rax,4),%eax + movq %rbx,%r8 + movl %eax,8(%rdi) + + movl $0x3ffffff,%edx + shrq $14,%r8 + andl %r8d,%edx + movl %edx,24(%rdi) + leal (%rdx,%rdx,4),%edx + shrq $26,%r8 + movl %edx,40(%rdi) + + movq %rbp,%rax + shlq $24,%rax + orq %rax,%r8 + movl %r8d,56(%rdi) + leaq (%r8,%r8,4),%r8 + movl %r8d,72(%rdi) + + leaq -48-64(%rdi),%rdi + ret + + + +.p2align 5 +_poly1305_blocks_avx: + + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae L$blocks_avx + testl %r8d,%r8d + jz L$blocks + +L$blocks_avx: + andq $-16,%rdx + jz L$no_data_avx + + vzeroupper + + testl %r8d,%r8d + jz L$base2_64_avx + + testq $31,%rdx + jz L$even_avx + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_avx_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + + call __poly1305_block + + testq %rcx,%rcx + jz L$store_base2_64_avx + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq 
%rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + subq $16,%r15 + jz L$store_base2_26_avx + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp L$proceed_avx + +.p2align 5 +L$store_base2_64_avx: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp L$done_avx + +.p2align 4 +L$store_base2_26_avx: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.p2align 4 +L$done_avx: + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rsp + +L$no_data_avx: +L$blocks_avx_epilogue: + ret + + +.p2align 5 +L$base2_64_avx: + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$base2_64_avx_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $31,%rdx + jz L$init_avx + + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + +L$init_avx: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +L$proceed_avx: + movq %r15,%rdx + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp + +L$base2_64_avx_epilogue: + jmp L$do_avx + + +.p2align 5 +L$even_avx: + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +L$do_avx: + leaq -88(%rsp),%r11 + + subq $0x178,%rsp + subq $64,%rdx + leaq -32(%rsi),%rax + cmovcq %rax,%rsi + + vmovdqu 48(%rdi),%xmm14 + leaq 112(%rdi),%rdi + leaq L$const(%rip),%rcx + + + + vmovdqu 32(%rsi),%xmm5 + vmovdqu 48(%rsi),%xmm6 + vmovdqa 64(%rcx),%xmm15 + + vpsrldq $6,%xmm5,%xmm7 + vpsrldq $6,%xmm6,%xmm8 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + vpsrlq $40,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + jbe L$skip_loop_avx + + + vmovdqu -48(%rdi),%xmm11 + vmovdqu -32(%rdi),%xmm12 + vpshufd $0xEE,%xmm14,%xmm13 + vpshufd $0x44,%xmm14,%xmm10 + vmovdqa %xmm13,-144(%r11) + vmovdqa %xmm10,0(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vmovdqu -16(%rdi),%xmm10 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-128(%r11) + vmovdqa %xmm11,16(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqu 0(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-112(%r11) + vmovdqa %xmm12,32(%rsp) + vpshufd $0xEE,%xmm10,%xmm14 + vmovdqu 16(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm14,-96(%r11) + vmovdqa %xmm10,48(%rsp) + vpshufd $0xEE,%xmm11,%xmm13 + vmovdqu 32(%rdi),%xmm10 + 
vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm13,-80(%r11) + vmovdqa %xmm11,64(%rsp) + vpshufd $0xEE,%xmm12,%xmm14 + vmovdqu 48(%rdi),%xmm11 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm14,-64(%r11) + vmovdqa %xmm12,80(%rsp) + vpshufd $0xEE,%xmm10,%xmm13 + vmovdqu 64(%rdi),%xmm12 + vpshufd $0x44,%xmm10,%xmm10 + vmovdqa %xmm13,-48(%r11) + vmovdqa %xmm10,96(%rsp) + vpshufd $0xEE,%xmm11,%xmm14 + vpshufd $0x44,%xmm11,%xmm11 + vmovdqa %xmm14,-32(%r11) + vmovdqa %xmm11,112(%rsp) + vpshufd $0xEE,%xmm12,%xmm13 + vmovdqa 0(%rsp),%xmm14 + vpshufd $0x44,%xmm12,%xmm12 + vmovdqa %xmm13,-16(%r11) + vmovdqa %xmm12,128(%rsp) + + jmp L$oop_avx + +.p2align 5 +L$oop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq %xmm5,%xmm14,%xmm10 + vpmuludq %xmm6,%xmm14,%xmm11 + vmovdqa %xmm2,32(%r11) + vpmuludq %xmm7,%xmm14,%xmm12 + vmovdqa 16(%rsp),%xmm2 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vmovdqa %xmm0,0(%r11) + vpmuludq 32(%rsp),%xmm9,%xmm0 + vmovdqa %xmm1,16(%r11) + vpmuludq %xmm8,%xmm2,%xmm1 + vpaddq %xmm0,%xmm10,%xmm10 + vpaddq %xmm1,%xmm14,%xmm14 + vmovdqa %xmm3,48(%r11) + vpmuludq %xmm7,%xmm2,%xmm0 + vpmuludq %xmm6,%xmm2,%xmm1 + vpaddq %xmm0,%xmm13,%xmm13 + vmovdqa 48(%rsp),%xmm3 + vpaddq %xmm1,%xmm12,%xmm12 + vmovdqa %xmm4,64(%r11) + vpmuludq %xmm5,%xmm2,%xmm2 + vpmuludq %xmm7,%xmm3,%xmm0 + vpaddq %xmm2,%xmm11,%xmm11 + + vmovdqa 64(%rsp),%xmm4 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm3,%xmm1 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm1,%xmm13,%xmm13 + vmovdqa 80(%rsp),%xmm2 + vpaddq %xmm3,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm4,%xmm0 + vpmuludq %xmm8,%xmm4,%xmm4 + vpaddq %xmm0,%xmm11,%xmm11 + vmovdqa 96(%rsp),%xmm3 + vpaddq %xmm4,%xmm10,%xmm10 + + vmovdqa 128(%rsp),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm1 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm1,%xmm14,%xmm14 + vpaddq %xmm2,%xmm13,%xmm13 + vpmuludq %xmm9,%xmm3,%xmm0 + vpmuludq %xmm8,%xmm3,%xmm1 + vpaddq %xmm0,%xmm12,%xmm12 + vmovdqu 0(%rsi),%xmm0 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm3,%xmm3 + vpmuludq %xmm7,%xmm4,%xmm7 + vpaddq %xmm3,%xmm10,%xmm10 + + vmovdqu 16(%rsi),%xmm1 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm8,%xmm4,%xmm8 + vpmuludq %xmm9,%xmm4,%xmm9 + vpsrldq $6,%xmm0,%xmm2 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm9,%xmm13,%xmm13 + vpsrldq $6,%xmm1,%xmm3 + vpmuludq 112(%rsp),%xmm5,%xmm9 + vpmuludq %xmm6,%xmm4,%xmm5 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpaddq %xmm9,%xmm14,%xmm14 + vmovdqa -144(%r11),%xmm9 + vpaddq %xmm5,%xmm10,%xmm10 + + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + + vpsrldq $5,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpand 0(%rcx),%xmm4,%xmm4 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + leaq 32(%rsi),%rax + leaq 64(%rsi),%rsi + subq $64,%rdx + cmovcq %rax,%rsi + + + + + + + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vmovdqa -128(%r11),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm5 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm5,%xmm12,%xmm12 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpmuludq -112(%r11),%xmm4,%xmm5 + vpaddq %xmm9,%xmm14,%xmm14 + + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm2,%xmm7,%xmm6 + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -96(%r11),%xmm8 + vpaddq 
%xmm5,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm7,%xmm6 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm6,%xmm12,%xmm12 + vpaddq %xmm7,%xmm11,%xmm11 + + vmovdqa -80(%r11),%xmm9 + vpmuludq %xmm2,%xmm8,%xmm5 + vpmuludq %xmm1,%xmm8,%xmm6 + vpaddq %xmm5,%xmm14,%xmm14 + vpaddq %xmm6,%xmm13,%xmm13 + vmovdqa -64(%r11),%xmm7 + vpmuludq %xmm0,%xmm8,%xmm8 + vpmuludq %xmm4,%xmm9,%xmm5 + vpaddq %xmm8,%xmm12,%xmm12 + vpaddq %xmm5,%xmm11,%xmm11 + vmovdqa -48(%r11),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm9 + vpmuludq %xmm1,%xmm7,%xmm6 + vpaddq %xmm9,%xmm10,%xmm10 + + vmovdqa -16(%r11),%xmm9 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm7,%xmm7 + vpmuludq %xmm4,%xmm8,%xmm5 + vpaddq %xmm7,%xmm13,%xmm13 + vpaddq %xmm5,%xmm12,%xmm12 + vmovdqu 32(%rsi),%xmm5 + vpmuludq %xmm3,%xmm8,%xmm7 + vpmuludq %xmm2,%xmm8,%xmm8 + vpaddq %xmm7,%xmm11,%xmm11 + vmovdqu 48(%rsi),%xmm6 + vpaddq %xmm8,%xmm10,%xmm10 + + vpmuludq %xmm2,%xmm9,%xmm2 + vpmuludq %xmm3,%xmm9,%xmm3 + vpsrldq $6,%xmm5,%xmm7 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm9,%xmm4 + vpsrldq $6,%xmm6,%xmm8 + vpaddq %xmm3,%xmm12,%xmm2 + vpaddq %xmm4,%xmm13,%xmm3 + vpmuludq -32(%r11),%xmm0,%xmm4 + vpmuludq %xmm1,%xmm9,%xmm0 + vpunpckhqdq %xmm6,%xmm5,%xmm9 + vpaddq %xmm4,%xmm14,%xmm4 + vpaddq %xmm0,%xmm10,%xmm0 + + vpunpcklqdq %xmm6,%xmm5,%xmm5 + vpunpcklqdq %xmm8,%xmm7,%xmm8 + + + vpsrldq $5,%xmm9,%xmm9 + vpsrlq $26,%xmm5,%xmm6 + vmovdqa 0(%rsp),%xmm14 + vpand %xmm15,%xmm5,%xmm5 + vpsrlq $4,%xmm8,%xmm7 + vpand %xmm15,%xmm6,%xmm6 + vpand 0(%rcx),%xmm9,%xmm9 + vpsrlq $30,%xmm8,%xmm8 + vpand %xmm15,%xmm7,%xmm7 + vpand %xmm15,%xmm8,%xmm8 + vpor 32(%rcx),%xmm9,%xmm9 + + + + + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm11,%xmm1 + + vpsrlq $26,%xmm4,%xmm10 + vpand %xmm15,%xmm4,%xmm4 + + vpsrlq $26,%xmm1,%xmm11 + vpand %xmm15,%xmm1,%xmm1 + vpaddq %xmm11,%xmm2,%xmm2 + + vpaddq %xmm10,%xmm0,%xmm0 + vpsllq $2,%xmm10,%xmm10 + vpaddq %xmm10,%xmm0,%xmm0 + + vpsrlq $26,%xmm2,%xmm12 + vpand %xmm15,%xmm2,%xmm2 + vpaddq %xmm12,%xmm3,%xmm3 + + vpsrlq $26,%xmm0,%xmm10 + vpand %xmm15,%xmm0,%xmm0 + vpaddq %xmm10,%xmm1,%xmm1 + + vpsrlq $26,%xmm3,%xmm13 + vpand %xmm15,%xmm3,%xmm3 + vpaddq %xmm13,%xmm4,%xmm4 + + ja L$oop_avx + +L$skip_loop_avx: + + + + vpshufd $0x10,%xmm14,%xmm14 + addq $32,%rdx + jnz L$ong_tail_avx + + vpaddq %xmm2,%xmm7,%xmm7 + vpaddq %xmm0,%xmm5,%xmm5 + vpaddq %xmm1,%xmm6,%xmm6 + vpaddq %xmm3,%xmm8,%xmm8 + vpaddq %xmm4,%xmm9,%xmm9 + +L$ong_tail_avx: + vmovdqa %xmm2,32(%r11) + vmovdqa %xmm0,0(%r11) + vmovdqa %xmm1,16(%r11) + vmovdqa %xmm3,48(%r11) + vmovdqa %xmm4,64(%r11) + + + + + + + + vpmuludq %xmm7,%xmm14,%xmm12 + vpmuludq %xmm5,%xmm14,%xmm10 + vpshufd $0x10,-48(%rdi),%xmm2 + vpmuludq %xmm6,%xmm14,%xmm11 + vpmuludq %xmm8,%xmm14,%xmm13 + vpmuludq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm8,%xmm2,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpshufd $0x10,-32(%rdi),%xmm3 + vpmuludq %xmm7,%xmm2,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpshufd $0x10,-16(%rdi),%xmm4 + vpmuludq %xmm6,%xmm2,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm11,%xmm11 + vpmuludq %xmm9,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + vpshufd $0x10,0(%rdi),%xmm2 + vpmuludq %xmm7,%xmm4,%xmm1 + vpaddq %xmm1,%xmm14,%xmm14 + vpmuludq %xmm6,%xmm4,%xmm0 + vpaddq %xmm0,%xmm13,%xmm13 + vpshufd $0x10,16(%rdi),%xmm3 + vpmuludq %xmm5,%xmm4,%xmm4 + vpaddq %xmm4,%xmm12,%xmm12 + vpmuludq %xmm9,%xmm2,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpshufd $0x10,32(%rdi),%xmm4 + vpmuludq %xmm8,%xmm2,%xmm2 
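+# Tail accumulation (descriptive note): each vpshufd $0x10 load here picks
+# one 26-bit limb of a stored power of r (the 5x multiples precomputed for
+# the 2^130-5 wraparound sit alongside), and the vpmuludq/vpaddq pairs fold
+# the 64-bit partial products into the five column sums xmm10..xmm14 before
+# the carry chain below reduces them back to 26-bit limbs.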
+ vpaddq %xmm2,%xmm10,%xmm10 + + vpmuludq %xmm6,%xmm3,%xmm0 + vpaddq %xmm0,%xmm14,%xmm14 + vpmuludq %xmm5,%xmm3,%xmm3 + vpaddq %xmm3,%xmm13,%xmm13 + vpshufd $0x10,48(%rdi),%xmm2 + vpmuludq %xmm9,%xmm4,%xmm1 + vpaddq %xmm1,%xmm12,%xmm12 + vpshufd $0x10,64(%rdi),%xmm3 + vpmuludq %xmm8,%xmm4,%xmm0 + vpaddq %xmm0,%xmm11,%xmm11 + vpmuludq %xmm7,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpmuludq %xmm5,%xmm2,%xmm2 + vpaddq %xmm2,%xmm14,%xmm14 + vpmuludq %xmm9,%xmm3,%xmm1 + vpaddq %xmm1,%xmm13,%xmm13 + vpmuludq %xmm8,%xmm3,%xmm0 + vpaddq %xmm0,%xmm12,%xmm12 + vpmuludq %xmm7,%xmm3,%xmm1 + vpaddq %xmm1,%xmm11,%xmm11 + vpmuludq %xmm6,%xmm3,%xmm3 + vpaddq %xmm3,%xmm10,%xmm10 + + jz L$short_tail_avx + + vmovdqu 0(%rsi),%xmm0 + vmovdqu 16(%rsi),%xmm1 + + vpsrldq $6,%xmm0,%xmm2 + vpsrldq $6,%xmm1,%xmm3 + vpunpckhqdq %xmm1,%xmm0,%xmm4 + vpunpcklqdq %xmm1,%xmm0,%xmm0 + vpunpcklqdq %xmm3,%xmm2,%xmm3 + + vpsrlq $40,%xmm4,%xmm4 + vpsrlq $26,%xmm0,%xmm1 + vpand %xmm15,%xmm0,%xmm0 + vpsrlq $4,%xmm3,%xmm2 + vpand %xmm15,%xmm1,%xmm1 + vpsrlq $30,%xmm3,%xmm3 + vpand %xmm15,%xmm2,%xmm2 + vpand %xmm15,%xmm3,%xmm3 + vpor 32(%rcx),%xmm4,%xmm4 + + vpshufd $0x32,-64(%rdi),%xmm9 + vpaddq 0(%r11),%xmm0,%xmm0 + vpaddq 16(%r11),%xmm1,%xmm1 + vpaddq 32(%r11),%xmm2,%xmm2 + vpaddq 48(%r11),%xmm3,%xmm3 + vpaddq 64(%r11),%xmm4,%xmm4 + + + + + vpmuludq %xmm0,%xmm9,%xmm5 + vpaddq %xmm5,%xmm10,%xmm10 + vpmuludq %xmm1,%xmm9,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpshufd $0x32,-48(%rdi),%xmm7 + vpmuludq %xmm3,%xmm9,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm4,%xmm9,%xmm9 + vpaddq %xmm9,%xmm14,%xmm14 + + vpmuludq %xmm3,%xmm7,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpshufd $0x32,-32(%rdi),%xmm8 + vpmuludq %xmm2,%xmm7,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpshufd $0x32,-16(%rdi),%xmm9 + vpmuludq %xmm1,%xmm7,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm11,%xmm11 + vpmuludq %xmm4,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + + vpshufd $0x32,0(%rdi),%xmm7 + vpmuludq %xmm2,%xmm9,%xmm6 + vpaddq %xmm6,%xmm14,%xmm14 + vpmuludq %xmm1,%xmm9,%xmm5 + vpaddq %xmm5,%xmm13,%xmm13 + vpshufd $0x32,16(%rdi),%xmm8 + vpmuludq %xmm0,%xmm9,%xmm9 + vpaddq %xmm9,%xmm12,%xmm12 + vpmuludq %xmm4,%xmm7,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpshufd $0x32,32(%rdi),%xmm9 + vpmuludq %xmm3,%xmm7,%xmm7 + vpaddq %xmm7,%xmm10,%xmm10 + + vpmuludq %xmm1,%xmm8,%xmm5 + vpaddq %xmm5,%xmm14,%xmm14 + vpmuludq %xmm0,%xmm8,%xmm8 + vpaddq %xmm8,%xmm13,%xmm13 + vpshufd $0x32,48(%rdi),%xmm7 + vpmuludq %xmm4,%xmm9,%xmm6 + vpaddq %xmm6,%xmm12,%xmm12 + vpshufd $0x32,64(%rdi),%xmm8 + vpmuludq %xmm3,%xmm9,%xmm5 + vpaddq %xmm5,%xmm11,%xmm11 + vpmuludq %xmm2,%xmm9,%xmm9 + vpaddq %xmm9,%xmm10,%xmm10 + + vpmuludq %xmm0,%xmm7,%xmm7 + vpaddq %xmm7,%xmm14,%xmm14 + vpmuludq %xmm4,%xmm8,%xmm6 + vpaddq %xmm6,%xmm13,%xmm13 + vpmuludq %xmm3,%xmm8,%xmm5 + vpaddq %xmm5,%xmm12,%xmm12 + vpmuludq %xmm2,%xmm8,%xmm6 + vpaddq %xmm6,%xmm11,%xmm11 + vpmuludq %xmm1,%xmm8,%xmm8 + vpaddq %xmm8,%xmm10,%xmm10 + +L$short_tail_avx: + + + + vpsrldq $8,%xmm14,%xmm9 + vpsrldq $8,%xmm13,%xmm8 + vpsrldq $8,%xmm11,%xmm6 + vpsrldq $8,%xmm10,%xmm5 + vpsrldq $8,%xmm12,%xmm7 + vpaddq %xmm8,%xmm13,%xmm13 + vpaddq %xmm9,%xmm14,%xmm14 + vpaddq %xmm5,%xmm10,%xmm10 + vpaddq %xmm6,%xmm11,%xmm11 + vpaddq %xmm7,%xmm12,%xmm12 + + + + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm14,%xmm4 + vpand 
%xmm15,%xmm14,%xmm14 + + vpsrlq $26,%xmm11,%xmm1 + vpand %xmm15,%xmm11,%xmm11 + vpaddq %xmm1,%xmm12,%xmm12 + + vpaddq %xmm4,%xmm10,%xmm10 + vpsllq $2,%xmm4,%xmm4 + vpaddq %xmm4,%xmm10,%xmm10 + + vpsrlq $26,%xmm12,%xmm2 + vpand %xmm15,%xmm12,%xmm12 + vpaddq %xmm2,%xmm13,%xmm13 + + vpsrlq $26,%xmm10,%xmm0 + vpand %xmm15,%xmm10,%xmm10 + vpaddq %xmm0,%xmm11,%xmm11 + + vpsrlq $26,%xmm13,%xmm3 + vpand %xmm15,%xmm13,%xmm13 + vpaddq %xmm3,%xmm14,%xmm14 + + vmovd %xmm10,-112(%rdi) + vmovd %xmm11,-108(%rdi) + vmovd %xmm12,-104(%rdi) + vmovd %xmm13,-100(%rdi) + vmovd %xmm14,-96(%rdi) + leaq 88(%r11),%rsp + + vzeroupper + ret + + + + +.p2align 5 +_poly1305_emit_avx: + cmpl $0,20(%rdi) + je L$emit + + movl 0(%rdi),%eax + movl 4(%rdi),%ecx + movl 8(%rdi),%r8d + movl 12(%rdi),%r11d + movl 16(%rdi),%r10d + + shlq $26,%rcx + movq %r8,%r9 + shlq $52,%r8 + addq %rcx,%rax + shrq $12,%r9 + addq %rax,%r8 + adcq $0,%r9 + + shlq $14,%r11 + movq %r10,%rax + shrq $24,%r10 + addq %r11,%r9 + shlq $40,%rax + addq %rax,%r9 + adcq $0,%r10 + + movq %r10,%rax + movq %r10,%rcx + andq $3,%r10 + shrq $2,%rax + andq $-4,%rcx + addq %rcx,%rax + addq %rax,%r8 + adcq $0,%r9 + adcq $0,%r10 + + movq %r8,%rax + addq $5,%r8 + movq %r9,%rcx + adcq $0,%r9 + adcq $0,%r10 + shrq $2,%r10 + cmovnzq %r8,%rax + cmovnzq %r9,%rcx + + addq 0(%rdx),%rax + adcq 8(%rdx),%rcx + movq %rax,0(%rsi) + movq %rcx,8(%rsi) + + ret + + +.p2align 5 +_poly1305_blocks_avx2: + + movl 20(%rdi),%r8d + cmpq $128,%rdx + jae L$blocks_avx2 + testl %r8d,%r8d + jz L$blocks + +L$blocks_avx2: + andq $-16,%rdx + jz L$no_data_avx2 + + vzeroupper + + testl %r8d,%r8d + jz L$base2_64_avx2 + + testq $63,%rdx + jz L$even_avx2 + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$blocks_avx2_body: + + movq %rdx,%r15 + + movq 0(%rdi),%r8 + movq 8(%rdi),%r9 + movl 16(%rdi),%ebp + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + + movl %r8d,%r14d + andq $-2147483648,%r8 + movq %r9,%r12 + movl %r9d,%ebx + andq $-2147483648,%r9 + + shrq $6,%r8 + shlq $52,%r12 + addq %r8,%r14 + shrq $12,%rbx + shrq $18,%r9 + addq %r12,%r14 + adcq %r9,%rbx + + movq %rbp,%r8 + shlq $40,%r8 + shrq $24,%rbp + addq %r8,%rbx + adcq $0,%rbp + + movq $-4,%r9 + movq %rbp,%r8 + andq %rbp,%r9 + shrq $2,%r8 + andq $3,%rbp + addq %r9,%r8 + addq %r8,%r14 + adcq $0,%rbx + adcq $0,%rbp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + +L$base2_26_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz L$base2_26_pre_avx2 + + testq %rcx,%rcx + jz L$store_base2_64_avx2 + + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r11 + movq %rbx,%r12 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r11 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r11,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r12 + andq $0x3ffffff,%rbx + orq %r12,%rbp + + testq %r15,%r15 + jz L$store_base2_26_avx2 + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + jmp L$proceed_avx2 + +.p2align 5 +L$store_base2_64_avx2: + movq %r14,0(%rdi) + movq %rbx,8(%rdi) + movq %rbp,16(%rdi) + jmp L$done_avx2 + +.p2align 4 +L$store_base2_26_avx2: + movl %eax,0(%rdi) + movl %edx,4(%rdi) + movl %r14d,8(%rdi) + movl %ebx,12(%rdi) + movl %ebp,16(%rdi) +.p2align 4 +L$done_avx2: + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 
48(%rsp),%rsp + +L$no_data_avx2: +L$blocks_avx2_epilogue: + ret + + +.p2align 5 +L$base2_64_avx2: + + pushq %rbx + + pushq %rbp + + pushq %r12 + + pushq %r13 + + pushq %r14 + + pushq %r15 + +L$base2_64_avx2_body: + + movq %rdx,%r15 + + movq 24(%rdi),%r11 + movq 32(%rdi),%r13 + + movq 0(%rdi),%r14 + movq 8(%rdi),%rbx + movl 16(%rdi),%ebp + + movq %r13,%r12 + movq %r13,%rax + shrq $2,%r13 + addq %r12,%r13 + + testq $63,%rdx + jz L$init_avx2 + +L$base2_64_pre_avx2: + addq 0(%rsi),%r14 + adcq 8(%rsi),%rbx + leaq 16(%rsi),%rsi + adcq %rcx,%rbp + subq $16,%r15 + + call __poly1305_block + movq %r12,%rax + + testq $63,%r15 + jnz L$base2_64_pre_avx2 + +L$init_avx2: + + movq %r14,%rax + movq %r14,%rdx + shrq $52,%r14 + movq %rbx,%r8 + movq %rbx,%r9 + shrq $26,%rdx + andq $0x3ffffff,%rax + shlq $12,%r8 + andq $0x3ffffff,%rdx + shrq $14,%rbx + orq %r8,%r14 + shlq $24,%rbp + andq $0x3ffffff,%r14 + shrq $40,%r9 + andq $0x3ffffff,%rbx + orq %r9,%rbp + + vmovd %eax,%xmm0 + vmovd %edx,%xmm1 + vmovd %r14d,%xmm2 + vmovd %ebx,%xmm3 + vmovd %ebp,%xmm4 + movl $1,20(%rdi) + + call __poly1305_init_avx + +L$proceed_avx2: + movq %r15,%rdx + + + + movq 0(%rsp),%r15 + + movq 8(%rsp),%r14 + + movq 16(%rsp),%r13 + + movq 24(%rsp),%r12 + + movq 32(%rsp),%rbp + + movq 40(%rsp),%rbx + + leaq 48(%rsp),%rax + leaq 48(%rsp),%rsp + +L$base2_64_avx2_epilogue: + jmp L$do_avx2 + + +.p2align 5 +L$even_avx2: + + + vmovd 0(%rdi),%xmm0 + vmovd 4(%rdi),%xmm1 + vmovd 8(%rdi),%xmm2 + vmovd 12(%rdi),%xmm3 + vmovd 16(%rdi),%xmm4 + +L$do_avx2: + leaq -8(%rsp),%r11 + + subq $0x128,%rsp + leaq L$const(%rip),%rcx + leaq 48+64(%rdi),%rdi + vmovdqa 96(%rcx),%ymm7 + + + vmovdqu -64(%rdi),%xmm9 + andq $-512,%rsp + vmovdqu -48(%rdi),%xmm10 + vmovdqu -32(%rdi),%xmm6 + vmovdqu -16(%rdi),%xmm11 + vmovdqu 0(%rdi),%xmm12 + vmovdqu 16(%rdi),%xmm13 + leaq 144(%rsp),%rax + vmovdqu 32(%rdi),%xmm14 + vpermd %ymm9,%ymm7,%ymm9 + vmovdqu 48(%rdi),%xmm15 + vpermd %ymm10,%ymm7,%ymm10 + vmovdqu 64(%rdi),%xmm5 + vpermd %ymm6,%ymm7,%ymm6 + vmovdqa %ymm9,0(%rsp) + vpermd %ymm11,%ymm7,%ymm11 + vmovdqa %ymm10,32-144(%rax) + vpermd %ymm12,%ymm7,%ymm12 + vmovdqa %ymm6,64-144(%rax) + vpermd %ymm13,%ymm7,%ymm13 + vmovdqa %ymm11,96-144(%rax) + vpermd %ymm14,%ymm7,%ymm14 + vmovdqa %ymm12,128-144(%rax) + vpermd %ymm15,%ymm7,%ymm15 + vmovdqa %ymm13,160-144(%rax) + vpermd %ymm5,%ymm7,%ymm5 + vmovdqa %ymm14,192-144(%rax) + vmovdqa %ymm15,224-144(%rax) + vmovdqa %ymm5,256-144(%rax) + vmovdqa 64(%rcx),%ymm5 + + + + vmovdqu 0(%rsi),%xmm7 + vmovdqu 16(%rsi),%xmm8 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpsrldq $6,%ymm7,%ymm9 + vpsrldq $6,%ymm8,%ymm10 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + vpunpcklqdq %ymm10,%ymm9,%ymm9 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + + vpsrlq $30,%ymm9,%ymm10 + vpsrlq $4,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + vpsrlq $40,%ymm6,%ymm6 + vpand %ymm5,%ymm9,%ymm9 + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + vpaddq %ymm2,%ymm9,%ymm2 + subq $64,%rdx + jz L$tail_avx2 + jmp L$oop_avx2 + +.p2align 5 +L$oop_avx2: + + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqa 0(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqa 32(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqa 96(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqa 48(%rax),%ymm10 + vmovdqa 112(%rax),%ymm5 + + + + + + + + + + + + + + + + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq 
%ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 64(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + vmovdqa -16(%rax),%ymm8 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vmovdqu 0(%rsi),%xmm7 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + vinserti128 $1,32(%rsi),%ymm7,%ymm7 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vmovdqu 16(%rsi),%xmm8 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqa 16(%rax),%ymm2 + vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + vinserti128 $1,48(%rsi),%ymm8,%ymm8 + leaq 64(%rsi),%rsi + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpsrldq $6,%ymm7,%ymm9 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpsrldq $6,%ymm8,%ymm10 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpunpckhqdq %ymm8,%ymm7,%ymm6 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpunpcklqdq %ymm8,%ymm7,%ymm7 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpunpcklqdq %ymm10,%ymm9,%ymm10 + vpmuludq 80(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $4,%ymm10,%ymm9 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpand %ymm5,%ymm9,%ymm9 + vpsrlq $26,%ymm7,%ymm8 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpaddq %ymm9,%ymm2,%ymm2 + vpsrlq $30,%ymm10,%ymm10 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $40,%ymm6,%ymm6 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpand %ymm5,%ymm7,%ymm7 + vpand %ymm5,%ymm8,%ymm8 + vpand %ymm5,%ymm10,%ymm10 + vpor 32(%rcx),%ymm6,%ymm6 + + subq $64,%rdx + jnz L$oop_avx2 + +.byte 0x66,0x90 +L$tail_avx2: + + + + + + + + vpaddq %ymm0,%ymm7,%ymm0 + vmovdqu 4(%rsp),%ymm7 + vpaddq %ymm1,%ymm8,%ymm1 + vmovdqu 36(%rsp),%ymm8 + vpaddq %ymm3,%ymm10,%ymm3 + vmovdqu 100(%rsp),%ymm9 + vpaddq %ymm4,%ymm6,%ymm4 + vmovdqu 52(%rax),%ymm10 + vmovdqu 116(%rax),%ymm5 + + vpmuludq %ymm2,%ymm7,%ymm13 + vpmuludq %ymm2,%ymm8,%ymm14 + vpmuludq %ymm2,%ymm9,%ymm15 + vpmuludq %ymm2,%ymm10,%ymm11 + vpmuludq %ymm2,%ymm5,%ymm12 + + vpmuludq %ymm0,%ymm8,%ymm6 + vpmuludq %ymm1,%ymm8,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq 68(%rsp),%ymm4,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm11,%ymm11 + + vpmuludq %ymm0,%ymm7,%ymm6 + vpmuludq %ymm1,%ymm7,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vmovdqu -12(%rax),%ymm8 + vpaddq %ymm2,%ymm12,%ymm12 + vpmuludq %ymm3,%ymm7,%ymm6 + vpmuludq %ymm4,%ymm7,%ymm2 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm2,%ymm15,%ymm15 + + vpmuludq %ymm3,%ymm8,%ymm6 + vpmuludq %ymm4,%ymm8,%ymm2 + vpaddq %ymm6,%ymm11,%ymm11 + vpaddq %ymm2,%ymm12,%ymm12 + vmovdqu 20(%rax),%ymm2 + 
vpmuludq %ymm1,%ymm9,%ymm6 + vpmuludq %ymm0,%ymm9,%ymm9 + vpaddq %ymm6,%ymm14,%ymm14 + vpaddq %ymm9,%ymm13,%ymm13 + + vpmuludq %ymm1,%ymm2,%ymm6 + vpmuludq %ymm0,%ymm2,%ymm2 + vpaddq %ymm6,%ymm15,%ymm15 + vpaddq %ymm2,%ymm14,%ymm14 + vpmuludq %ymm3,%ymm10,%ymm6 + vpmuludq %ymm4,%ymm10,%ymm2 + vpaddq %ymm6,%ymm12,%ymm12 + vpaddq %ymm2,%ymm13,%ymm13 + + vpmuludq %ymm3,%ymm5,%ymm3 + vpmuludq %ymm4,%ymm5,%ymm4 + vpaddq %ymm3,%ymm13,%ymm2 + vpaddq %ymm4,%ymm14,%ymm3 + vpmuludq 84(%rax),%ymm0,%ymm4 + vpmuludq %ymm1,%ymm5,%ymm0 + vmovdqa 64(%rcx),%ymm5 + vpaddq %ymm4,%ymm15,%ymm4 + vpaddq %ymm0,%ymm11,%ymm0 + + + + + vpsrldq $8,%ymm12,%ymm8 + vpsrldq $8,%ymm2,%ymm9 + vpsrldq $8,%ymm3,%ymm10 + vpsrldq $8,%ymm4,%ymm6 + vpsrldq $8,%ymm0,%ymm7 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + + vpermq $0x2,%ymm3,%ymm10 + vpermq $0x2,%ymm4,%ymm6 + vpermq $0x2,%ymm0,%ymm7 + vpermq $0x2,%ymm12,%ymm8 + vpermq $0x2,%ymm2,%ymm9 + vpaddq %ymm10,%ymm3,%ymm3 + vpaddq %ymm6,%ymm4,%ymm4 + vpaddq %ymm7,%ymm0,%ymm0 + vpaddq %ymm8,%ymm12,%ymm12 + vpaddq %ymm9,%ymm2,%ymm2 + + + + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm12,%ymm1 + + vpsrlq $26,%ymm4,%ymm15 + vpand %ymm5,%ymm4,%ymm4 + + vpsrlq $26,%ymm1,%ymm12 + vpand %ymm5,%ymm1,%ymm1 + vpaddq %ymm12,%ymm2,%ymm2 + + vpaddq %ymm15,%ymm0,%ymm0 + vpsllq $2,%ymm15,%ymm15 + vpaddq %ymm15,%ymm0,%ymm0 + + vpsrlq $26,%ymm2,%ymm13 + vpand %ymm5,%ymm2,%ymm2 + vpaddq %ymm13,%ymm3,%ymm3 + + vpsrlq $26,%ymm0,%ymm11 + vpand %ymm5,%ymm0,%ymm0 + vpaddq %ymm11,%ymm1,%ymm1 + + vpsrlq $26,%ymm3,%ymm14 + vpand %ymm5,%ymm3,%ymm3 + vpaddq %ymm14,%ymm4,%ymm4 + + vmovd %xmm0,-112(%rdi) + vmovd %xmm1,-108(%rdi) + vmovd %xmm2,-104(%rdi) + vmovd %xmm3,-100(%rdi) + vmovd %xmm4,-96(%rdi) + leaq 8(%r11),%rsp + + vzeroupper + ret + + diff --git a/crypto/poly1305_x64_nasm.asm b/crypto/poly1305_x64_nasm.asm new file mode 100644 index 0000000..4f9d9f5 --- /dev/null +++ b/crypto/poly1305_x64_nasm.asm @@ -0,0 +1,3487 @@ +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +ALIGN 64 +$L$const: +$L$mask24: + DD 0x0ffffff,0,0x0ffffff,0,0x0ffffff,0,0x0ffffff,0 +$L$129: + DD 16777216,0,16777216,0,16777216,0,16777216,0 +$L$mask26: + DD 0x3ffffff,0,0x3ffffff,0,0x3ffffff,0,0x3ffffff,0 +$L$permd_avx2: + DD 2,2,2,3,2,0,2,1 +$L$permd_avx512: + DD 0,0,0,1,0,2,0,3,0,4,0,5,0,6,0,7 + +$L$2_44_inp_permd: + DD 0,1,1,2,2,3,7,7 +$L$2_44_inp_shift: + DQ 0,12,24,64 +$L$2_44_mask: + DQ 0xfffffffffff,0xfffffffffff,0x3ffffffffff,0xffffffffffffffff +$L$2_44_shift_rgt: + DQ 44,44,42,64 +$L$2_44_shift_lft: + DQ 8,8,10,64 + +ALIGN 64 +$L$x_mask44: + DQ 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff + DQ 0xfffffffffff,0xfffffffffff,0xfffffffffff,0xfffffffffff +$L$x_mask42: + DQ 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + DQ 0x3ffffffffff,0x3ffffffffff,0x3ffffffffff,0x3ffffffffff + +section .text code align=64 + + + +global poly1305_init_x86_64 +global poly1305_blocks_x86_64 +global poly1305_emit_x86_64 +global poly1305_emit_avx +global poly1305_blocks_avx +global poly1305_blocks_avx2 +global poly1305_blocks_avx512 + + + +ALIGN 32 +poly1305_init_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_init_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + + xor rax,rax + mov QWORD[rdi],rax + mov QWORD[8+rdi],rax + mov QWORD[16+rdi],rax + + cmp 
rsi,0 + je NEAR $L$no_key + + + + mov rax,0x0ffffffc0fffffff + mov rcx,0x0ffffffc0ffffffc + and rax,QWORD[rsi] + and rcx,QWORD[8+rsi] + mov QWORD[24+rdi],rax + mov QWORD[32+rdi],rcx + mov eax,1 +$L$no_key: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_init_x86_64: + + +ALIGN 32 +poly1305_blocks_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + +$L$blocks: + shr rdx,4 + jz NEAR $L$no_data + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov rbp,QWORD[16+rdi] + + mov r12,r13 + shr r13,2 + mov rax,r12 + add r13,r12 + jmp NEAR $L$oop + +ALIGN 32 +$L$oop: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + mul r14 + mov r9,rax + mov rax,r11 + mov r10,rdx + + mul r14 + mov r14,rax + mov rax,r11 + mov r8,rdx + + mul rbx + add r9,rax + mov rax,r13 + adc r10,rdx + + mul rbx + mov rbx,rbp + add r14,rax + adc r8,rdx + + imul rbx,r13 + add r9,rbx + mov rbx,r8 + adc r10,0 + + imul rbp,r11 + add rbx,r9 + mov rax,-4 + adc r10,rbp + + and rax,r10 + mov rbp,r10 + shr r10,2 + and rbp,3 + add rax,r10 + add r14,rax + adc rbx,0 + adc rbp,0 + mov rax,r12 + dec r15 + jnz NEAR $L$oop + + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data: +$L$blocks_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_x86_64: + + +ALIGN 32 +poly1305_emit_x86_64: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_emit_x86_64: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + +$L$emit: + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov r10,QWORD[16+rdi] + + mov rax,r8 + add r8,5 + mov rcx,r9 + adc r9,0 + adc r10,0 + shr r10,2 + cmovnz rax,r8 + cmovnz rcx,r9 + + add rax,QWORD[rdx] + adc rcx,QWORD[8+rdx] + mov QWORD[rsi],rax + mov QWORD[8+rsi],rcx + + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_emit_x86_64: + +ALIGN 32 +__poly1305_block: + mul r14 + mov r9,rax + mov rax,r11 + mov r10,rdx + + mul r14 + mov r14,rax + mov rax,r11 + mov r8,rdx + + mul rbx + add r9,rax + mov rax,r13 + adc r10,rdx + + mul rbx + mov rbx,rbp + add r14,rax + adc r8,rdx + + imul rbx,r13 + add r9,rbx + mov rbx,r8 + adc r10,0 + + imul rbp,r11 + add rbx,r9 + mov rax,-4 + adc r10,rbp + + and rax,r10 + mov rbp,r10 + shr r10,2 + and rbp,3 + add rax,r10 + add r14,rax + adc rbx,0 + adc rbp,0 + DB 0F3h,0C3h ;repret + + + +ALIGN 32 +__poly1305_init_avx: + mov r14,r11 + mov rbx,r12 + xor rbp,rbp + + lea rdi,[((48+64))+rdi] + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov edx,0x3ffffff + mov r8,r14 + and eax,r14d + mov r9,r11 + and edx,r11d + mov DWORD[((-64))+rdi],eax + shr r8,26 + mov DWORD[((-60))+rdi],edx + shr r9,26 + + mov eax,0x3ffffff + mov edx,0x3ffffff + and eax,r8d + and edx,r9d + mov DWORD[((-48))+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[((-44))+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[((-32))+rdi],eax + shr r8,26 + mov DWORD[((-28))+rdi],edx + shr r9,26 + + mov rax,rbx + mov rdx,r12 + shl 
rax,12 + shl rdx,12 + or rax,r8 + or rdx,r9 + and eax,0x3ffffff + and edx,0x3ffffff + mov DWORD[((-16))+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[((-12))+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[rdi],eax + mov r8,rbx + mov DWORD[4+rdi],edx + mov r9,r12 + + mov eax,0x3ffffff + mov edx,0x3ffffff + shr r8,14 + shr r9,14 + and eax,r8d + and edx,r9d + mov DWORD[16+rdi],eax + lea eax,[rax*4+rax] + mov DWORD[20+rdi],edx + lea edx,[rdx*4+rdx] + mov DWORD[32+rdi],eax + shr r8,26 + mov DWORD[36+rdi],edx + shr r9,26 + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[48+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[52+rdi],r9d + lea r9,[r9*4+r9] + mov DWORD[64+rdi],r8d + mov DWORD[68+rdi],r9d + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov r8,r14 + and eax,r14d + shr r8,26 + mov DWORD[((-52))+rdi],eax + + mov edx,0x3ffffff + and edx,r8d + mov DWORD[((-36))+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[((-20))+rdi],edx + + mov rax,rbx + shl rax,12 + or rax,r8 + and eax,0x3ffffff + mov DWORD[((-4))+rdi],eax + lea eax,[rax*4+rax] + mov r8,rbx + mov DWORD[12+rdi],eax + + mov edx,0x3ffffff + shr r8,14 + and edx,r8d + mov DWORD[28+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[44+rdi],edx + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[60+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[76+rdi],r8d + + mov rax,r12 + call __poly1305_block + + mov eax,0x3ffffff + mov r8,r14 + and eax,r14d + shr r8,26 + mov DWORD[((-56))+rdi],eax + + mov edx,0x3ffffff + and edx,r8d + mov DWORD[((-40))+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[((-24))+rdi],edx + + mov rax,rbx + shl rax,12 + or rax,r8 + and eax,0x3ffffff + mov DWORD[((-8))+rdi],eax + lea eax,[rax*4+rax] + mov r8,rbx + mov DWORD[8+rdi],eax + + mov edx,0x3ffffff + shr r8,14 + and edx,r8d + mov DWORD[24+rdi],edx + lea edx,[rdx*4+rdx] + shr r8,26 + mov DWORD[40+rdi],edx + + mov rax,rbp + shl rax,24 + or r8,rax + mov DWORD[56+rdi],r8d + lea r8,[r8*4+r8] + mov DWORD[72+rdi],r8d + + lea rdi,[((-48-64))+rdi] + DB 0F3h,0C3h ;repret + + + +ALIGN 32 +poly1305_blocks_avx: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx: + and rdx,-16 + jz NEAR $L$no_data_avx + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx + + test rdx,31 + jz NEAR $L$even_avx + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx_body: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + + call __poly1305_block + + test rcx,rcx + jz NEAR $L$store_base2_64_avx + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + sub r15,16 + 
jz NEAR $L$store_base2_26_avx + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx + +ALIGN 32 +$L$store_base2_64_avx: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx + +ALIGN 16 +$L$store_base2_26_avx: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx: +$L$blocks_avx_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,31 + jz NEAR $L$init_avx + + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + +$L$init_avx: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx: + mov rdx,r15 + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx_epilogue: + jmp NEAR $L$do_avx + + +ALIGN 32 +$L$even_avx: + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + +$L$do_avx: + lea r11,[((-248))+rsp] + sub rsp,0x218 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx_body: + sub rdx,64 + lea rax,[((-32))+rsi] + cmovc rsi,rax + + vmovdqu xmm14,XMMWORD[48+rdi] + lea rdi,[112+rdi] + lea rcx,[$L$const] + + + + vmovdqu xmm5,XMMWORD[32+rsi] + vmovdqu xmm6,XMMWORD[48+rsi] + vmovdqa xmm15,XMMWORD[64+rcx] + + vpsrldq xmm7,xmm5,6 + vpsrldq xmm8,xmm6,6 + vpunpckhqdq xmm9,xmm5,xmm6 + vpunpcklqdq xmm5,xmm5,xmm6 + vpunpcklqdq xmm8,xmm7,xmm8 + + vpsrlq xmm9,xmm9,40 + vpsrlq xmm6,xmm5,26 + vpand xmm5,xmm5,xmm15 + vpsrlq xmm7,xmm8,4 + vpand xmm6,xmm6,xmm15 + vpsrlq xmm8,xmm8,30 + vpand xmm7,xmm7,xmm15 + vpand xmm8,xmm8,xmm15 + vpor xmm9,xmm9,XMMWORD[32+rcx] + + jbe NEAR $L$skip_loop_avx + + + vmovdqu xmm11,XMMWORD[((-48))+rdi] + vmovdqu xmm12,XMMWORD[((-32))+rdi] + vpshufd xmm13,xmm14,0xEE + vpshufd xmm10,xmm14,0x44 + vmovdqa XMMWORD[(-144)+r11],xmm13 + vmovdqa XMMWORD[rsp],xmm10 + vpshufd xmm14,xmm11,0xEE + vmovdqu xmm10,XMMWORD[((-16))+rdi] + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-128)+r11],xmm14 + vmovdqa XMMWORD[16+rsp],xmm11 + vpshufd xmm13,xmm12,0xEE + vmovdqu xmm11,XMMWORD[rdi] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-112)+r11],xmm13 + vmovdqa 
XMMWORD[32+rsp],xmm12 + vpshufd xmm14,xmm10,0xEE + vmovdqu xmm12,XMMWORD[16+rdi] + vpshufd xmm10,xmm10,0x44 + vmovdqa XMMWORD[(-96)+r11],xmm14 + vmovdqa XMMWORD[48+rsp],xmm10 + vpshufd xmm13,xmm11,0xEE + vmovdqu xmm10,XMMWORD[32+rdi] + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-80)+r11],xmm13 + vmovdqa XMMWORD[64+rsp],xmm11 + vpshufd xmm14,xmm12,0xEE + vmovdqu xmm11,XMMWORD[48+rdi] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-64)+r11],xmm14 + vmovdqa XMMWORD[80+rsp],xmm12 + vpshufd xmm13,xmm10,0xEE + vmovdqu xmm12,XMMWORD[64+rdi] + vpshufd xmm10,xmm10,0x44 + vmovdqa XMMWORD[(-48)+r11],xmm13 + vmovdqa XMMWORD[96+rsp],xmm10 + vpshufd xmm14,xmm11,0xEE + vpshufd xmm11,xmm11,0x44 + vmovdqa XMMWORD[(-32)+r11],xmm14 + vmovdqa XMMWORD[112+rsp],xmm11 + vpshufd xmm13,xmm12,0xEE + vmovdqa xmm14,XMMWORD[rsp] + vpshufd xmm12,xmm12,0x44 + vmovdqa XMMWORD[(-16)+r11],xmm13 + vmovdqa XMMWORD[128+rsp],xmm12 + + jmp NEAR $L$oop_avx + +ALIGN 32 +$L$oop_avx: + + + + + + + + + + + + + + + + + + + + + vpmuludq xmm10,xmm14,xmm5 + vpmuludq xmm11,xmm14,xmm6 + vmovdqa XMMWORD[32+r11],xmm2 + vpmuludq xmm12,xmm14,xmm7 + vmovdqa xmm2,XMMWORD[16+rsp] + vpmuludq xmm13,xmm14,xmm8 + vpmuludq xmm14,xmm14,xmm9 + + vmovdqa XMMWORD[r11],xmm0 + vpmuludq xmm0,xmm9,XMMWORD[32+rsp] + vmovdqa XMMWORD[16+r11],xmm1 + vpmuludq xmm1,xmm2,xmm8 + vpaddq xmm10,xmm10,xmm0 + vpaddq xmm14,xmm14,xmm1 + vmovdqa XMMWORD[48+r11],xmm3 + vpmuludq xmm0,xmm2,xmm7 + vpmuludq xmm1,xmm2,xmm6 + vpaddq xmm13,xmm13,xmm0 + vmovdqa xmm3,XMMWORD[48+rsp] + vpaddq xmm12,xmm12,xmm1 + vmovdqa XMMWORD[64+r11],xmm4 + vpmuludq xmm2,xmm2,xmm5 + vpmuludq xmm0,xmm3,xmm7 + vpaddq xmm11,xmm11,xmm2 + + vmovdqa xmm4,XMMWORD[64+rsp] + vpaddq xmm14,xmm14,xmm0 + vpmuludq xmm1,xmm3,xmm6 + vpmuludq xmm3,xmm3,xmm5 + vpaddq xmm13,xmm13,xmm1 + vmovdqa xmm2,XMMWORD[80+rsp] + vpaddq xmm12,xmm12,xmm3 + vpmuludq xmm0,xmm4,xmm9 + vpmuludq xmm4,xmm4,xmm8 + vpaddq xmm11,xmm11,xmm0 + vmovdqa xmm3,XMMWORD[96+rsp] + vpaddq xmm10,xmm10,xmm4 + + vmovdqa xmm4,XMMWORD[128+rsp] + vpmuludq xmm1,xmm2,xmm6 + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm14,xmm14,xmm1 + vpaddq xmm13,xmm13,xmm2 + vpmuludq xmm0,xmm3,xmm9 + vpmuludq xmm1,xmm3,xmm8 + vpaddq xmm12,xmm12,xmm0 + vmovdqu xmm0,XMMWORD[rsi] + vpaddq xmm11,xmm11,xmm1 + vpmuludq xmm3,xmm3,xmm7 + vpmuludq xmm7,xmm4,xmm7 + vpaddq xmm10,xmm10,xmm3 + + vmovdqu xmm1,XMMWORD[16+rsi] + vpaddq xmm11,xmm11,xmm7 + vpmuludq xmm8,xmm4,xmm8 + vpmuludq xmm9,xmm4,xmm9 + vpsrldq xmm2,xmm0,6 + vpaddq xmm12,xmm12,xmm8 + vpaddq xmm13,xmm13,xmm9 + vpsrldq xmm3,xmm1,6 + vpmuludq xmm9,xmm5,XMMWORD[112+rsp] + vpmuludq xmm5,xmm4,xmm6 + vpunpckhqdq xmm4,xmm0,xmm1 + vpaddq xmm14,xmm14,xmm9 + vmovdqa xmm9,XMMWORD[((-144))+r11] + vpaddq xmm10,xmm10,xmm5 + + vpunpcklqdq xmm0,xmm0,xmm1 + vpunpcklqdq xmm3,xmm2,xmm3 + + + vpsrldq xmm4,xmm4,5 + vpsrlq xmm1,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpsrlq xmm2,xmm3,4 + vpand xmm1,xmm1,xmm15 + vpand xmm4,xmm4,XMMWORD[rcx] + vpsrlq xmm3,xmm3,30 + vpand xmm2,xmm2,xmm15 + vpand xmm3,xmm3,xmm15 + vpor xmm4,xmm4,XMMWORD[32+rcx] + + vpaddq xmm0,xmm0,XMMWORD[r11] + vpaddq xmm1,xmm1,XMMWORD[16+r11] + vpaddq xmm2,xmm2,XMMWORD[32+r11] + vpaddq xmm3,xmm3,XMMWORD[48+r11] + vpaddq xmm4,xmm4,XMMWORD[64+r11] + + lea rax,[32+rsi] + lea rsi,[64+rsi] + sub rdx,64 + cmovc rsi,rax + + + + + + + + + + + vpmuludq xmm5,xmm9,xmm0 + vpmuludq xmm6,xmm9,xmm1 + vpaddq xmm10,xmm10,xmm5 + vpaddq xmm11,xmm11,xmm6 + vmovdqa xmm7,XMMWORD[((-128))+r11] + vpmuludq xmm5,xmm9,xmm2 + vpmuludq xmm6,xmm9,xmm3 + vpaddq xmm12,xmm12,xmm5 + vpaddq xmm13,xmm13,xmm6 + vpmuludq 
xmm9,xmm9,xmm4 + vpmuludq xmm5,xmm4,XMMWORD[((-112))+r11] + vpaddq xmm14,xmm14,xmm9 + + vpaddq xmm10,xmm10,xmm5 + vpmuludq xmm6,xmm7,xmm2 + vpmuludq xmm5,xmm7,xmm3 + vpaddq xmm13,xmm13,xmm6 + vmovdqa xmm8,XMMWORD[((-96))+r11] + vpaddq xmm14,xmm14,xmm5 + vpmuludq xmm6,xmm7,xmm1 + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm12,xmm12,xmm6 + vpaddq xmm11,xmm11,xmm7 + + vmovdqa xmm9,XMMWORD[((-80))+r11] + vpmuludq xmm5,xmm8,xmm2 + vpmuludq xmm6,xmm8,xmm1 + vpaddq xmm14,xmm14,xmm5 + vpaddq xmm13,xmm13,xmm6 + vmovdqa xmm7,XMMWORD[((-64))+r11] + vpmuludq xmm8,xmm8,xmm0 + vpmuludq xmm5,xmm9,xmm4 + vpaddq xmm12,xmm12,xmm8 + vpaddq xmm11,xmm11,xmm5 + vmovdqa xmm8,XMMWORD[((-48))+r11] + vpmuludq xmm9,xmm9,xmm3 + vpmuludq xmm6,xmm7,xmm1 + vpaddq xmm10,xmm10,xmm9 + + vmovdqa xmm9,XMMWORD[((-16))+r11] + vpaddq xmm14,xmm14,xmm6 + vpmuludq xmm7,xmm7,xmm0 + vpmuludq xmm5,xmm8,xmm4 + vpaddq xmm13,xmm13,xmm7 + vpaddq xmm12,xmm12,xmm5 + vmovdqu xmm5,XMMWORD[32+rsi] + vpmuludq xmm7,xmm8,xmm3 + vpmuludq xmm8,xmm8,xmm2 + vpaddq xmm11,xmm11,xmm7 + vmovdqu xmm6,XMMWORD[48+rsi] + vpaddq xmm10,xmm10,xmm8 + + vpmuludq xmm2,xmm9,xmm2 + vpmuludq xmm3,xmm9,xmm3 + vpsrldq xmm7,xmm5,6 + vpaddq xmm11,xmm11,xmm2 + vpmuludq xmm4,xmm9,xmm4 + vpsrldq xmm8,xmm6,6 + vpaddq xmm2,xmm12,xmm3 + vpaddq xmm3,xmm13,xmm4 + vpmuludq xmm4,xmm0,XMMWORD[((-32))+r11] + vpmuludq xmm0,xmm9,xmm1 + vpunpckhqdq xmm9,xmm5,xmm6 + vpaddq xmm4,xmm14,xmm4 + vpaddq xmm0,xmm10,xmm0 + + vpunpcklqdq xmm5,xmm5,xmm6 + vpunpcklqdq xmm8,xmm7,xmm8 + + + vpsrldq xmm9,xmm9,5 + vpsrlq xmm6,xmm5,26 + vmovdqa xmm14,XMMWORD[rsp] + vpand xmm5,xmm5,xmm15 + vpsrlq xmm7,xmm8,4 + vpand xmm6,xmm6,xmm15 + vpand xmm9,xmm9,XMMWORD[rcx] + vpsrlq xmm8,xmm8,30 + vpand xmm7,xmm7,xmm15 + vpand xmm8,xmm8,xmm15 + vpor xmm9,xmm9,XMMWORD[32+rcx] + + + + + + vpsrlq xmm13,xmm3,26 + vpand xmm3,xmm3,xmm15 + vpaddq xmm4,xmm4,xmm13 + + vpsrlq xmm10,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpaddq xmm1,xmm11,xmm10 + + vpsrlq xmm10,xmm4,26 + vpand xmm4,xmm4,xmm15 + + vpsrlq xmm11,xmm1,26 + vpand xmm1,xmm1,xmm15 + vpaddq xmm2,xmm2,xmm11 + + vpaddq xmm0,xmm0,xmm10 + vpsllq xmm10,xmm10,2 + vpaddq xmm0,xmm0,xmm10 + + vpsrlq xmm12,xmm2,26 + vpand xmm2,xmm2,xmm15 + vpaddq xmm3,xmm3,xmm12 + + vpsrlq xmm10,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpaddq xmm1,xmm1,xmm10 + + vpsrlq xmm13,xmm3,26 + vpand xmm3,xmm3,xmm15 + vpaddq xmm4,xmm4,xmm13 + + ja NEAR $L$oop_avx + +$L$skip_loop_avx: + + + + vpshufd xmm14,xmm14,0x10 + add rdx,32 + jnz NEAR $L$ong_tail_avx + + vpaddq xmm7,xmm7,xmm2 + vpaddq xmm5,xmm5,xmm0 + vpaddq xmm6,xmm6,xmm1 + vpaddq xmm8,xmm8,xmm3 + vpaddq xmm9,xmm9,xmm4 + +$L$ong_tail_avx: + vmovdqa XMMWORD[32+r11],xmm2 + vmovdqa XMMWORD[r11],xmm0 + vmovdqa XMMWORD[16+r11],xmm1 + vmovdqa XMMWORD[48+r11],xmm3 + vmovdqa XMMWORD[64+r11],xmm4 + + + + + + + + vpmuludq xmm12,xmm14,xmm7 + vpmuludq xmm10,xmm14,xmm5 + vpshufd xmm2,XMMWORD[((-48))+rdi],0x10 + vpmuludq xmm11,xmm14,xmm6 + vpmuludq xmm13,xmm14,xmm8 + vpmuludq xmm14,xmm14,xmm9 + + vpmuludq xmm0,xmm2,xmm8 + vpaddq xmm14,xmm14,xmm0 + vpshufd xmm3,XMMWORD[((-32))+rdi],0x10 + vpmuludq xmm1,xmm2,xmm7 + vpaddq xmm13,xmm13,xmm1 + vpshufd xmm4,XMMWORD[((-16))+rdi],0x10 + vpmuludq xmm0,xmm2,xmm6 + vpaddq xmm12,xmm12,xmm0 + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm11,xmm11,xmm2 + vpmuludq xmm3,xmm3,xmm9 + vpaddq xmm10,xmm10,xmm3 + + vpshufd xmm2,XMMWORD[rdi],0x10 + vpmuludq xmm1,xmm4,xmm7 + vpaddq xmm14,xmm14,xmm1 + vpmuludq xmm0,xmm4,xmm6 + vpaddq xmm13,xmm13,xmm0 + vpshufd xmm3,XMMWORD[16+rdi],0x10 + vpmuludq xmm4,xmm4,xmm5 + vpaddq xmm12,xmm12,xmm4 + vpmuludq xmm1,xmm2,xmm9 + 
vpaddq xmm11,xmm11,xmm1 + vpshufd xmm4,XMMWORD[32+rdi],0x10 + vpmuludq xmm2,xmm2,xmm8 + vpaddq xmm10,xmm10,xmm2 + + vpmuludq xmm0,xmm3,xmm6 + vpaddq xmm14,xmm14,xmm0 + vpmuludq xmm3,xmm3,xmm5 + vpaddq xmm13,xmm13,xmm3 + vpshufd xmm2,XMMWORD[48+rdi],0x10 + vpmuludq xmm1,xmm4,xmm9 + vpaddq xmm12,xmm12,xmm1 + vpshufd xmm3,XMMWORD[64+rdi],0x10 + vpmuludq xmm0,xmm4,xmm8 + vpaddq xmm11,xmm11,xmm0 + vpmuludq xmm4,xmm4,xmm7 + vpaddq xmm10,xmm10,xmm4 + + vpmuludq xmm2,xmm2,xmm5 + vpaddq xmm14,xmm14,xmm2 + vpmuludq xmm1,xmm3,xmm9 + vpaddq xmm13,xmm13,xmm1 + vpmuludq xmm0,xmm3,xmm8 + vpaddq xmm12,xmm12,xmm0 + vpmuludq xmm1,xmm3,xmm7 + vpaddq xmm11,xmm11,xmm1 + vpmuludq xmm3,xmm3,xmm6 + vpaddq xmm10,xmm10,xmm3 + + jz NEAR $L$short_tail_avx + + vmovdqu xmm0,XMMWORD[rsi] + vmovdqu xmm1,XMMWORD[16+rsi] + + vpsrldq xmm2,xmm0,6 + vpsrldq xmm3,xmm1,6 + vpunpckhqdq xmm4,xmm0,xmm1 + vpunpcklqdq xmm0,xmm0,xmm1 + vpunpcklqdq xmm3,xmm2,xmm3 + + vpsrlq xmm4,xmm4,40 + vpsrlq xmm1,xmm0,26 + vpand xmm0,xmm0,xmm15 + vpsrlq xmm2,xmm3,4 + vpand xmm1,xmm1,xmm15 + vpsrlq xmm3,xmm3,30 + vpand xmm2,xmm2,xmm15 + vpand xmm3,xmm3,xmm15 + vpor xmm4,xmm4,XMMWORD[32+rcx] + + vpshufd xmm9,XMMWORD[((-64))+rdi],0x32 + vpaddq xmm0,xmm0,XMMWORD[r11] + vpaddq xmm1,xmm1,XMMWORD[16+r11] + vpaddq xmm2,xmm2,XMMWORD[32+r11] + vpaddq xmm3,xmm3,XMMWORD[48+r11] + vpaddq xmm4,xmm4,XMMWORD[64+r11] + + + + + vpmuludq xmm5,xmm9,xmm0 + vpaddq xmm10,xmm10,xmm5 + vpmuludq xmm6,xmm9,xmm1 + vpaddq xmm11,xmm11,xmm6 + vpmuludq xmm5,xmm9,xmm2 + vpaddq xmm12,xmm12,xmm5 + vpshufd xmm7,XMMWORD[((-48))+rdi],0x32 + vpmuludq xmm6,xmm9,xmm3 + vpaddq xmm13,xmm13,xmm6 + vpmuludq xmm9,xmm9,xmm4 + vpaddq xmm14,xmm14,xmm9 + + vpmuludq xmm5,xmm7,xmm3 + vpaddq xmm14,xmm14,xmm5 + vpshufd xmm8,XMMWORD[((-32))+rdi],0x32 + vpmuludq xmm6,xmm7,xmm2 + vpaddq xmm13,xmm13,xmm6 + vpshufd xmm9,XMMWORD[((-16))+rdi],0x32 + vpmuludq xmm5,xmm7,xmm1 + vpaddq xmm12,xmm12,xmm5 + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm11,xmm11,xmm7 + vpmuludq xmm8,xmm8,xmm4 + vpaddq xmm10,xmm10,xmm8 + + vpshufd xmm7,XMMWORD[rdi],0x32 + vpmuludq xmm6,xmm9,xmm2 + vpaddq xmm14,xmm14,xmm6 + vpmuludq xmm5,xmm9,xmm1 + vpaddq xmm13,xmm13,xmm5 + vpshufd xmm8,XMMWORD[16+rdi],0x32 + vpmuludq xmm9,xmm9,xmm0 + vpaddq xmm12,xmm12,xmm9 + vpmuludq xmm6,xmm7,xmm4 + vpaddq xmm11,xmm11,xmm6 + vpshufd xmm9,XMMWORD[32+rdi],0x32 + vpmuludq xmm7,xmm7,xmm3 + vpaddq xmm10,xmm10,xmm7 + + vpmuludq xmm5,xmm8,xmm1 + vpaddq xmm14,xmm14,xmm5 + vpmuludq xmm8,xmm8,xmm0 + vpaddq xmm13,xmm13,xmm8 + vpshufd xmm7,XMMWORD[48+rdi],0x32 + vpmuludq xmm6,xmm9,xmm4 + vpaddq xmm12,xmm12,xmm6 + vpshufd xmm8,XMMWORD[64+rdi],0x32 + vpmuludq xmm5,xmm9,xmm3 + vpaddq xmm11,xmm11,xmm5 + vpmuludq xmm9,xmm9,xmm2 + vpaddq xmm10,xmm10,xmm9 + + vpmuludq xmm7,xmm7,xmm0 + vpaddq xmm14,xmm14,xmm7 + vpmuludq xmm6,xmm8,xmm4 + vpaddq xmm13,xmm13,xmm6 + vpmuludq xmm5,xmm8,xmm3 + vpaddq xmm12,xmm12,xmm5 + vpmuludq xmm6,xmm8,xmm2 + vpaddq xmm11,xmm11,xmm6 + vpmuludq xmm8,xmm8,xmm1 + vpaddq xmm10,xmm10,xmm8 + +$L$short_tail_avx: + + + + vpsrldq xmm9,xmm14,8 + vpsrldq xmm8,xmm13,8 + vpsrldq xmm6,xmm11,8 + vpsrldq xmm5,xmm10,8 + vpsrldq xmm7,xmm12,8 + vpaddq xmm13,xmm13,xmm8 + vpaddq xmm14,xmm14,xmm9 + vpaddq xmm10,xmm10,xmm5 + vpaddq xmm11,xmm11,xmm6 + vpaddq xmm12,xmm12,xmm7 + + + + + vpsrlq xmm3,xmm13,26 + vpand xmm13,xmm13,xmm15 + vpaddq xmm14,xmm14,xmm3 + + vpsrlq xmm0,xmm10,26 + vpand xmm10,xmm10,xmm15 + vpaddq xmm11,xmm11,xmm0 + + vpsrlq xmm4,xmm14,26 + vpand xmm14,xmm14,xmm15 + + vpsrlq xmm1,xmm11,26 + vpand xmm11,xmm11,xmm15 + vpaddq xmm12,xmm12,xmm1 + + vpaddq 
xmm10,xmm10,xmm4 + vpsllq xmm4,xmm4,2 + vpaddq xmm10,xmm10,xmm4 + + vpsrlq xmm2,xmm12,26 + vpand xmm12,xmm12,xmm15 + vpaddq xmm13,xmm13,xmm2 + + vpsrlq xmm0,xmm10,26 + vpand xmm10,xmm10,xmm15 + vpaddq xmm11,xmm11,xmm0 + + vpsrlq xmm3,xmm13,26 + vpand xmm13,xmm13,xmm15 + vpaddq xmm14,xmm14,xmm3 + + vmovd DWORD[(-112)+rdi],xmm10 + vmovd DWORD[(-108)+rdi],xmm11 + vmovd DWORD[(-104)+rdi],xmm12 + vmovd DWORD[(-100)+rdi],xmm13 + vmovd DWORD[(-96)+rdi],xmm14 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx_epilogue: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx: + + +ALIGN 32 +poly1305_emit_avx: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_emit_avx: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + + + cmp DWORD[20+rdi],0 + je NEAR $L$emit + + mov eax,DWORD[rdi] + mov ecx,DWORD[4+rdi] + mov r8d,DWORD[8+rdi] + mov r11d,DWORD[12+rdi] + mov r10d,DWORD[16+rdi] + + shl rcx,26 + mov r9,r8 + shl r8,52 + add rax,rcx + shr r9,12 + add r8,rax + adc r9,0 + + shl r11,14 + mov rax,r10 + shr r10,24 + add r9,r11 + shl rax,40 + add r9,rax + adc r10,0 + + mov rax,r10 + mov rcx,r10 + and r10,3 + shr rax,2 + and rcx,-4 + add rax,rcx + add r8,rax + adc r9,0 + adc r10,0 + + mov rax,r8 + add r8,5 + mov rcx,r9 + adc r9,0 + adc r10,0 + shr r10,2 + cmovnz rax,r8 + cmovnz rcx,r9 + + add rax,QWORD[rdx] + adc rcx,QWORD[8+rdx] + mov QWORD[rsi],rax + mov QWORD[8+rsi],rcx + + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret +$L$SEH_end_poly1305_emit_avx: + +ALIGN 32 +poly1305_blocks_avx2: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx2: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx2 + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx2: + and rdx,-16 + jz NEAR $L$no_data_avx2 + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx2 + + test rdx,63 + jz NEAR $L$even_avx2 + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx2_body: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + +$L$base2_26_pre_avx2: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_26_pre_avx2 + + test rcx,rcx + jz NEAR $L$store_base2_64_avx2 + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + test r15,r15 + jz NEAR 
$L$store_base2_26_avx2 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx2 + +ALIGN 32 +$L$store_base2_64_avx2: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx2 + +ALIGN 16 +$L$store_base2_26_avx2: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx2: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx2: +$L$blocks_avx2_epilogue: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx2: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx2_body: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,63 + jz NEAR $L$init_avx2 + +$L$base2_64_pre_avx2: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_64_pre_avx2 + +$L$init_avx2: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx2: + mov rdx,r15 + + + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx2_epilogue: + jmp NEAR $L$do_avx2 + + +ALIGN 32 +$L$even_avx2: + + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + +$L$do_avx2: + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx2_body: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm7,YMMWORD[96+rcx] + + + vmovdqu xmm9,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm10,XMMWORD[((-48))+rdi] + vmovdqu xmm6,XMMWORD[((-32))+rdi] + vmovdqu xmm11,XMMWORD[((-16))+rdi] + vmovdqu xmm12,XMMWORD[rdi] + vmovdqu xmm13,XMMWORD[16+rdi] + lea rax,[144+rsp] + vmovdqu xmm14,XMMWORD[32+rdi] + vpermd ymm9,ymm7,ymm9 + vmovdqu xmm15,XMMWORD[48+rdi] + vpermd ymm10,ymm7,ymm10 + vmovdqu xmm5,XMMWORD[64+rdi] + vpermd ymm6,ymm7,ymm6 + vmovdqa YMMWORD[rsp],ymm9 + vpermd ymm11,ymm7,ymm11 + vmovdqa YMMWORD[(32-144)+rax],ymm10 + vpermd ymm12,ymm7,ymm12 + vmovdqa YMMWORD[(64-144)+rax],ymm6 + vpermd ymm13,ymm7,ymm13 + vmovdqa YMMWORD[(96-144)+rax],ymm11 + vpermd ymm14,ymm7,ymm14 + vmovdqa YMMWORD[(128-144)+rax],ymm12 + vpermd ymm15,ymm7,ymm15 + vmovdqa YMMWORD[(160-144)+rax],ymm13 + vpermd ymm5,ymm7,ymm5 + vmovdqa YMMWORD[(192-144)+rax],ymm14 + vmovdqa YMMWORD[(224-144)+rax],ymm15 + vmovdqa YMMWORD[(256-144)+rax],ymm5 + vmovdqa 
ymm5,YMMWORD[64+rcx] + + + + vmovdqu xmm7,XMMWORD[rsi] + vmovdqu xmm8,XMMWORD[16+rsi] + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + vpaddq ymm2,ymm9,ymm2 + sub rdx,64 + jz NEAR $L$tail_avx2 + jmp NEAR $L$oop_avx2 + +ALIGN 32 +$L$oop_avx2: + + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqa ymm7,YMMWORD[rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqa ymm8,YMMWORD[32+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqa ymm9,YMMWORD[96+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqa ymm10,YMMWORD[48+rax] + vmovdqa ymm5,YMMWORD[112+rax] + + + + + + + + + + + + + + + + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[64+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + vmovdqa ymm8,YMMWORD[((-16))+rax] + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vmovdqu xmm7,XMMWORD[rsi] + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vmovdqu xmm8,XMMWORD[16+rsi] + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqa ymm2,YMMWORD[16+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpsrldq ymm9,ymm7,6 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpsrldq ymm10,ymm8,6 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpunpckhqdq ymm6,ymm7,ymm8 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpunpcklqdq ymm10,ymm9,ymm10 + vpmuludq ymm4,ymm0,YMMWORD[80+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm9,ymm10,4 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpand ymm9,ymm9,ymm5 + vpsrlq ymm8,ymm7,26 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpaddq ymm2,ymm2,ymm9 + vpsrlq ymm10,ymm10,30 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm6,ymm6,40 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + sub rdx,64 + jnz NEAR $L$oop_avx2 + +DB 0x66,0x90 +$L$tail_avx2: + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqu ymm7,YMMWORD[4+rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqu 
ymm8,YMMWORD[36+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqu ymm9,YMMWORD[100+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqu ymm10,YMMWORD[52+rax] + vmovdqu ymm5,YMMWORD[116+rax] + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[68+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vmovdqu ymm8,YMMWORD[((-12))+rax] + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqu ymm2,YMMWORD[20+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpmuludq ymm4,ymm0,YMMWORD[84+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrldq ymm8,ymm12,8 + vpsrldq ymm9,ymm2,8 + vpsrldq ymm10,ymm3,8 + vpsrldq ymm6,ymm4,8 + vpsrldq ymm7,ymm0,8 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + + vpermq ymm10,ymm3,0x2 + vpermq ymm6,ymm4,0x2 + vpermq ymm7,ymm0,0x2 + vpermq ymm8,ymm12,0x2 + vpermq ymm9,ymm2,0x2 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx2_epilogue: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx2: + +ALIGN 32 +poly1305_blocks_avx512: + mov QWORD[8+rsp],rdi ;WIN64 prologue + mov QWORD[16+rsp],rsi + mov rax,rsp +$L$SEH_begin_poly1305_blocks_avx512: + mov rdi,rcx + mov rsi,rdx + mov rdx,r8 + mov rcx,r9 + + + + mov r8d,DWORD[20+rdi] + cmp rdx,128 + jae NEAR $L$blocks_avx2_512 + test r8d,r8d + jz NEAR $L$blocks + +$L$blocks_avx2_512: + and rdx,-16 + jz NEAR 
$L$no_data_avx2_512 + + vzeroupper + + test r8d,r8d + jz NEAR $L$base2_64_avx2_512 + + test rdx,63 + jz NEAR $L$even_avx2_512 + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$blocks_avx2_body_512: + + mov r15,rdx + + mov r8,QWORD[rdi] + mov r9,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + + mov r14d,r8d + and r8,-2147483648 + mov r12,r9 + mov ebx,r9d + and r9,-2147483648 + + shr r8,6 + shl r12,52 + add r14,r8 + shr rbx,12 + shr r9,18 + add r14,r12 + adc rbx,r9 + + mov r8,rbp + shl r8,40 + shr rbp,24 + add rbx,r8 + adc rbp,0 + + mov r9,-4 + mov r8,rbp + and r9,rbp + shr r8,2 + and rbp,3 + add r8,r9 + add r14,r8 + adc rbx,0 + adc rbp,0 + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + +$L$base2_26_pre_avx2_512: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_26_pre_avx2_512 + + test rcx,rcx + jz NEAR $L$store_base2_64_avx2_512 + + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r11,rbx + mov r12,rbx + shr rdx,26 + and rax,0x3ffffff + shl r11,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r11 + shl rbp,24 + and r14,0x3ffffff + shr r12,40 + and rbx,0x3ffffff + or rbp,r12 + + test r15,r15 + jz NEAR $L$store_base2_26_avx2_512 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + jmp NEAR $L$proceed_avx2_512 + +ALIGN 32 +$L$store_base2_64_avx2_512: + mov QWORD[rdi],r14 + mov QWORD[8+rdi],rbx + mov QWORD[16+rdi],rbp + jmp NEAR $L$done_avx2_512 + +ALIGN 16 +$L$store_base2_26_avx2_512: + mov DWORD[rdi],eax + mov DWORD[4+rdi],edx + mov DWORD[8+rdi],r14d + mov DWORD[12+rdi],ebx + mov DWORD[16+rdi],ebp +ALIGN 16 +$L$done_avx2_512: + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rsp,[48+rsp] + +$L$no_data_avx2_512: +$L$blocks_avx2_epilogue_512: + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + + +ALIGN 32 +$L$base2_64_avx2_512: + + push rbx + + push rbp + + push r12 + + push r13 + + push r14 + + push r15 + +$L$base2_64_avx2_body_512: + + mov r15,rdx + + mov r11,QWORD[24+rdi] + mov r13,QWORD[32+rdi] + + mov r14,QWORD[rdi] + mov rbx,QWORD[8+rdi] + mov ebp,DWORD[16+rdi] + + mov r12,r13 + mov rax,r13 + shr r13,2 + add r13,r12 + + test rdx,63 + jz NEAR $L$init_avx2_512 + +$L$base2_64_pre_avx2_512: + add r14,QWORD[rsi] + adc rbx,QWORD[8+rsi] + lea rsi,[16+rsi] + adc rbp,rcx + sub r15,16 + + call __poly1305_block + mov rax,r12 + + test r15,63 + jnz NEAR $L$base2_64_pre_avx2_512 + +$L$init_avx2_512: + + mov rax,r14 + mov rdx,r14 + shr r14,52 + mov r8,rbx + mov r9,rbx + shr rdx,26 + and rax,0x3ffffff + shl r8,12 + and rdx,0x3ffffff + shr rbx,14 + or r14,r8 + shl rbp,24 + and r14,0x3ffffff + shr r9,40 + and rbx,0x3ffffff + or rbp,r9 + + vmovd xmm0,eax + vmovd xmm1,edx + vmovd xmm2,r14d + vmovd xmm3,ebx + vmovd xmm4,ebp + mov DWORD[20+rdi],1 + + call __poly1305_init_avx + +$L$proceed_avx2_512: + mov rdx,r15 + + + + mov r15,QWORD[rsp] + + mov r14,QWORD[8+rsp] + + mov r13,QWORD[16+rsp] + + mov r12,QWORD[24+rsp] + + mov rbp,QWORD[32+rsp] + + mov rbx,QWORD[40+rsp] + + lea rax,[48+rsp] + lea rsp,[48+rsp] + +$L$base2_64_avx2_epilogue_512: + jmp NEAR $L$do_avx2_512 + + +ALIGN 32 +$L$even_avx2_512: + + + vmovd xmm0,DWORD[rdi] + vmovd xmm1,DWORD[4+rdi] + vmovd xmm2,DWORD[8+rdi] + vmovd xmm3,DWORD[12+rdi] + vmovd xmm4,DWORD[16+rdi] + 
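+; $L$do_avx2_512: entry point with the accumulator held as five 26-bit
+; limbs in xmm0-xmm4 (the 0x3ffffff masks above perform the radix-2^26
+; conversion). Inputs of 512 bytes or more take the AVX-512 path at
+; $L$blocks_avx512; anything shorter falls through to the AVX2 code at
+; $L$skip_avx512.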
+$L$do_avx2_512: + cmp rdx,512 + jae NEAR $L$blocks_avx512 +$L$skip_avx512: + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx2_body_512: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm7,YMMWORD[96+rcx] + + + vmovdqu xmm9,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm10,XMMWORD[((-48))+rdi] + vmovdqu xmm6,XMMWORD[((-32))+rdi] + vmovdqu xmm11,XMMWORD[((-16))+rdi] + vmovdqu xmm12,XMMWORD[rdi] + vmovdqu xmm13,XMMWORD[16+rdi] + lea rax,[144+rsp] + vmovdqu xmm14,XMMWORD[32+rdi] + vpermd ymm9,ymm7,ymm9 + vmovdqu xmm15,XMMWORD[48+rdi] + vpermd ymm10,ymm7,ymm10 + vmovdqu xmm5,XMMWORD[64+rdi] + vpermd ymm6,ymm7,ymm6 + vmovdqa YMMWORD[rsp],ymm9 + vpermd ymm11,ymm7,ymm11 + vmovdqa YMMWORD[(32-144)+rax],ymm10 + vpermd ymm12,ymm7,ymm12 + vmovdqa YMMWORD[(64-144)+rax],ymm6 + vpermd ymm13,ymm7,ymm13 + vmovdqa YMMWORD[(96-144)+rax],ymm11 + vpermd ymm14,ymm7,ymm14 + vmovdqa YMMWORD[(128-144)+rax],ymm12 + vpermd ymm15,ymm7,ymm15 + vmovdqa YMMWORD[(160-144)+rax],ymm13 + vpermd ymm5,ymm7,ymm5 + vmovdqa YMMWORD[(192-144)+rax],ymm14 + vmovdqa YMMWORD[(224-144)+rax],ymm15 + vmovdqa YMMWORD[(256-144)+rax],ymm5 + vmovdqa ymm5,YMMWORD[64+rcx] + + + + vmovdqu xmm7,XMMWORD[rsi] + vmovdqu xmm8,XMMWORD[16+rsi] + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + vpaddq ymm2,ymm9,ymm2 + sub rdx,64 + jz NEAR $L$tail_avx2_512 + jmp NEAR $L$oop_avx2_512 + +ALIGN 32 +$L$oop_avx2_512: + + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqa ymm7,YMMWORD[rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqa ymm8,YMMWORD[32+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqa ymm9,YMMWORD[96+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqa ymm10,YMMWORD[48+rax] + vmovdqa ymm5,YMMWORD[112+rax] + + + + + + + + + + + + + + + + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[64+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + vmovdqa ymm8,YMMWORD[((-16))+rax] + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vmovdqu xmm7,XMMWORD[rsi] + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vmovdqu xmm8,XMMWORD[16+rsi] + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqa ymm2,YMMWORD[16+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + lea rsi,[64+rsi] + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpsrldq ymm9,ymm7,6 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + 
vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpsrldq ymm10,ymm8,6 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpunpckhqdq ymm6,ymm7,ymm8 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpunpcklqdq ymm10,ymm9,ymm10 + vpmuludq ymm4,ymm0,YMMWORD[80+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm9,ymm10,4 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpand ymm9,ymm9,ymm5 + vpsrlq ymm8,ymm7,26 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpaddq ymm2,ymm2,ymm9 + vpsrlq ymm10,ymm10,30 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm6,ymm6,40 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpand ymm7,ymm7,ymm5 + vpand ymm8,ymm8,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + + sub rdx,64 + jnz NEAR $L$oop_avx2_512 + +DB 0x66,0x90 +$L$tail_avx2_512: + + + + + + + + vpaddq ymm0,ymm7,ymm0 + vmovdqu ymm7,YMMWORD[4+rsp] + vpaddq ymm1,ymm8,ymm1 + vmovdqu ymm8,YMMWORD[36+rsp] + vpaddq ymm3,ymm10,ymm3 + vmovdqu ymm9,YMMWORD[100+rsp] + vpaddq ymm4,ymm6,ymm4 + vmovdqu ymm10,YMMWORD[52+rax] + vmovdqu ymm5,YMMWORD[116+rax] + + vpmuludq ymm13,ymm7,ymm2 + vpmuludq ymm14,ymm8,ymm2 + vpmuludq ymm15,ymm9,ymm2 + vpmuludq ymm11,ymm10,ymm2 + vpmuludq ymm12,ymm5,ymm2 + + vpmuludq ymm6,ymm8,ymm0 + vpmuludq ymm2,ymm8,ymm1 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm4,YMMWORD[68+rsp] + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm11,ymm11,ymm2 + + vpmuludq ymm6,ymm7,ymm0 + vpmuludq ymm2,ymm7,ymm1 + vpaddq ymm11,ymm11,ymm6 + vmovdqu ymm8,YMMWORD[((-12))+rax] + vpaddq ymm12,ymm12,ymm2 + vpmuludq ymm6,ymm7,ymm3 + vpmuludq ymm2,ymm7,ymm4 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm15,ymm15,ymm2 + + vpmuludq ymm6,ymm8,ymm3 + vpmuludq ymm2,ymm8,ymm4 + vpaddq ymm11,ymm11,ymm6 + vpaddq ymm12,ymm12,ymm2 + vmovdqu ymm2,YMMWORD[20+rax] + vpmuludq ymm6,ymm9,ymm1 + vpmuludq ymm9,ymm9,ymm0 + vpaddq ymm14,ymm14,ymm6 + vpaddq ymm13,ymm13,ymm9 + + vpmuludq ymm6,ymm2,ymm1 + vpmuludq ymm2,ymm2,ymm0 + vpaddq ymm15,ymm15,ymm6 + vpaddq ymm14,ymm14,ymm2 + vpmuludq ymm6,ymm10,ymm3 + vpmuludq ymm2,ymm10,ymm4 + vpaddq ymm12,ymm12,ymm6 + vpaddq ymm13,ymm13,ymm2 + + vpmuludq ymm3,ymm5,ymm3 + vpmuludq ymm4,ymm5,ymm4 + vpaddq ymm2,ymm13,ymm3 + vpaddq ymm3,ymm14,ymm4 + vpmuludq ymm4,ymm0,YMMWORD[84+rax] + vpmuludq ymm0,ymm5,ymm1 + vmovdqa ymm5,YMMWORD[64+rcx] + vpaddq ymm4,ymm15,ymm4 + vpaddq ymm0,ymm11,ymm0 + + + + + vpsrldq ymm8,ymm12,8 + vpsrldq ymm9,ymm2,8 + vpsrldq ymm10,ymm3,8 + vpsrldq ymm6,ymm4,8 + vpsrldq ymm7,ymm0,8 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + + vpermq ymm10,ymm3,0x2 + vpermq ymm6,ymm4,0x2 + vpermq ymm7,ymm0,0x2 + vpermq ymm8,ymm12,0x2 + vpermq ymm9,ymm2,0x2 + vpaddq ymm3,ymm3,ymm10 + vpaddq ymm4,ymm4,ymm6 + vpaddq ymm0,ymm0,ymm7 + vpaddq ymm12,ymm12,ymm8 + vpaddq ymm2,ymm2,ymm9 + + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq 
ymm1,ymm12,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpaddq ymm4,ymm4,ymm14 + + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vmovdqa xmm6,XMMWORD[80+r11] + vmovdqa xmm7,XMMWORD[96+r11] + vmovdqa xmm8,XMMWORD[112+r11] + vmovdqa xmm9,XMMWORD[128+r11] + vmovdqa xmm10,XMMWORD[144+r11] + vmovdqa xmm11,XMMWORD[160+r11] + vmovdqa xmm12,XMMWORD[176+r11] + vmovdqa xmm13,XMMWORD[192+r11] + vmovdqa xmm14,XMMWORD[208+r11] + vmovdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx2_epilogue_512: + vzeroupper + mov rdi,QWORD[8+rsp] ;WIN64 epilogue + mov rsi,QWORD[16+rsp] + DB 0F3h,0C3h ;repret + +$L$SEH_end_poly1305_blocks_avx512: +$L$blocks_avx512: + mov eax,15 + kmovw k2,eax + lea r11,[((-248))+rsp] + sub rsp,0x1c8 + vmovdqa XMMWORD[80+r11],xmm6 + vmovdqa XMMWORD[96+r11],xmm7 + vmovdqa XMMWORD[112+r11],xmm8 + vmovdqa XMMWORD[128+r11],xmm9 + vmovdqa XMMWORD[144+r11],xmm10 + vmovdqa XMMWORD[160+r11],xmm11 + vmovdqa XMMWORD[176+r11],xmm12 + vmovdqa XMMWORD[192+r11],xmm13 + vmovdqa XMMWORD[208+r11],xmm14 + vmovdqa XMMWORD[224+r11],xmm15 +$L$do_avx512_body: + lea rcx,[$L$const] + lea rdi,[((48+64))+rdi] + vmovdqa ymm9,YMMWORD[96+rcx] + + + vmovdqu xmm11,XMMWORD[((-64))+rdi] + and rsp,-512 + vmovdqu xmm12,XMMWORD[((-48))+rdi] + mov rax,0x20 + vmovdqu xmm7,XMMWORD[((-32))+rdi] + vmovdqu xmm13,XMMWORD[((-16))+rdi] + vmovdqu xmm8,XMMWORD[rdi] + vmovdqu xmm14,XMMWORD[16+rdi] + vmovdqu xmm10,XMMWORD[32+rdi] + vmovdqu xmm15,XMMWORD[48+rdi] + vmovdqu xmm6,XMMWORD[64+rdi] + vpermd zmm16,zmm9,zmm11 + vpbroadcastq zmm5,QWORD[64+rcx] + vpermd zmm17,zmm9,zmm12 + vpermd zmm21,zmm9,zmm7 + vpermd zmm18,zmm9,zmm13 + vmovdqa64 ZMMWORD[rsp]{k2},zmm16 + vpsrlq zmm7,zmm16,32 + vpermd zmm22,zmm9,zmm8 + vmovdqu64 ZMMWORD[rax*1+rsp]{k2},zmm17 + vpsrlq zmm8,zmm17,32 + vpermd zmm19,zmm9,zmm14 + vmovdqa64 ZMMWORD[64+rsp]{k2},zmm21 + vpermd zmm23,zmm9,zmm10 + vpermd zmm20,zmm9,zmm15 + vmovdqu64 ZMMWORD[64+rax*1+rsp]{k2},zmm18 + vpermd zmm24,zmm9,zmm6 + vmovdqa64 ZMMWORD[128+rsp]{k2},zmm22 + vmovdqu64 ZMMWORD[128+rax*1+rsp]{k2},zmm19 + vmovdqa64 ZMMWORD[192+rsp]{k2},zmm23 + vmovdqu64 ZMMWORD[192+rax*1+rsp]{k2},zmm20 + vmovdqa64 ZMMWORD[256+rsp]{k2},zmm24 + + + + + + + + + + + vpmuludq zmm11,zmm16,zmm7 + vpmuludq zmm12,zmm17,zmm7 + vpmuludq zmm13,zmm18,zmm7 + vpmuludq zmm14,zmm19,zmm7 + vpmuludq zmm15,zmm20,zmm7 + vpsrlq zmm9,zmm18,32 + + vpmuludq zmm25,zmm24,zmm8 + vpmuludq zmm26,zmm16,zmm8 + vpmuludq zmm27,zmm17,zmm8 + vpmuludq zmm28,zmm18,zmm8 + vpmuludq zmm29,zmm19,zmm8 + vpsrlq zmm10,zmm19,32 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + + vpmuludq zmm25,zmm23,zmm9 + vpmuludq zmm26,zmm24,zmm9 + vpmuludq zmm28,zmm17,zmm9 + vpmuludq zmm29,zmm18,zmm9 + vpmuludq zmm27,zmm16,zmm9 + vpsrlq zmm6,zmm20,32 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm22,zmm10 + vpmuludq zmm28,zmm16,zmm10 + vpmuludq zmm29,zmm17,zmm10 + vpmuludq zmm26,zmm23,zmm10 + vpmuludq zmm27,zmm24,zmm10 + 
vpaddq zmm11,zmm11,zmm25 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm28,zmm24,zmm6 + vpmuludq zmm29,zmm16,zmm6 + vpmuludq zmm25,zmm21,zmm6 + vpmuludq zmm26,zmm22,zmm6 + vpmuludq zmm27,zmm23,zmm6 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + + + vmovdqu64 zmm10,ZMMWORD[rsi] + vmovdqu64 zmm6,ZMMWORD[64+rsi] + lea rsi,[128+rsi] + + + + + vpsrlq zmm28,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm15,zmm15,zmm28 + + vpsrlq zmm25,zmm11,26 + vpandq zmm11,zmm11,zmm5 + vpaddq zmm12,zmm12,zmm25 + + vpsrlq zmm29,zmm15,26 + vpandq zmm15,zmm15,zmm5 + + vpsrlq zmm26,zmm12,26 + vpandq zmm12,zmm12,zmm5 + vpaddq zmm13,zmm13,zmm26 + + vpaddq zmm11,zmm11,zmm29 + vpsllq zmm29,zmm29,2 + vpaddq zmm11,zmm11,zmm29 + + vpsrlq zmm27,zmm13,26 + vpandq zmm13,zmm13,zmm5 + vpaddq zmm14,zmm14,zmm27 + + vpsrlq zmm25,zmm11,26 + vpandq zmm11,zmm11,zmm5 + vpaddq zmm12,zmm12,zmm25 + + vpsrlq zmm28,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm15,zmm15,zmm28 + + + + + + vpunpcklqdq zmm7,zmm10,zmm6 + vpunpckhqdq zmm6,zmm10,zmm6 + + + + + + + vmovdqa32 zmm25,ZMMWORD[128+rcx] + mov eax,0x7777 + kmovw k1,eax + + vpermd zmm16,zmm25,zmm16 + vpermd zmm17,zmm25,zmm17 + vpermd zmm18,zmm25,zmm18 + vpermd zmm19,zmm25,zmm19 + vpermd zmm20,zmm25,zmm20 + + vpermd zmm16{k1},zmm25,zmm11 + vpermd zmm17{k1},zmm25,zmm12 + vpermd zmm18{k1},zmm25,zmm13 + vpermd zmm19{k1},zmm25,zmm14 + vpermd zmm20{k1},zmm25,zmm15 + + vpslld zmm21,zmm17,2 + vpslld zmm22,zmm18,2 + vpslld zmm23,zmm19,2 + vpslld zmm24,zmm20,2 + vpaddd zmm21,zmm21,zmm17 + vpaddd zmm22,zmm22,zmm18 + vpaddd zmm23,zmm23,zmm19 + vpaddd zmm24,zmm24,zmm20 + + vpbroadcastq zmm30,QWORD[32+rcx] + + vpsrlq zmm9,zmm7,52 + vpsllq zmm10,zmm6,12 + vporq zmm9,zmm9,zmm10 + vpsrlq zmm8,zmm7,26 + vpsrlq zmm10,zmm6,14 + vpsrlq zmm6,zmm6,40 + vpandq zmm9,zmm9,zmm5 + vpandq zmm7,zmm7,zmm5 + + + + + vpaddq zmm2,zmm9,zmm2 + sub rdx,192 + jbe NEAR $L$tail_avx512 + jmp NEAR $L$oop_avx512 + +ALIGN 32 +$L$oop_avx512: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + vpmuludq zmm14,zmm17,zmm2 + vpaddq zmm0,zmm7,zmm0 + vpmuludq zmm15,zmm18,zmm2 + vpandq zmm8,zmm8,zmm5 + vpmuludq zmm11,zmm23,zmm2 + vpandq zmm10,zmm10,zmm5 + vpmuludq zmm12,zmm24,zmm2 + vporq zmm6,zmm6,zmm30 + vpmuludq zmm13,zmm16,zmm2 + vpaddq zmm1,zmm8,zmm1 + vpaddq zmm3,zmm10,zmm3 + vpaddq zmm4,zmm6,zmm4 + + vmovdqu64 zmm10,ZMMWORD[rsi] + vmovdqu64 zmm6,ZMMWORD[64+rsi] + lea rsi,[128+rsi] + vpmuludq zmm28,zmm19,zmm0 + vpmuludq zmm29,zmm20,zmm0 + vpmuludq zmm25,zmm16,zmm0 + vpmuludq zmm26,zmm17,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + + vpmuludq zmm28,zmm18,zmm1 + vpmuludq zmm29,zmm19,zmm1 + vpmuludq zmm25,zmm24,zmm1 + vpmuludq zmm27,zmm18,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm13,zmm13,zmm27 + + vpunpcklqdq zmm7,zmm10,zmm6 + vpunpckhqdq zmm6,zmm10,zmm6 + + vpmuludq zmm28,zmm16,zmm3 + vpmuludq zmm29,zmm17,zmm3 + vpmuludq zmm26,zmm16,zmm1 + vpmuludq zmm27,zmm17,zmm1 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm28,zmm24,zmm4 + vpmuludq zmm29,zmm16,zmm4 + vpmuludq zmm25,zmm22,zmm3 + vpmuludq zmm26,zmm23,zmm3 + vpaddq zmm14,zmm14,zmm28 + vpmuludq zmm27,zmm24,zmm3 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq 
zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm21,zmm4 + vpmuludq zmm26,zmm22,zmm4 + vpmuludq zmm27,zmm23,zmm4 + vpaddq zmm0,zmm11,zmm25 + vpaddq zmm1,zmm12,zmm26 + vpaddq zmm2,zmm13,zmm27 + + + + + vpsrlq zmm9,zmm7,52 + vpsllq zmm10,zmm6,12 + + vpsrlq zmm3,zmm14,26 + vpandq zmm14,zmm14,zmm5 + vpaddq zmm4,zmm15,zmm3 + + vporq zmm9,zmm9,zmm10 + + vpsrlq zmm11,zmm0,26 + vpandq zmm0,zmm0,zmm5 + vpaddq zmm1,zmm1,zmm11 + + vpandq zmm9,zmm9,zmm5 + + vpsrlq zmm15,zmm4,26 + vpandq zmm4,zmm4,zmm5 + + vpsrlq zmm12,zmm1,26 + vpandq zmm1,zmm1,zmm5 + vpaddq zmm2,zmm2,zmm12 + + vpaddq zmm0,zmm0,zmm15 + vpsllq zmm15,zmm15,2 + vpaddq zmm0,zmm0,zmm15 + + vpaddq zmm2,zmm2,zmm9 + vpsrlq zmm8,zmm7,26 + + vpsrlq zmm13,zmm2,26 + vpandq zmm2,zmm2,zmm5 + vpaddq zmm3,zmm14,zmm13 + + vpsrlq zmm10,zmm6,14 + + vpsrlq zmm11,zmm0,26 + vpandq zmm0,zmm0,zmm5 + vpaddq zmm1,zmm1,zmm11 + + vpsrlq zmm6,zmm6,40 + + vpsrlq zmm14,zmm3,26 + vpandq zmm3,zmm3,zmm5 + vpaddq zmm4,zmm4,zmm14 + + vpandq zmm7,zmm7,zmm5 + + + + + sub rdx,128 + ja NEAR $L$oop_avx512 + +$L$tail_avx512: + + + + + + vpsrlq zmm16,zmm16,32 + vpsrlq zmm17,zmm17,32 + vpsrlq zmm18,zmm18,32 + vpsrlq zmm23,zmm23,32 + vpsrlq zmm24,zmm24,32 + vpsrlq zmm19,zmm19,32 + vpsrlq zmm20,zmm20,32 + vpsrlq zmm21,zmm21,32 + vpsrlq zmm22,zmm22,32 + + + + lea rsi,[rdx*1+rsi] + + + vpaddq zmm0,zmm7,zmm0 + + vpmuludq zmm14,zmm17,zmm2 + vpmuludq zmm15,zmm18,zmm2 + vpmuludq zmm11,zmm23,zmm2 + vpandq zmm8,zmm8,zmm5 + vpmuludq zmm12,zmm24,zmm2 + vpandq zmm10,zmm10,zmm5 + vpmuludq zmm13,zmm16,zmm2 + vporq zmm6,zmm6,zmm30 + vpaddq zmm1,zmm8,zmm1 + vpaddq zmm3,zmm10,zmm3 + vpaddq zmm4,zmm6,zmm4 + + vmovdqu xmm7,XMMWORD[rsi] + vpmuludq zmm28,zmm19,zmm0 + vpmuludq zmm29,zmm20,zmm0 + vpmuludq zmm25,zmm16,zmm0 + vpmuludq zmm26,zmm17,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + + vmovdqu xmm8,XMMWORD[16+rsi] + vpmuludq zmm28,zmm18,zmm1 + vpmuludq zmm29,zmm19,zmm1 + vpmuludq zmm25,zmm24,zmm1 + vpmuludq zmm27,zmm18,zmm0 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm13,zmm13,zmm27 + + vinserti128 ymm7,ymm7,XMMWORD[32+rsi],1 + vpmuludq zmm28,zmm16,zmm3 + vpmuludq zmm29,zmm17,zmm3 + vpmuludq zmm26,zmm16,zmm1 + vpmuludq zmm27,zmm17,zmm1 + vpaddq zmm14,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vinserti128 ymm8,ymm8,XMMWORD[48+rsi],1 + vpmuludq zmm28,zmm24,zmm4 + vpmuludq zmm29,zmm16,zmm4 + vpmuludq zmm25,zmm22,zmm3 + vpmuludq zmm26,zmm23,zmm3 + vpmuludq zmm27,zmm24,zmm3 + vpaddq zmm3,zmm14,zmm28 + vpaddq zmm15,zmm15,zmm29 + vpaddq zmm11,zmm11,zmm25 + vpaddq zmm12,zmm12,zmm26 + vpaddq zmm13,zmm13,zmm27 + + vpmuludq zmm25,zmm21,zmm4 + vpmuludq zmm26,zmm22,zmm4 + vpmuludq zmm27,zmm23,zmm4 + vpaddq zmm0,zmm11,zmm25 + vpaddq zmm1,zmm12,zmm26 + vpaddq zmm2,zmm13,zmm27 + + + + + mov eax,1 + vpermq zmm14,zmm3,0xb1 + vpermq zmm4,zmm15,0xb1 + vpermq zmm11,zmm0,0xb1 + vpermq zmm12,zmm1,0xb1 + vpermq zmm13,zmm2,0xb1 + vpaddq zmm3,zmm3,zmm14 + vpaddq zmm4,zmm4,zmm15 + vpaddq zmm0,zmm0,zmm11 + vpaddq zmm1,zmm1,zmm12 + vpaddq zmm2,zmm2,zmm13 + + kmovw k3,eax + vpermq zmm14,zmm3,0x2 + vpermq zmm15,zmm4,0x2 + vpermq zmm11,zmm0,0x2 + vpermq zmm12,zmm1,0x2 + vpermq zmm13,zmm2,0x2 + vpaddq zmm3,zmm3,zmm14 + vpaddq zmm4,zmm4,zmm15 + vpaddq zmm0,zmm0,zmm11 + vpaddq zmm1,zmm1,zmm12 + vpaddq zmm2,zmm2,zmm13 + + vextracti64x4 ymm14,zmm3,0x1 + vextracti64x4 ymm15,zmm4,0x1 + vextracti64x4 ymm11,zmm0,0x1 + vextracti64x4 ymm12,zmm1,0x1 + vextracti64x4 ymm13,zmm2,0x1 
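+; Lane reduction: fold the eight 64-bit lanes of each zmm accumulator
+; into lane 0. The vpermq 0xb1 shuffles pair up neighbouring qwords for
+; the vpaddq folds, vpermq 0x2 folds the two 128-bit halves of each
+; 256-bit lane, and the vextracti64x4 extracts plus the masked vpaddq
+; below (k3 = 0x0001, loaded above) add the upper 256 bits into the low
+; lane, zeroing the rest ahead of the final base-2^26 carry propagation.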
+ vpaddq zmm3{k3}{z},zmm3,zmm14 + vpaddq zmm4{k3}{z},zmm4,zmm15 + vpaddq zmm0{k3}{z},zmm0,zmm11 + vpaddq zmm1{k3}{z},zmm1,zmm12 + vpaddq zmm2{k3}{z},zmm2,zmm13 + + + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpsrldq ymm9,ymm7,6 + vpsrldq ymm10,ymm8,6 + vpunpckhqdq ymm6,ymm7,ymm8 + vpaddq ymm4,ymm4,ymm14 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpunpcklqdq ymm9,ymm9,ymm10 + vpunpcklqdq ymm7,ymm7,ymm8 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm15,ymm4,26 + vpand ymm4,ymm4,ymm5 + + vpsrlq ymm12,ymm1,26 + vpand ymm1,ymm1,ymm5 + vpsrlq ymm10,ymm9,30 + vpsrlq ymm9,ymm9,4 + vpaddq ymm2,ymm2,ymm12 + + vpaddq ymm0,ymm0,ymm15 + vpsllq ymm15,ymm15,2 + vpsrlq ymm8,ymm7,26 + vpsrlq ymm6,ymm6,40 + vpaddq ymm0,ymm0,ymm15 + + vpsrlq ymm13,ymm2,26 + vpand ymm2,ymm2,ymm5 + vpand ymm9,ymm9,ymm5 + vpand ymm7,ymm7,ymm5 + vpaddq ymm3,ymm3,ymm13 + + vpsrlq ymm11,ymm0,26 + vpand ymm0,ymm0,ymm5 + vpaddq ymm2,ymm9,ymm2 + vpand ymm8,ymm8,ymm5 + vpaddq ymm1,ymm1,ymm11 + + vpsrlq ymm14,ymm3,26 + vpand ymm3,ymm3,ymm5 + vpand ymm10,ymm10,ymm5 + vpor ymm6,ymm6,YMMWORD[32+rcx] + vpaddq ymm4,ymm4,ymm14 + + lea rax,[144+rsp] + add rdx,64 + jnz NEAR $L$tail_avx2_512 + + vpsubq ymm2,ymm2,ymm9 + vmovd DWORD[(-112)+rdi],xmm0 + vmovd DWORD[(-108)+rdi],xmm1 + vmovd DWORD[(-104)+rdi],xmm2 + vmovd DWORD[(-100)+rdi],xmm3 + vmovd DWORD[(-96)+rdi],xmm4 + vzeroall + movdqa xmm6,XMMWORD[80+r11] + movdqa xmm7,XMMWORD[96+r11] + movdqa xmm8,XMMWORD[112+r11] + movdqa xmm9,XMMWORD[128+r11] + movdqa xmm10,XMMWORD[144+r11] + movdqa xmm11,XMMWORD[160+r11] + movdqa xmm12,XMMWORD[176+r11] + movdqa xmm13,XMMWORD[192+r11] + movdqa xmm14,XMMWORD[208+r11] + movdqa xmm15,XMMWORD[224+r11] + lea rsp,[248+r11] +$L$do_avx512_epilogue: + DB 0F3h,0C3h ;repret + + +EXTERN __imp_RtlVirtualUnwind + +ALIGN 16 +se_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + lea rax,[48+rax] + + mov rbx,QWORD[((-8))+rax] + mov rbp,QWORD[((-16))+rax] + mov r12,QWORD[((-24))+rax] + mov r13,QWORD[((-32))+rax] + mov r14,QWORD[((-40))+rax] + mov r15,QWORD[((-48))+rax] + mov QWORD[144+r8],rbx + mov QWORD[160+r8],rbp + mov QWORD[216+r8],r12 + mov QWORD[224+r8],r13 + mov QWORD[232+r8],r14 + mov QWORD[240+r8],r15 + + jmp NEAR $L$common_seh_tail + + + +ALIGN 16 +avx_handler: + push rsi + push rdi + push rbx + push rbp + push r12 + push r13 + push r14 + push r15 + pushfq + sub rsp,64 + + mov rax,QWORD[120+r8] + mov rbx,QWORD[248+r8] + + mov rsi,QWORD[8+r9] + mov r11,QWORD[56+r9] + + mov r10d,DWORD[r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jb NEAR $L$common_seh_tail + + mov rax,QWORD[152+r8] + + mov r10d,DWORD[4+r11] + lea r10,[r10*1+rsi] + cmp rbx,r10 + jae NEAR $L$common_seh_tail + + mov rax,QWORD[208+r8] + + lea rsi,[80+rax] + lea rax,[248+rax] + lea rdi,[512+r8] + mov ecx,20 + DD 0xa548f3fc + +$L$common_seh_tail: + mov rdi,QWORD[8+rax] + mov rsi,QWORD[16+rax] + mov QWORD[152+r8],rax + mov QWORD[168+r8],rsi + mov QWORD[176+r8],rdi + + mov rdi,QWORD[40+r9] + mov rsi,r8 + mov ecx,154 + DD 0xa548f3fc + + mov rsi,r9 + xor rcx,rcx + mov rdx,QWORD[8+rsi] + mov r8,QWORD[rsi] + mov r9,QWORD[16+rsi] + mov r10,QWORD[40+rsi] + lea r11,[56+rsi] + lea r12,[24+rsi] + mov QWORD[32+rsp],r10 + mov QWORD[40+rsp],r11 + 
mov QWORD[48+rsp],r12 + mov QWORD[56+rsp],rcx + call QWORD[__imp_RtlVirtualUnwind] + + mov eax,1 + add rsp,64 + popfq + pop r15 + pop r14 + pop r13 + pop r12 + pop rbp + pop rbx + pop rdi + pop rsi + DB 0F3h,0C3h ;repret + + +section .pdata rdata align=4 +ALIGN 4 + DD $L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_init_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_init wrt ..imagebase + + DD $L$SEH_begin_poly1305_blocks_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks wrt ..imagebase + + DD $L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase + DD $L$SEH_end_poly1305_emit_x86_64 wrt ..imagebase + DD $L$SEH_info_poly1305_emit wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx wrt ..imagebase + DD $L$base2_64_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_1 wrt ..imagebase + + DD $L$base2_64_avx wrt ..imagebase + DD $L$even_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_2 wrt ..imagebase + + DD $L$even_avx wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx_3 wrt ..imagebase + + DD $L$SEH_begin_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_end_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_info_poly1305_emit_avx wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx2 wrt ..imagebase + DD $L$base2_64_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_1 wrt ..imagebase + + DD $L$base2_64_avx2 wrt ..imagebase + DD $L$even_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_2 wrt ..imagebase + + DD $L$even_avx2 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx2 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx2_3 wrt ..imagebase + DD $L$SEH_begin_poly1305_blocks_avx512 wrt ..imagebase + DD $L$SEH_end_poly1305_blocks_avx512 wrt ..imagebase + DD $L$SEH_info_poly1305_blocks_avx512 wrt ..imagebase +section .xdata rdata align=8 +ALIGN 8 +$L$SEH_info_poly1305_init: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase,$L$SEH_begin_poly1305_init_x86_64 wrt ..imagebase + +$L$SEH_info_poly1305_blocks: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_body wrt ..imagebase,$L$blocks_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_emit: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase,$L$SEH_begin_poly1305_emit_x86_64 wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx_1: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_avx_body wrt ..imagebase,$L$blocks_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx_2: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$base2_64_avx_body wrt ..imagebase,$L$base2_64_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx_3: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx_body wrt ..imagebase,$L$do_avx_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_emit_avx: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$SEH_begin_poly1305_emit_avx wrt ..imagebase,$L$SEH_begin_poly1305_emit_avx wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx2_1: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$blocks_avx2_body wrt ..imagebase,$L$blocks_avx2_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx2_2: +DB 9,0,0,0 + DD se_handler wrt ..imagebase + DD $L$base2_64_avx2_body wrt ..imagebase,$L$base2_64_avx2_epilogue wrt ..imagebase + +$L$SEH_info_poly1305_blocks_avx2_3: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx2_body wrt 
..imagebase,$L$do_avx2_epilogue wrt ..imagebase +$L$SEH_info_poly1305_blocks_avx512: +DB 9,0,0,0 + DD avx_handler wrt ..imagebase + DD $L$do_avx512_body wrt ..imagebase,$L$do_avx512_epilogue wrt ..imagebase diff --git a/crypto/siphash.cpp b/crypto/siphash.cpp new file mode 100644 index 0000000..98033a9 --- /dev/null +++ b/crypto/siphash.cpp @@ -0,0 +1,193 @@ +/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + * + * This file is provided under a dual BSD/GPLv2 license. + * + * SipHash: a fast short-input PRF + * https://131002.net/siphash/ + * + * This implementation is specifically for SipHash2-4 for a secure PRF + * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for + * hashtables. + */ +#include "stdafx.h" + +#include "crypto/siphash.h" +#include "tunsafe_endian.h" + +#define SIPROUND \ + do { \ + v0 += v1; v1 = rol64(v1, 13); v1 ^= v0; v0 = rol64(v0, 32); \ + v2 += v3; v3 = rol64(v3, 16); v3 ^= v2; \ + v0 += v3; v3 = rol64(v3, 21); v3 ^= v0; \ + v2 += v1; v1 = rol64(v1, 17); v1 ^= v2; v2 = rol64(v2, 32); \ + } while (0) + +#define PREAMBLE(len) \ + uint64 v0 = 0x736f6d6570736575ULL; \ + uint64 v1 = 0x646f72616e646f6dULL; \ + uint64 v2 = 0x6c7967656e657261ULL; \ + uint64 v3 = 0x7465646279746573ULL; \ + uint64 b = ((uint64)(len)) << 56; \ + v3 ^= key->key[1]; \ + v2 ^= key->key[0]; \ + v1 ^= key->key[1]; \ + v0 ^= key->key[0]; + +#define POSTAMBLE \ + v3 ^= b; \ + SIPROUND; \ + SIPROUND; \ + v0 ^= b; \ + v2 ^= 0xff; \ + SIPROUND; \ + SIPROUND; \ + SIPROUND; \ + SIPROUND; \ + return (v0 ^ v1) ^ (v2 ^ v3); + +uint64 siphash(const void *data, size_t len, const siphash_key_t *key) { + const uint8 *end = (uint8*)data + len - (len % sizeof(uint64)); + const uint8 left = len & (sizeof(uint64) - 1); + uint64 m; + PREAMBLE(len) + for (; data != end; data = (uint8*)data + sizeof(uint64)) { + m = ReadLE64(data); + v3 ^= m; + SIPROUND; + SIPROUND; + v0 ^= m; + } + switch (left) { + case 7: b |= ((uint64)end[6]) << 48; + case 6: b |= ((uint64)end[5]) << 40; + case 5: b |= ((uint64)end[4]) << 32; + case 4: b |= ReadLE32(data); break; + case 3: b |= ((uint64)end[2]) << 16; + case 2: b |= ReadLE16(data); break; + case 1: b |= end[0]; + } + POSTAMBLE +} + +/** + * siphash_1u64 - compute 64-bit siphash PRF value of a uint64 + * @first: first uint64 + * @key: the siphash key + */ +uint64 siphash_1u64(const uint64 first, const siphash_key_t *key) +{ + PREAMBLE(8) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + POSTAMBLE +} + +/** + * siphash_2u64 - compute 64-bit siphash PRF value of 2 uint64 + * @first: first uint64 + * @second: second uint64 + * @key: the siphash key + */ +uint64 siphash_2u64(const uint64 first, const uint64 second, const siphash_key_t *key) +{ + PREAMBLE(16) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + POSTAMBLE +} + +/** + * siphash_3u64 - compute 64-bit siphash PRF value of 3 uint64 + * @first: first uint64 + * @second: second uint64 + * @third: third uint64 + * @key: the siphash key + */ +uint64 siphash_3u64(const uint64 first, const uint64 second, const uint64 third, + const siphash_key_t *key) +{ + PREAMBLE(24) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + v3 ^= third; + SIPROUND; + SIPROUND; + v0 ^= third; + POSTAMBLE +} + +/** + * siphash_4u64 - compute 64-bit siphash PRF value of 4 uint64 + * @first: first uint64 + * @second: second uint64 + * @third: third uint64 + * @fourth: fourth uint64 + * @key: the
siphash key + */ +uint64 siphash_4u64(const uint64 first, const uint64 second, const uint64 third, + const uint64 fourth, const siphash_key_t *key) +{ + PREAMBLE(32) + v3 ^= first; + SIPROUND; + SIPROUND; + v0 ^= first; + v3 ^= second; + SIPROUND; + SIPROUND; + v0 ^= second; + v3 ^= third; + SIPROUND; + SIPROUND; + v0 ^= third; + v3 ^= fourth; + SIPROUND; + SIPROUND; + v0 ^= fourth; + POSTAMBLE +} + +uint64 siphash_1u32(const uint32 first, const siphash_key_t *key) +{ + PREAMBLE(4) + b |= first; + POSTAMBLE +} + +uint64 siphash_3u32(const uint32 first, const uint32 second, const uint32 third, + const siphash_key_t *key) +{ + uint64 combined = (uint64)second << 32 | first; + PREAMBLE(12) + v3 ^= combined; + SIPROUND; + SIPROUND; + v0 ^= combined; + b |= third; + POSTAMBLE +} + +uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key) { + PREAMBLE(12) + v3 ^= combined; + SIPROUND; + SIPROUND; + v0 ^= combined; + b |= third; + POSTAMBLE +} + diff --git a/crypto/siphash.h b/crypto/siphash.h new file mode 100644 index 0000000..3b5dc74 --- /dev/null +++ b/crypto/siphash.h @@ -0,0 +1,53 @@ +/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + * + * This file is provided under a dual BSD/GPLv2 license. + * + * SipHash: a fast short-input PRF + * https://131002.net/siphash/ + * + * This implementation is specifically for SipHash2-4 for a secure PRF + * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for + * hashtables. + */ + +#ifndef TUNSAFE_CRYPTO_SIPHASH_H_ +#define TUNSAFE_CRYPTO_SIPHASH_H_ + +#include "tunsafe_types.h" + +typedef struct { + uint64 key[2]; +} siphash_key_t; + +uint64 siphash_1u64(const uint64 a, const siphash_key_t *key); +uint64 siphash_2u64(const uint64 a, const uint64 b, const siphash_key_t *key); +uint64 siphash_3u64(const uint64 a, const uint64 b, const uint64 c, + const siphash_key_t *key); +uint64 siphash_4u64(const uint64 a, const uint64 b, const uint64 c, const uint64 d, + const siphash_key_t *key); +uint64 siphash_1u32(const uint32 a, const siphash_key_t *key); +uint64 siphash_3u32(const uint32 a, const uint32 b, const uint32 c, + const siphash_key_t *key); + +static inline uint64 siphash_2u32(const uint32 a, const uint32 b, + const siphash_key_t *key) +{ + return siphash_1u64((uint64)b << 32 | a, key); +} +static inline uint64 siphash_4u32(const uint32 a, const uint32 b, const uint32 c, + const uint32 d, const siphash_key_t *key) +{ + return siphash_2u64((uint64)b << 32 | a, (uint64)d << 32 | c, key); +} + +uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key); + +/** + * siphash - compute 64-bit siphash PRF value + * @data: buffer to hash + * @len: size of @data + * @key: the siphash key + */ +uint64 siphash(const void *data, size_t len, const siphash_key_t *key); + +#endif // TUNSAFE_CRYPTO_SIPHASH_H_ diff --git a/crypto/x86_64-xlate.pl b/crypto/x86_64-xlate.pl new file mode 100644 index 0000000..c1ae6ad --- /dev/null +++ b/crypto/x86_64-xlate.pl @@ -0,0 +1,1433 @@ +#! /usr/bin/env perl +# Copyright 2005-2016 The OpenSSL Project Authors. All Rights Reserved. +# +# Licensed under the OpenSSL license (the "License"). You may not use +# this file except in compliance with the License. You can obtain a copy +# in the file LICENSE in the source distribution or at +# https://www.openssl.org/source/license.html + + +# Ascetic x86_64 AT&T to MASM/NASM assembler translator by <appro@openssl.org>. +# +# Why AT&T to MASM and not vice versa? Several reasons.
Because AT&T +# format is way easier to parse. Because it's simpler to "gear" from +# Unix ABI to Windows one [see cross-reference "card" at the end of +# file]. Because Linux targets were available first... +# +# In addition the script also "distills" code suitable for GNU +# assembler, so that it can be compiled with more rigid assemblers, +# such as Solaris /usr/ccs/bin/as. +# +# This translator is not designed to convert *arbitrary* assembler +# code from AT&T format to MASM one. It's designed to convert just +# enough to provide for dual-ABI OpenSSL modules development... +# There *are* limitations and you might have to modify your assembler +# code or this script to achieve the desired result... +# +# Currently recognized limitations: +# +# - can't use multiple ops per line; +# +# Dual-ABI styling rules. +# +# 1. Adhere to Unix register and stack layout [see cross-reference +# ABI "card" at the end for explanation]. +# 2. Forget about "red zone," stick to more traditional blended +# stack frame allocation. If volatile storage is actually required +# that is. If not, just leave the stack as is. +# 3. Functions tagged with ".type name,@function" get crafted with +# unified Win64 prologue and epilogue automatically. If you want +# to take care of ABI differences yourself, tag functions as +# ".type name,@abi-omnipotent" instead. +# 4. To optimize the Win64 prologue you can specify number of input +# arguments as ".type name,@function,N." Keep in mind that if N is +# larger than 6, then you *have to* write "abi-omnipotent" code, +# because >6 cases can't be addressed with unified prologue. +# 5. Name local labels as .L*, do *not* use dynamic labels such as 1: +# (sorry about latter). +# 6. Don't use [or hand-code with .byte] "rep ret." "ret" mnemonic is +# required to identify the spots, where to inject Win64 epilogue! +# But on the pros, it's then prefixed with rep automatically:-) +# 7. Stick to explicit ip-relative addressing. If you have to use +# GOTPCREL addressing, stick to mov symbol@GOTPCREL(%rip),%r??. +# Both are recognized and translated to proper Win64 addressing +# modes. +# +# 8. In order to provide for structured exception handling unified +# Win64 prologue copies %rsp value to %rax. For further details +# see SEH paragraph at the end. +# 9. .init segment is allowed to contain calls to functions only. +# a. If function accepts more than 4 arguments *and* >4th argument +# is declared as non 64-bit value, do clear its upper part. + + +use strict; + +my $flavour = shift; +my $output = shift; +if ($flavour =~ /\./) { $output = $flavour; undef $flavour; } + +open STDOUT,">$output" || die "can't open $output: $!" 
+ if (defined($output)); + +my $gas=1; $gas=0 if ($output =~ /\.asm$/); +my $elf=1; $elf=0 if (!$gas); +my $win64=0; +my $prefix=""; +my $decor=".L"; + +my $masmref=8 + 50727*2**-32; # 8.00.50727 shipped with VS2005 +my $masm=0; +my $PTR=" PTR"; + +my $nasmref=2.03; +my $nasm=0; + +if ($flavour eq "mingw64") { $gas=1; $elf=0; $win64=1; + $prefix=`echo __USER_LABEL_PREFIX__ | $ENV{CC} -E -P -`; + $prefix =~ s|\R$||; # Better chomp + } +elsif ($flavour eq "macosx") { $gas=1; $elf=0; $prefix="_"; $decor="L\$"; } +elsif ($flavour eq "masm") { $gas=0; $elf=0; $masm=$masmref; $win64=1; $decor="\$L\$"; } +elsif ($flavour eq "nasm") { $gas=0; $elf=0; $nasm=$nasmref; $win64=1; $decor="\$L\$"; $PTR=""; } +elsif (!$gas) +{ if ($ENV{ASM} =~ m/nasm/ && `nasm -v` =~ m/version ([0-9]+)\.([0-9]+)/i) + { $nasm = $1 + $2*0.01; $PTR=""; } + elsif (`ml64 2>&1` =~ m/Version ([0-9]+)\.([0-9]+)(\.([0-9]+))?/) + { $masm = $1 + $2*2**-16 + $4*2**-32; } + die "no assembler found on %PATH%" if (!($nasm || $masm)); + $win64=1; + $elf=0; + $decor="\$L\$"; +} + +my $current_segment; +my $current_function; +my %globals; + +{ package opcode; # pick up opcodes + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + + if ($$line =~ /^([a-z][a-z0-9]*)/i) { + bless $self,$class; + $self->{op} = $1; + $ret = $self; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + undef $self->{sz}; + if ($self->{op} =~ /^(movz)x?([bw]).*/) { # movz is pain... + $self->{op} = $1; + $self->{sz} = $2; + } elsif ($self->{op} =~ /call|jmp/) { + $self->{sz} = ""; + } elsif ($self->{op} =~ /^p/ && $' !~ /^(ush|op|insrw)/) { # SSEn + $self->{sz} = ""; + } elsif ($self->{op} =~ /^[vk]/) { # VEX or k* such as kmov + $self->{sz} = ""; + } elsif ($self->{op} =~ /mov[dq]/ && $$line =~ /%xmm/) { + $self->{sz} = ""; + } elsif ($self->{op} =~ /([a-z]{3,})([qlwb])$/) { + $self->{op} = $1; + $self->{sz} = $2; + } + } + $ret; + } + sub size { + my ($self, $sz) = @_; + $self->{sz} = $sz if (defined($sz) && !defined($self->{sz})); + $self->{sz}; + } + sub out { + my $self = shift; + if ($gas) { + if ($self->{op} eq "movz") { # movz is pain... + sprintf "%s%s%s",$self->{op},$self->{sz},shift; + } elsif ($self->{op} =~ /^set/) { + "$self->{op}"; + } elsif ($self->{op} eq "ret") { + my $epilogue = ""; + if ($win64 && $current_function->{abi} eq "svr4") { + $epilogue = "movq 8(%rsp),%rdi\n\t" . + "movq 16(%rsp),%rsi\n\t"; + } + #$epilogue . ".byte 0xf3,0xc3"; + $epilogue . "ret"; + } elsif ($self->{op} eq "call" && !$elf && $current_segment eq ".init") { + ".p2align\t3\n\t.quad"; + } else { + "$self->{op}$self->{sz}"; + } + } else { + $self->{op} =~ s/^movz/movzx/; + if ($self->{op} eq "ret") { + $self->{op} = ""; + if ($win64 && $current_function->{abi} eq "svr4") { + $self->{op} = "mov rdi,QWORD$PTR\[8+rsp\]\t;WIN64 epilogue\n\t". 
+                        "mov rsi,QWORD$PTR\[16+rsp\]\n\t";
+            }
+            $self->{op} .= "ret";
+        } elsif ($self->{op} =~ /^(pop|push)f/) {
+            $self->{op} .= $self->{sz};
+        } elsif ($self->{op} eq "call" && $current_segment eq ".CRT\$XCU") {
+            $self->{op} = "\tDQ";
+        }
+        $self->{op};
+        }
+    }
+    sub mnemonic {
+        my ($self, $op) = @_;
+        $self->{op}=$op if (defined($op));
+        $self->{op};
+    }
+}
+{ package const;    # pick up constants, which start with $
+    sub re {
+        my ($class, $line) = @_;
+        my $self = {};
+        my $ret;
+
+        if ($$line =~ /^\$([^,]+)/) {
+            bless $self, $class;
+            $self->{value} = $1;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+        }
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+
+        $self->{value} =~ s/\b(0b[0-1]+)/oct($1)/eig;
+        if ($gas) {
+            # Solaris /usr/ccs/bin/as can't handle multiplications
+            # in $self->{value}
+            my $value = $self->{value};
+            no warnings;    # oct might complain about overflow, ignore here...
+            $value =~ s/(?<![\w\$\.])(0x?[0-9a-f]+)/oct($1)/egi;
+            if ($value =~ s/([0-9]+\s*[\*\/\%]\s*[0-9]+)/eval($1)/eg) {
+                $self->{value} = $value;
+            }
+            sprintf "\$%s",$self->{value};
+        } else {
+            my $value = $self->{value};
+            $value =~ s/0x([0-9a-f]+)/0$1h/ig if ($masm);
+            sprintf "%s",$value;
+        }
+    }
+}
+{ package ea;       # pick up effective addresses: expr(%reg,%reg,scale)
+
+    my %szmap = (   b=>"BYTE$PTR",    w=>"WORD$PTR",
+                    l=>"DWORD$PTR",   d=>"DWORD$PTR",
+                    q=>"QWORD$PTR",   o=>"OWORD$PTR",
+                    x=>"XMMWORD$PTR", y=>"YMMWORD$PTR",
+                    z=>"ZMMWORD$PTR" ) if (!$gas);
+
+    sub re {
+        my ($class, $line, $opcode) = @_;
+        my $self = {};
+        my $ret;
+
+        # optional * ----vvv--- appears in indirect jmp/call
+        if ($$line =~ /^(\*?)([^\(,]*)\(([%\w,]+)\)((?:{[^}]+})*)/) {
+            bless $self, $class;
+            $self->{asterisk} = $1;
+            $self->{label} = $2;
+            ($self->{base},$self->{index},$self->{scale})=split(/,/,$3);
+            $self->{scale} = 1 if (!defined($self->{scale}));
+            $self->{opmask} = $4;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+
+            if ($win64 && $self->{label} =~ s/\@GOTPCREL//) {
+                die if ($opcode->mnemonic() ne "mov");
+                $opcode->mnemonic("lea");
+            }
+            $self->{base}  =~ s/^%//;
+            $self->{index} =~ s/^%// if (defined($self->{index}));
+            $self->{opcode} = $opcode;
+        }
+        $ret;
+    }
+    sub size {}
+    sub out {
+        my ($self, $sz) = @_;
+
+        $self->{label} =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei;
+        $self->{label} =~ s/\.L/$decor/g;
+
+        # Silently convert all EAs to 64-bit. This is required for
+        # elder GNU assembler and results in more compact code,
+        # *but* most importantly AES module depends on this feature!
+        $self->{index} =~ s/^[er](.?[0-9xpi])[d]?$/r\1/;
+        $self->{base}  =~ s/^[er](.?[0-9xpi])[d]?$/r\1/;
+
+        # Solaris /usr/ccs/bin/as can't handle multiplications
+        # in $self->{label}...
+        use integer;
+        $self->{label} =~ s/(?<![\w\$\.])(0x?[0-9a-f]+)/oct($1)/egi;
+        $self->{label} =~ s/\b([0-9]+\s*[\*\/\%]\s*[0-9]+)\b/eval($1)/eg;
+
+        # Some assemblers insist on signed presentation of 32-bit
+        # offsets, but sign extension is a tricky business in perl...
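+        # To illustrate (assuming 64-bit integer perl): under "use
+        # integer" the expression 4294967292<<32>>32 evaluates to -4,
+        # i.e. the arithmetic right shift replicates bit 31 into the
+        # upper half. On a 32-bit perl (1<<31)<<1 overflows to 0 and
+        # the no-op $1>>0 branch below is taken instead.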
+        if ((1<<31)<<1) {
+            $self->{label} =~ s/\b([0-9]+)\b/$1<<32>>32/eg;
+        } else {
+            $self->{label} =~ s/\b([0-9]+)\b/$1>>0/eg;
+        }
+
+        # if base register is %rbp or %r13, see if it's possible to
+        # flip base and index registers [for better performance]
+        if (!$self->{label} && $self->{index} && $self->{scale}==1 &&
+            $self->{base} =~ /(rbp|r13)/) {
+                $self->{base} = $self->{index}; $self->{index} = $1;
+        }
+
+        if ($gas) {
+            $self->{label} =~ s/^___imp_/__imp__/ if ($flavour eq "mingw64");
+
+            if (defined($self->{index})) {
+                sprintf "%s%s(%s,%%%s,%d)%s",
+                        $self->{asterisk},$self->{label},
+                        $self->{base}?"%$self->{base}":"",
+                        $self->{index},$self->{scale},
+                        $self->{opmask};
+            } else {
+                sprintf "%s%s(%%%s)%s", $self->{asterisk},$self->{label},
+                        $self->{base},$self->{opmask};
+            }
+        } else {
+            $self->{label} =~ s/\./\$/g;
+            $self->{label} =~ s/(?<![\w\$\.])0x([0-9a-f]+)/0$1h/ig;
+            $self->{label} = "($self->{label})" if ($self->{label} =~ /[\*\+\-\/]/);
+
+            my $mnemonic = $self->{opcode}->mnemonic();
+            ($self->{asterisk})                         && ($sz="q") ||
+            ($mnemonic =~ /^v?mov([qd])$/)              && ($sz=$1)  ||
+            ($mnemonic =~ /^v?pinsr([qdwb])$/)          && ($sz=$1)  ||
+            ($mnemonic =~ /^vpbroadcast([qdwb])$/)      && ($sz=$1)  ||
+            ($mnemonic =~ /^v(?!perm)[a-z]+[fi]128$/)   && ($sz="x");
+
+            $self->{opmask} =~ s/%(k[0-7])/$1/;
+
+            if (defined($self->{index})) {
+                sprintf "%s[%s%s*%d%s]%s",$szmap{$sz},
+                        $self->{label}?"$self->{label}+":"",
+                        $self->{index},$self->{scale},
+                        $self->{base}?"+$self->{base}":"",
+                        $self->{opmask};
+            } elsif ($self->{base} eq "rip") {
+                sprintf "%s[%s]",$szmap{$sz},$self->{label};
+            } else {
+                sprintf "%s[%s%s]%s",   $szmap{$sz},
+                        $self->{label}?"$self->{label}+":"",
+                        $self->{base},$self->{opmask};
+            }
+        }
+    }
+}
+{ package register; # pick up registers, which start with %.
+    sub re {
+        my ($class, $line, $opcode) = @_;
+        my $self = {};
+        my $ret;
+
+        # optional * ----vvv--- appears in indirect jmp/call
+        if ($$line =~ /^(\*?)%(\w+)((?:{[^}]+})*)/) {
+            bless $self,$class;
+            $self->{asterisk} = $1;
+            $self->{value} = $2;
+            $self->{opmask} = $3;
+            $opcode->size($self->size());
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+        }
+        $ret;
+    }
+    sub size {
+        my $self = shift;
+        my $ret;
+
+        if    ($self->{value} =~ /^r[\d]+b$/i)  { $ret="b"; }
+        elsif ($self->{value} =~ /^r[\d]+w$/i)  { $ret="w"; }
+        elsif ($self->{value} =~ /^r[\d]+d$/i)  { $ret="l"; }
+        elsif ($self->{value} =~ /^r[\w]+$/i)   { $ret="q"; }
+        elsif ($self->{value} =~ /^[a-d][hl]$/i){ $ret="b"; }
+        elsif ($self->{value} =~ /^[\w]{2}l$/i) { $ret="b"; }
+        elsif ($self->{value} =~ /^[\w]{2}$/i)  { $ret="w"; }
+        elsif ($self->{value} =~ /^e[a-z]{2}$/i){ $ret="l"; }
+
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+        if ($gas)   { sprintf "%s%%%s%s",   $self->{asterisk},
+                                            $self->{value},
+                                            $self->{opmask}; }
+        else        { $self->{opmask} =~ s/%(k[0-7])/$1/;
+                      $self->{value}.$self->{opmask}; }
+    }
+}
+{ package label;    # pick up labels, which end with :
+    sub re {
+        my ($class, $line) = @_;
+        my $self = {};
+        my $ret;
+
+        if ($$line =~ /(^[\.\w]+)\:/) {
+            bless $self,$class;
+            $self->{value} = $1;
+            $ret = $self;
+            $$line = substr($$line,@+[0]); $$line =~ s/^\s+//;
+
+            $self->{value} =~ s/^\.L/$decor/;
+        }
+        $ret;
+    }
+    sub out {
+        my $self = shift;
+
+        if ($gas) {
+            my $func = ($globals{$self->{value}} or $self->{value}) .
":"; + if ($win64 && $current_function->{name} eq $self->{value} + && $current_function->{abi} eq "svr4") { + $func .= "\n"; + $func .= " movq %rdi,8(%rsp)\n"; + $func .= " movq %rsi,16(%rsp)\n"; + $func .= " movq %rsp,%rax\n"; + $func .= "${decor}SEH_begin_$current_function->{name}:\n"; + my $narg = $current_function->{narg}; + $narg=6 if (!defined($narg)); + $func .= " movq %rcx,%rdi\n" if ($narg>0); + $func .= " movq %rdx,%rsi\n" if ($narg>1); + $func .= " movq %r8,%rdx\n" if ($narg>2); + $func .= " movq %r9,%rcx\n" if ($narg>3); + $func .= " movq 40(%rsp),%r8\n" if ($narg>4); + $func .= " movq 48(%rsp),%r9\n" if ($narg>5); + } + $func; + } elsif ($self->{value} ne "$current_function->{name}") { + # Make all labels in masm global. + $self->{value} .= ":" if ($masm); + $self->{value} . ":"; + } elsif ($win64 && $current_function->{abi} eq "svr4") { + my $func = "$current_function->{name}" . + ($nasm ? ":" : "\tPROC $current_function->{scope}") . + "\n"; + $func .= " mov QWORD$PTR\[8+rsp\],rdi\t;WIN64 prologue\n"; + $func .= " mov QWORD$PTR\[16+rsp\],rsi\n"; + $func .= " mov rax,rsp\n"; + $func .= "${decor}SEH_begin_$current_function->{name}:"; + $func .= ":" if ($masm); + $func .= "\n"; + my $narg = $current_function->{narg}; + $narg=6 if (!defined($narg)); + $func .= " mov rdi,rcx\n" if ($narg>0); + $func .= " mov rsi,rdx\n" if ($narg>1); + $func .= " mov rdx,r8\n" if ($narg>2); + $func .= " mov rcx,r9\n" if ($narg>3); + $func .= " mov r8,QWORD$PTR\[40+rsp\]\n" if ($narg>4); + $func .= " mov r9,QWORD$PTR\[48+rsp\]\n" if ($narg>5); + $func .= "\n"; + } else { + "$current_function->{name}". + ($nasm ? ":" : "\tPROC $current_function->{scope}"); + } + } +} +{ package expr; # pick up expressions + sub re { + my ($class, $line, $opcode) = @_; + my $self = {}; + my $ret; + + if ($$line =~ /(^[^,]+)/) { + bless $self,$class; + $self->{value} = $1; + $ret = $self; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + $self->{value} =~ s/\@PLT// if (!$elf); + $self->{value} =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei; + $self->{value} =~ s/\.L/$decor/g; + $self->{opcode} = $opcode; + } + $ret; + } + sub out { + my $self = shift; + if ($nasm && $self->{opcode}->mnemonic()=~m/^j(?![re]cxz)/) { + "NEAR ".$self->{value}; + } else { + $self->{value}; + } + } +} +{ package cfi_directive; + # CFI directives annotate instructions that are significant for + # stack unwinding procedure compliant with DWARF specification, + # see http://dwarfstd.org/. Besides naturally expected for this + # script platform-specific filtering function, this module adds + # three auxiliary synthetic directives not recognized by [GNU] + # assembler: + # + # - .cfi_push to annotate push instructions in prologue, which + # translates to .cfi_adjust_cfa_offset (if needed) and + # .cfi_offset; + # - .cfi_pop to annotate pop instructions in epilogue, which + # translates to .cfi_adjust_cfa_offset (if needed) and + # .cfi_restore; + # - [and most notably] .cfi_cfa_expression which encodes + # DW_CFA_def_cfa_expression and passes it to .cfi_escape as + # byte vector; + # + # CFA expressions were introduced in DWARF specification version + # 3 and describe how to deduce CFA, Canonical Frame Address. This + # becomes handy if your stack frame is variable and you can't + # spare register for [previous] frame pointer. Suggested directive + # syntax is made-up mix of DWARF operator suffixes [subset of] + # and references to registers with optional bias. 
Following example
+    # describes offloaded *original* stack pointer at specific offset
+    # from *current* stack pointer:
+    #
+    #   .cfi_cfa_expression     %rsp+40,deref,+8
+    #
+    # Final +8 has everything to do with the fact that CFA is defined
+    # as reference to top of caller's stack, and on x86_64 call to
+    # subroutine pushes 8-byte return address. In other words original
+    # stack pointer upon entry to a subroutine is 8 bytes off from CFA.
+
+    # Below constants are taken from "DWARF Expressions" section of the
+    # DWARF specification, section is numbered 7.7 in versions 3 and 4.
+    my %DW_OP_simple = (    # no-arg operators, mapped directly
+        deref   => 0x06,    dup     => 0x12,
+        drop    => 0x13,    over    => 0x14,
+        pick    => 0x15,    swap    => 0x16,
+        rot     => 0x17,    xderef  => 0x18,
+
+        abs     => 0x19,    and     => 0x1a,
+        div     => 0x1b,    minus   => 0x1c,
+        mod     => 0x1d,    mul     => 0x1e,
+        neg     => 0x1f,    not     => 0x20,
+        or      => 0x21,    plus    => 0x22,
+        shl     => 0x24,    shr     => 0x25,
+        shra    => 0x26,    xor     => 0x27,
+        );
+
+    my %DW_OP_complex = (   # used in specific subroutines
+        constu      => 0x10,    # uleb128
+        consts      => 0x11,    # sleb128
+        plus_uconst => 0x23,    # uleb128
+        lit0        => 0x30,    # add 0-31 to opcode
+        reg0        => 0x50,    # add 0-31 to opcode
+        breg0       => 0x70,    # add 0-31 to opcode, sleb128
+        regx        => 0x90,    # uleb128
+        fbreg       => 0x91,    # sleb128
+        bregx       => 0x92,    # uleb128, sleb128
+        piece       => 0x93,    # uleb128
+        );
+
+    # Following constants are defined in x86_64 ABI supplement, for
+    # example available at https://www.uclibc.org/docs/psABI-x86_64.pdf,
+    # see section 3.7 "Stack Unwind Algorithm".
+    my %DW_reg_idx = (
+        "%rax"=>0,  "%rdx"=>1,  "%rcx"=>2,  "%rbx"=>3,
+        "%rsi"=>4,  "%rdi"=>5,  "%rbp"=>6,  "%rsp"=>7,
+        "%r8" =>8,  "%r9" =>9,  "%r10"=>10, "%r11"=>11,
+        "%r12"=>12, "%r13"=>13, "%r14"=>14, "%r15"=>15
+        );
+
+    my ($cfa_reg, $cfa_rsp);
+
+    # [us]leb128 format is variable-length integer representation base
+    # 128, with most significant bit of each byte being 0 denoting
+    # *last* most significant digit. See "Variable Length Data" in the
+    # DWARF specification, numbered 7.6 at least in versions 3 and 4.
+    sub sleb128 {
+        use integer;    # get right shift extend sign
+
+        my $val = shift;
+        my $sign = ($val < 0) ? -1 : 0;
+        my @ret = ();
+
+        while(1) {
+            push @ret, $val&0x7f;
+
+            # see if remaining bits are same and equal to most
+            # significant bit of the current digit, if so, it's
+            # last digit...
+            last if (($val>>6) == $sign);
+
+            @ret[-1] |= 0x80;
+            $val >>= 7;
+        }
+
+        return @ret;
+    }
+    sub uleb128 {
+        my $val = shift;
+        my @ret = ();
+
+        while(1) {
+            push @ret, $val&0x7f;
+
+            # see if it's last significant digit...
+            last if (($val >>= 7) == 0);
+
+            @ret[-1] |= 0x80;
+        }
+
+        return @ret;
+    }
+    sub const {
+        my $val = shift;
+
+        if ($val >= 0 && $val < 32) {
+            return ($DW_OP_complex{lit0}+$val);
+        }
+        return ($DW_OP_complex{consts}, sleb128($val));
+    }
+    sub reg {
+        my $val = shift;
+
+        return if ($val !~ m/^(%r\w+)(?:([\+\-])((?:0x)?[0-9a-f]+))?/);
+
+        my $reg = $DW_reg_idx{$1};
+        my $off = eval ("0 $2 $3");
+
+        return (($DW_OP_complex{breg0} + $reg), sleb128($off));
+        # Yes, we use DW_OP_bregX+0 to push register value and not
+        # DW_OP_regX, because the latter would require even DW_OP_piece,
+        # which would be a waste under the circumstances. If you have
+        # to use DW_OP_regX, use "regx:N"...
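+        # A worked example (hypothetical input, matching the header
+        # comment above): reg("%rsp+40") yields (0x77, 0x28), i.e.
+        # DW_OP_breg7 (%rsp is DWARF register 7) followed by
+        # sleb128(40); so ".cfi_cfa_expression %rsp+40,deref,+8"
+        # comes out as ".cfi_escape 0x0f,0x05,0x77,0x28,0x06,0x23,0x08".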
+ } + sub cfa_expression { + my $line = shift; + my @ret; + + foreach my $token (split(/,\s*/,$line)) { + if ($token =~ /^%r/) { + push @ret,reg($token); + } elsif ($token =~ /((?:0x)?[0-9a-f]+)\((%r\w+)\)/) { + push @ret,reg("$2+$1"); + } elsif ($token =~ /(\w+):(\-?(?:0x)?[0-9a-f]+)(U?)/i) { + my $i = 1*eval($2); + push @ret,$DW_OP_complex{$1}, ($3 ? uleb128($i) : sleb128($i)); + } elsif (my $i = 1*eval($token) or $token eq "0") { + if ($token =~ /^\+/) { + push @ret,$DW_OP_complex{plus_uconst},uleb128($i); + } else { + push @ret,const($i); + } + } else { + push @ret,$DW_OP_simple{$token}; + } + } + + # Finally we return DW_CFA_def_cfa_expression, 15, followed by + # length of the expression and of course the expression itself. + return (15,scalar(@ret),@ret); + } + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + + if ($$line =~ s/^\s*\.cfi_(\w+)\s*//) { + bless $self,$class; + $ret = $self; + undef $self->{value}; + my $dir = $1; + + SWITCH: for ($dir) { + # What is $cfa_rsp? Effectively it's difference between %rsp + # value and current CFA, Canonical Frame Address, which is + # why it starts with -8. Recall that CFA is top of caller's + # stack... + /startproc/ && do { ($cfa_reg, $cfa_rsp) = ("%rsp", -8); last; }; + /endproc/ && do { ($cfa_reg, $cfa_rsp) = ("%rsp", 0); last; }; + /def_cfa_register/ + && do { $cfa_reg = $$line; last; }; + /def_cfa_offset/ + && do { $cfa_rsp = -1*eval($$line) if ($cfa_reg eq "%rsp"); + last; + }; + /adjust_cfa_offset/ + && do { $cfa_rsp -= 1*eval($$line) if ($cfa_reg eq "%rsp"); + last; + }; + /def_cfa/ && do { if ($$line =~ /(%r\w+)\s*,\s*(.+)/) { + $cfa_reg = $1; + $cfa_rsp = -1*eval($2) if ($cfa_reg eq "%rsp"); + } + last; + }; + /push/ && do { $dir = undef; + $cfa_rsp -= 8; + if ($cfa_reg eq "%rsp") { + $self->{value} = ".cfi_adjust_cfa_offset\t8\n"; + } + $self->{value} .= ".cfi_offset\t$$line,$cfa_rsp"; + last; + }; + /pop/ && do { $dir = undef; + $cfa_rsp += 8; + if ($cfa_reg eq "%rsp") { + $self->{value} = ".cfi_adjust_cfa_offset\t-8\n"; + } + $self->{value} .= ".cfi_restore\t$$line"; + last; + }; + /cfa_expression/ + && do { $dir = undef; + $self->{value} = ".cfi_escape\t" . + join(",", map(sprintf("0x%02x", $_), + cfa_expression($$line))); + last; + }; + } + + $self->{value} = ".cfi_$dir\t$$line" if ($dir); + + $$line = ""; + } + + return $ret; + } + sub out { + my $self = shift; + return ($elf ? $self->{value} : undef); + } +} +{ package directive; # pick up directives, which start with . + sub re { + my ($class, $line) = @_; + my $self = {}; + my $ret; + my $dir; + + # chain-call to cfi_directive + $ret = cfi_directive->re($line) and return $ret; + + if ($$line =~ /^\s*(\.\w+)/) { + bless $self,$class; + $dir = $1; + $ret = $self; + undef $self->{value}; + $$line = substr($$line,@+[0]); $$line =~ s/^\s+//; + + SWITCH: for ($dir) { + /\.global|\.globl|\.extern/ + && do { $globals{$$line} = $prefix . 
$$line; + $$line = $globals{$$line} if ($prefix); + last; + }; + /\.type/ && do { my ($sym,$type,$narg) = split(',',$$line); + if ($type eq "\@function") { + undef $current_function; + $current_function->{name} = $sym; + $current_function->{abi} = "svr4"; + $current_function->{narg} = $narg; + $current_function->{scope} = defined($globals{$sym})?"PUBLIC":"PRIVATE"; + } elsif ($type eq "\@abi-omnipotent") { + undef $current_function; + $current_function->{name} = $sym; + $current_function->{scope} = defined($globals{$sym})?"PUBLIC":"PRIVATE"; + } + $$line =~ s/\@abi\-omnipotent/\@function/; + $$line =~ s/\@function.*/\@function/; + last; + }; + /\.asciz/ && do { if ($$line =~ /^"(.*)"$/) { + $dir = ".byte"; + $$line = join(",",unpack("C*",$1),0); + } + last; + }; + /\.rva|\.long|\.quad/ + && do { $$line =~ s/([_a-z][_a-z0-9]*)/$globals{$1} or $1/gei; + $$line =~ s/\.L/$decor/g; + last; + }; + } + + if ($gas) { + $self->{value} = $dir . "\t" . $$line; + + if ($dir =~ /\.extern/) { + $self->{value} = ""; # swallow extern + } elsif (!$elf && $dir =~ /\.type/) { + $self->{value} = ""; + $self->{value} = ".def\t" . ($globals{$1} or $1) . ";\t" . + (defined($globals{$1})?".scl 2;":".scl 3;") . + "\t.type 32;\t.endef" + if ($win64 && $$line =~ /([^,]+),\@function/); + } elsif (!$elf && $dir =~ /\.size/) { + $self->{value} = ""; + if (defined($current_function)) { + $self->{value} .= "${decor}SEH_end_$current_function->{name}:" + if ($win64 && $current_function->{abi} eq "svr4"); + undef $current_function; + } + } elsif (!$elf && $dir =~ /\.align/) { + $self->{value} = ".p2align\t" . (log($$line)/log(2)); + } elsif ($dir eq ".section") { + $current_segment=$$line; + if (!$elf && $current_segment eq ".init") { + if ($flavour eq "macosx") { $self->{value} = ".mod_init_func"; } + elsif ($flavour eq "mingw64") { $self->{value} = ".section\t.ctors"; } + } + } elsif ($dir =~ /\.(text|data)/) { + $current_segment=".$1"; + } elsif ($dir =~ /\.hidden/) { + if ($flavour eq "macosx") { $self->{value} = ".private_extern\t$prefix$$line"; } + elsif ($flavour eq "mingw64") { $self->{value} = ""; } + } elsif ($dir =~ /\.comm/) { + $self->{value} = "$dir\t$prefix$$line"; + $self->{value} =~ s|,([0-9]+),([0-9]+)$|",$1,".log($2)/log(2)|e if ($flavour eq "macosx"); + } + $$line = ""; + return $self; + } + + # non-gas case or nasm/masm + SWITCH: for ($dir) { + /\.text/ && do { my $v=undef; + if ($nasm) { + $v="section .text code align=64\n"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = ".text\$"; + $v.="$current_segment\tSEGMENT "; + $v.=$masm>=$masmref ? "ALIGN(256)" : "PAGE"; + $v.=" 'CODE'"; + } + $self->{value} = $v; + last; + }; + /\.data/ && do { my $v=undef; + if ($nasm) { + $v="section .data data align=8\n"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = "_DATA"; + $v.="$current_segment\tSEGMENT"; + } + $self->{value} = $v; + last; + }; + /\.section/ && do { my $v=undef; + $$line =~ s/([^,]*).*/$1/; + $$line = ".CRT\$XCU" if ($$line eq ".init"); + if ($nasm) { + $v="section $$line"; + if ($$line=~/\.([px])data/) { + $v.=" rdata align="; + $v.=$1 eq "p"? 4 : 8; + } elsif ($$line=~/\.CRT\$/i) { + $v.=" rdata align=8"; + } + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $v.="$$line\tSEGMENT"; + if ($$line=~/\.([px])data/) { + $v.=" READONLY"; + $v.=" ALIGN(".($1 eq "p" ? 4 : 8).")" if ($masm>=$masmref); + } elsif ($$line=~/\.CRT\$/i) { + $v.=" READONLY "; + $v.=$masm>=$masmref ? 
"ALIGN(8)" : "DWORD"; + } + } + $current_segment = $$line; + $self->{value} = $v; + last; + }; + /\.extern/ && do { $self->{value} = "EXTERN\t".$$line; + $self->{value} .= ":NEAR" if ($masm); + last; + }; + /\.globl|.global/ + && do { $self->{value} = $masm?"PUBLIC":"global"; + $self->{value} .= "\t".$$line; + last; + }; + /\.size/ && do { if (defined($current_function)) { + undef $self->{value}; + if ($current_function->{abi} eq "svr4") { + $self->{value}="${decor}SEH_end_$current_function->{name}:"; + $self->{value}.=":\n" if($masm); + } + $self->{value}.="$current_function->{name}\tENDP" if($masm && $current_function->{name}); + undef $current_function; + } + last; + }; + /\.align/ && do { my $max = ($masm && $masm>=$masmref) ? 256 : 4096; + $self->{value} = "ALIGN\t".($$line>$max?$max:$$line); + last; + }; + /\.(value|long|rva|quad)/ + && do { my $sz = substr($1,0,1); + my @arr = split(/,\s*/,$$line); + my $last = pop(@arr); + my $conv = sub { my $var=shift; + $var=~s/^(0b[0-1]+)/oct($1)/eig; + $var=~s/^0x([0-9a-f]+)/0$1h/ig if ($masm); + if ($sz eq "D" && ($current_segment=~/.[px]data/ || $dir eq ".rva")) + { $var=~s/([_a-z\$\@][_a-z0-9\$\@]*)/$nasm?"$1 wrt ..imagebase":"imagerel $1"/egi; } + $var; + }; + + $sz =~ tr/bvlrq/BWDDQ/; + $self->{value} = "\tD$sz\t"; + for (@arr) { $self->{value} .= &$conv($_).","; } + $self->{value} .= &$conv($last); + last; + }; + /\.byte/ && do { my @str=split(/,\s*/,$$line); + map(s/(0b[0-1]+)/oct($1)/eig,@str); + map(s/0x([0-9a-f]+)/0$1h/ig,@str) if ($masm); + while ($#str>15) { + $self->{value}.="DB\t" + .join(",",@str[0..15])."\n"; + foreach (0..15) { shift @str; } + } + $self->{value}.="DB\t" + .join(",",@str) if (@str); + last; + }; + /\.comm/ && do { my @str=split(/,\s*/,$$line); + my $v=undef; + if ($nasm) { + $v.="common $prefix@str[0] @str[1]"; + } else { + $v="$current_segment\tENDS\n" if ($current_segment); + $current_segment = "_DATA"; + $v.="$current_segment\tSEGMENT\n"; + $v.="COMM @str[0]:DWORD:".@str[1]/4; + } + $self->{value} = $v; + last; + }; + } + $$line = ""; + } + + $ret; + } + sub out { + my $self = shift; + $self->{value}; + } +} + +# Upon initial x86_64 introduction SSE>2 extensions were not introduced +# yet. In order not to be bothered by tracing exact assembler versions, +# but at the same time to provide a bare security minimum of AES-NI, we +# hard-code some instructions. Extensions past AES-NI on the other hand +# are traced by examining assembler version in individual perlasm +# modules... 
+ +my %regrm = ( "%eax"=>0, "%ecx"=>1, "%edx"=>2, "%ebx"=>3, + "%esp"=>4, "%ebp"=>5, "%esi"=>6, "%edi"=>7 ); + +sub rex { + my $opcode=shift; + my ($dst,$src,$rex)=@_; + + $rex|=0x04 if($dst>=8); + $rex|=0x01 if($src>=8); + push @$opcode,($rex|0x40) if ($rex); +} + +my $movq = sub { # elderly gas can't handle inter-register movq + my $arg = shift; + my @opcode=(0x66); + if ($arg =~ /%xmm([0-9]+),\s*%r(\w+)/) { + my ($src,$dst)=($1,$2); + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,$src,$dst,0x8); + push @opcode,0x0f,0x7e; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + @opcode; + } elsif ($arg =~ /%r(\w+),\s*%xmm([0-9]+)/) { + my ($src,$dst)=($2,$1); + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,$src,$dst,0x8); + push @opcode,0x0f,0x6e; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + @opcode; + } else { + (); + } +}; + +my $pextrd = sub { + if (shift =~ /\$([0-9]+),\s*%xmm([0-9]+),\s*(%\w+)/) { + my @opcode=(0x66); + my $imm=$1; + my $src=$2; + my $dst=$3; + if ($dst =~ /%r([0-9]+)d/) { $dst = $1; } + elsif ($dst =~ /%e/) { $dst = $regrm{$dst}; } + rex(\@opcode,$src,$dst); + push @opcode,0x0f,0x3a,0x16; + push @opcode,0xc0|(($src&7)<<3)|($dst&7); # ModR/M + push @opcode,$imm; + @opcode; + } else { + (); + } +}; + +my $pinsrd = sub { + if (shift =~ /\$([0-9]+),\s*(%\w+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + my $imm=$1; + my $src=$2; + my $dst=$3; + if ($src =~ /%r([0-9]+)/) { $src = $1; } + elsif ($src =~ /%e/) { $src = $regrm{$src}; } + rex(\@opcode,$dst,$src); + push @opcode,0x0f,0x3a,0x22; + push @opcode,0xc0|(($dst&7)<<3)|($src&7); # ModR/M + push @opcode,$imm; + @opcode; + } else { + (); + } +}; + +my $pshufb = sub { + if (0 && shift =~ /%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$2,$1); + push @opcode,0x0f,0x38,0x00; + push @opcode,0xc0|($1&7)|(($2&7)<<3); # ModR/M + @opcode; + } else { + (); + } +}; + +my $palignr = sub { + if (shift =~ /\$([0-9]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x3a,0x0f; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + push @opcode,$1; + @opcode; + } else { + (); + } +}; + +my $pclmulqdq = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x66); + rex(\@opcode,$3,$2); + push @opcode,0x0f,0x3a,0x44; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +my $rdrand = sub { + if (shift =~ /%[er](\w+)/) { + my @opcode=(); + my $dst=$1; + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,0,$dst,8); + push @opcode,0x0f,0xc7,0xf0|($dst&7); + @opcode; + } else { + (); + } +}; + +my $rdseed = sub { + if (shift =~ /%[er](\w+)/) { + my @opcode=(); + my $dst=$1; + if ($dst !~ /[0-9]+/) { $dst = $regrm{"%e$dst"}; } + rex(\@opcode,0,$dst,8); + push @opcode,0x0f,0xc7,0xf8|($dst&7); + @opcode; + } else { + (); + } +}; + +# Not all AVX-capable assemblers recognize AMD XOP extension. Since we +# are using only two instructions hand-code them in order to be excused +# from chasing assembler versions... 
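+#
+# For instance (hypothetical operands): "vprotd $5,%xmm1,%xmm2" is
+# assembled by the $vprotd handler below into the XOP byte sequence
+# 0x8f,0xe8,0x78,0xc2,0xd1,0x05 -- the 0x8f escape, the RXB/map byte
+# 0xe8 built by rxb(), the W/vvvv/L/pp byte 0x78, opcode 0xc2, ModR/M
+# 0xd1 (selecting xmm2,xmm1) and the immediate 5.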
+ +sub rxb { + my $opcode=shift; + my ($dst,$src1,$src2,$rxb)=@_; + + $rxb|=0x7<<5; + $rxb&=~(0x04<<5) if($dst>=8); + $rxb&=~(0x01<<5) if($src1>=8); + $rxb&=~(0x02<<5) if($src2>=8); + push @$opcode,$rxb; +} + +my $vprotd = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x8f); + rxb(\@opcode,$3,$2,-1,0x08); + push @opcode,0x78,0xc2; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +my $vprotq = sub { + if (shift =~ /\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) { + my @opcode=(0x8f); + rxb(\@opcode,$3,$2,-1,0x08); + push @opcode,0x78,0xc3; + push @opcode,0xc0|($2&7)|(($3&7)<<3); # ModR/M + my $c=$1; + push @opcode,$c=~/^0/?oct($c):$c; + @opcode; + } else { + (); + } +}; + +# Intel Control-flow Enforcement Technology extension. All functions and +# indirect branch targets will have to start with this instruction... + +my $endbranch = sub { + (0xf3,0x0f,0x1e,0xfa); +}; + +######################################################################## + +if ($nasm) { + print <<___; +default rel +%define XMMWORD +%define YMMWORD +%define ZMMWORD +___ +} elsif ($masm) { + print <<___; +OPTION DOTNAME +___ +} +while(defined(my $line=<>)) { + + $line =~ s|\R$||; # Better chomp + + $line =~ s|[#!].*$||; # get rid of asm-style comments... + $line =~ s|/\*.*\*/||; # ... and C-style comments... + $line =~ s|^\s+||; # ... and skip white spaces in beginning + $line =~ s|\s+$||; # ... and at the end + + if (my $label=label->re(\$line)) { print $label->out(); } + + if (my $directive=directive->re(\$line)) { + printf "%s",$directive->out(); + } elsif (my $opcode=opcode->re(\$line)) { + my $asm = eval("\$".$opcode->mnemonic()); + + if ((ref($asm) eq 'CODE') && scalar(my @bytes=&$asm($line))) { + print $gas?".byte\t":"DB\t",join(',',@bytes),"\n"; + next; + } + + my @args; + ARGUMENT: while (1) { + my $arg; + + ($arg=register->re(\$line, $opcode))|| + ($arg=const->re(\$line)) || + ($arg=ea->re(\$line, $opcode)) || + ($arg=expr->re(\$line, $opcode)) || + last ARGUMENT; + + push @args,$arg; + + last ARGUMENT if ($line !~ /^,/); + + $line =~ s/^,\s*//; + } # ARGUMENT: + + if ($#args>=0) { + my $insn; + my $sz=$opcode->size(); + + if ($gas) { + $insn = $opcode->out($#args>=1?$args[$#args]->size():$sz); + @args = map($_->out($sz),@args); + printf "\t%s\t%s",$insn,join(",",@args); + } else { + $insn = $opcode->out(); + foreach (@args) { + my $arg = $_->out(); + # $insn.=$sz compensates for movq, pinsrw, ... 
+                    if ($arg =~ /^xmm[0-9]+$/) { $insn.=$sz; $sz="x" if(!$sz); last; }
+                    if ($arg =~ /^ymm[0-9]+$/) { $insn.=$sz; $sz="y" if(!$sz); last; }
+                    if ($arg =~ /^zmm[0-9]+$/) { $insn.=$sz; $sz="z" if(!$sz); last; }
+                    if ($arg =~ /^mm[0-9]+$/)  { $insn.=$sz; $sz="q" if(!$sz); last; }
+                }
+                @args = reverse(@args);
+                undef $sz if ($nasm && $opcode->mnemonic() eq "lea");
+                printf "\t%s\t%s",$insn,join(",",map($_->out($sz),@args));
+            }
+        } else {
+            printf "\t%s",$opcode->out();
+        }
+    }
+
+    print $line,"\n";
+}
+
+print "\n$current_segment\tENDS\n"  if ($current_segment && $masm);
+print "END\n"                       if ($masm);
+
+close STDOUT;
+
+ #################################################
+# Cross-reference x86_64 ABI "card"
+#
+#               Unix            Win64
+# %rax          *               *
+# %rbx          -               -
+# %rcx          #4              #1
+# %rdx          #3              #2
+# %rsi          #2              -
+# %rdi          #1              -
+# %rbp          -               -
+# %rsp          -               -
+# %r8           #5              #3
+# %r9           #6              #4
+# %r10          *               *
+# %r11          *               *
+# %r12          -               -
+# %r13          -               -
+# %r14          -               -
+# %r15          -               -
+#
+# (*)   volatile register
+# (-)   preserved by callee
+# (#)   Nth argument, volatile
+#
+# In Unix terms top of stack is argument transfer area for arguments
+# which could not be accommodated in registers. Or in other words 7th
+# [integer] argument resides at 8(%rsp) upon function entry point.
+# 128 bytes above %rsp constitute a "red zone" which is not touched
+# by signal handlers and can be used as temporary storage without
+# allocating a frame.
+#
+# In Win64 terms N*8 bytes on top of stack is argument transfer area,
+# which belongs to/can be overwritten by callee. N is the number of
+# arguments passed to callee, *but* not less than 4! This means that
+# upon function entry point 5th argument resides at 40(%rsp), as well
+# as that 32 bytes from 8(%rsp) can always be used as temporary
+# storage [without allocating a frame]. One can actually argue that
+# one can assume a "red zone" above stack pointer under Win64 as well.
+# Point is that apparently the Windows kernel never alters the area
+# above the user stack pointer in a truly asynchronous manner...
+#
+# All the above means that if assembler programmer adheres to Unix
+# register and stack layout, but disregards the "red zone" existence,
+# it's possible to use following prologue and epilogue to "gear" from
+# Unix to Win64 ABI in leaf functions with not more than 6 arguments.
+#
+# omnipotent_function:
+# ifdef WIN64
+#       movq    %rdi,8(%rsp)
+#       movq    %rsi,16(%rsp)
+#       movq    %rcx,%rdi       ; if 1st argument is actually present
+#       movq    %rdx,%rsi       ; if 2nd argument is actually ...
+#       movq    %r8,%rdx        ; if 3rd argument is ...
+#       movq    %r9,%rcx        ; if 4th argument ...
+#       movq    40(%rsp),%r8    ; if 5th ...
+#       movq    48(%rsp),%r9    ; if 6th ...
+# endif
+# ...
+# ifdef WIN64
+#       movq    8(%rsp),%rdi
+#       movq    16(%rsp),%rsi
+# endif
+#       ret
+#
+ #################################################
+# Win64 SEH, Structured Exception Handling.
+#
+# Unlike on Unix systems(*) lack of Win64 stack unwinding information
+# has an undesired side-effect at run-time: if an exception is raised in
+# assembler subroutine such as those in question (basically we're
+# referring to segmentation violations caused by malformed input
+# parameters), the application is briskly terminated without invoking
+# any exception handlers, most notably without generating memory dump
+# or any user notification whatsoever. This poses a problem. It's
+# possible to address it by registering custom language-specific
+# handler that would restore processor context to the state at
+# subroutine entry point and return "exception is not handled, keep
+# unwinding" code.
Writing such handler can be a challenge... But it's
+# doable, though requires certain coding convention. Consider following
+# snippet:
+#
+# .type function,@function
+# function:
+#       movq    %rsp,%rax       # copy rsp to volatile register
+#       pushq   %r15            # save non-volatile registers
+#       pushq   %rbx
+#       pushq   %rbp
+#       movq    %rsp,%r11
+#       subq    %rdi,%r11       # prepare [variable] stack frame
+#       andq    $-64,%r11
+#       movq    %rax,0(%r11)    # check for exceptions
+#       movq    %r11,%rsp       # allocate [variable] stack frame
+#       movq    %rax,0(%rsp)    # save original rsp value
+# magic_point:
+#       ...
+#       movq    0(%rsp),%rcx    # pull original rsp value
+#       movq    -24(%rcx),%rbp  # restore non-volatile registers
+#       movq    -16(%rcx),%rbx
+#       movq    -8(%rcx),%r15
+#       movq    %rcx,%rsp       # restore original rsp
+# magic_epilogue:
+#       ret
+# .size function,.-function
+#
+# The key is that up to magic_point copy of original rsp value remains
+# in chosen volatile register and no non-volatile register, except for
+# rsp, is modified. While past magic_point rsp remains constant till
+# the very end of the function. In this case custom language-specific
+# exception handler would look like this:
+#
+# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
+#               CONTEXT *context,DISPATCHER_CONTEXT *disp)
+# {     ULONG64 *rsp = (ULONG64 *)context->Rax;
+#       ULONG64  rip = context->Rip;
+#
+#       if (rip >= magic_point)
+#       {   rsp = (ULONG64 *)context->Rsp;
+#           if (rip < magic_epilogue)
+#           {   rsp = (ULONG64 *)rsp[0];
+#               context->Rbp = rsp[-3];
+#               context->Rbx = rsp[-2];
+#               context->R15 = rsp[-1];
+#           }
+#       }
+#       context->Rsp = (ULONG64)rsp;
+#       context->Rdi = rsp[1];
+#       context->Rsi = rsp[2];
+#
+#       memcpy (disp->ContextRecord,context,sizeof(CONTEXT));
+#       RtlVirtualUnwind(UNW_FLAG_NHANDLER,disp->ImageBase,
+#               disp->ControlPc,disp->FunctionEntry,disp->ContextRecord,
+#               &disp->HandlerData,&disp->EstablisherFrame,NULL);
+#       return ExceptionContinueSearch;
+# }
+#
+# It's appropriate to implement this handler in assembler, directly in
+# function's module. In order to do that one has to know members'
+# offsets in CONTEXT and DISPATCHER_CONTEXT structures and some constant
+# values. Here they are:
+#
+#       CONTEXT.Rax                             120
+#       CONTEXT.Rcx                             128
+#       CONTEXT.Rdx                             136
+#       CONTEXT.Rbx                             144
+#       CONTEXT.Rsp                             152
+#       CONTEXT.Rbp                             160
+#       CONTEXT.Rsi                             168
+#       CONTEXT.Rdi                             176
+#       CONTEXT.R8                              184
+#       CONTEXT.R9                              192
+#       CONTEXT.R10                             200
+#       CONTEXT.R11                             208
+#       CONTEXT.R12                             216
+#       CONTEXT.R13                             224
+#       CONTEXT.R14                             232
+#       CONTEXT.R15                             240
+#       CONTEXT.Rip                             248
+#       CONTEXT.Xmm6                            512
+#       sizeof(CONTEXT)                         1232
+#       DISPATCHER_CONTEXT.ControlPc            0
+#       DISPATCHER_CONTEXT.ImageBase            8
+#       DISPATCHER_CONTEXT.FunctionEntry        16
+#       DISPATCHER_CONTEXT.EstablisherFrame     24
+#       DISPATCHER_CONTEXT.TargetIp             32
+#       DISPATCHER_CONTEXT.ContextRecord        40
+#       DISPATCHER_CONTEXT.LanguageHandler      48
+#       DISPATCHER_CONTEXT.HandlerData          56
+#       UNW_FLAG_NHANDLER                       0
+#       ExceptionContinueSearch                 1
+#
+# In order to tie the handler to the function one has to compose
+# couple of structures: one for .xdata segment and one for .pdata.
+#
+# UNWIND_INFO structure for .xdata segment would be
+#
+# function_unwind_info:
+#       .byte   9,0,0,0
+#       .rva    handler
+#
+# This structure designates exception handler for a function with
+# zero-length prologue, no stack frame or frame register.
+#
+# To facilitate composing of .pdata structures, auto-generated "gear"
+# prologue copies rsp value to rax and denotes next instruction with
+# .LSEH_begin_{function_name} label. This essentially defines the SEH
+# styling rule mentioned in the beginning.
Position of this label is
+# chosen in such manner that possible exceptions raised in the "gear"
+# prologue would be accounted to caller and unwound from latter's frame.
+# End of function is marked with respective .LSEH_end_{function_name}
+# label. To summarize, .pdata segment would contain
+#
+#       .rva    .LSEH_begin_function
+#       .rva    .LSEH_end_function
+#       .rva    function_unwind_info
+#
+# Reference to function_unwind_info from .xdata segment is the anchor.
+# In case you wonder why references are 32-bit .rvas and not 64-bit
+# .quads: references put into these two segments are required to be
+# *relative* to the base address of the current binary module, a.k.a.
+# image base. No Win64 module, be it .exe or .dll, can be larger than
+# 2GB and thus such relative references can be and are accommodated in
+# 32 bits.
+#
+# Having reviewed the example function code, one can argue that "movq
+# %rsp,%rax" above is redundant. It is not! Keep in mind that on Unix
+# rax would contain an undefined value. If this "offends" you, use
+# another register and refrain from modifying rax till magic_point is
+# reached, i.e. as if it was a non-volatile register. If more registers
+# are required prior [variable] frame setup is completed, note that
+# nobody says that you can have only one "magic point." You can
+# "liberate" non-volatile registers by denoting last stack off-load
+# instruction and reflecting it in finer grade unwind logic in handler.
+# After all, isn't it why it's called *language-specific* handler...
+#
+# SE handlers are also involved in unwinding stack when executable is
+# profiled or debugged. Profiling implies additional limitations that
+# are too subtle to discuss here. For now it's sufficient to say that
+# in order to simplify handlers one should either a) offload original
+# %rsp to stack (like discussed above); or b) if you have a register to
+# spare for frame pointer, choose volatile one.
+#
+# (*)   Note that we're talking about run-time, not debug-time. Lack of
+#       unwind information makes debugging hard on both Windows and
+#       Unix. "Unlike" refers to the fact that on Unix signal handler
+#       will always be invoked, core dumped and appropriate exit code
+#       returned to parent (for user notification).
diff --git a/crypto_ops.h b/crypto_ops.h
new file mode 100644
index 0000000..4c72280
--- /dev/null
+++ b/crypto_ops.h
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
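+//
+// Usage sketch (illustrative only): call memzero_crypto() instead of a
+// plain memset() when wiping key material, since the trailing compiler
+// barrier (or the MSVC __stos* intrinsics) keeps the optimizer from
+// eliding the "dead" stores on an object that is about to go away:
+//
+//   uint8 tmp_key[32];
+//   // ...derive and use tmp_key...
+//   memzero_crypto(tmp_key, sizeof(tmp_key));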
+#ifndef TUNSAFE_CRYPTO_OPS_H_
+#define TUNSAFE_CRYPTO_OPS_H_
+
+#include "build_config.h"
+#include "tunsafe_types.h"
+
+#include <string.h>
+#if defined(COMPILER_MSVC)
+#include <intrin.h>
+#endif  // defined(COMPILER_MSVC)
+
+#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 7) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosq((uint64*)dst, 0, n >> 3);
+  }
+}
+
+#elif defined(ARCH_CPU_X86) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 3) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosd((unsigned long*)dst, 0, n >> 2);
+  }
+}
+#else
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  memset(dst, 0, n);
+  __asm__ __volatile__("": :"r"(dst) :"memory");
+}
+#endif
+
+int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n);
+
+
+#endif  // TUNSAFE_CRYPTO_OPS_H_
\ No newline at end of file
diff --git a/icons/green-bg-icon.ico b/icons/green-bg-icon.ico
new file mode 100644
index 0000000..376ff26
Binary files /dev/null and b/icons/green-bg-icon.ico differ
diff --git a/icons/green-bg-icon.png b/icons/green-bg-icon.png
new file mode 100644
index 0000000..69af2ae
Binary files /dev/null and b/icons/green-bg-icon.png differ
diff --git a/icons/green-icon.ico b/icons/green-icon.ico
new file mode 100644
index 0000000..7434c05
Binary files /dev/null and b/icons/green-icon.ico differ
diff --git a/icons/green-icon.png b/icons/green-icon.png
new file mode 100644
index 0000000..bfc2eeb
Binary files /dev/null and b/icons/green-icon.png differ
diff --git a/icons/neutral-icon.ico b/icons/neutral-icon.ico
new file mode 100644
index 0000000..254f2f3
Binary files /dev/null and b/icons/neutral-icon.ico differ
diff --git a/icons/neutral-icon.png b/icons/neutral-icon.png
new file mode 100644
index 0000000..bf6412e
Binary files /dev/null and b/icons/neutral-icon.png differ
diff --git a/icons/red-icon.ico b/icons/red-icon.ico
new file mode 100644
index 0000000..ec4c1a9
Binary files /dev/null and b/icons/red-icon.ico differ
diff --git a/icons/red-icon.png b/icons/red-icon.png
new file mode 100644
index 0000000..ddb33a3
Binary files /dev/null and b/icons/red-icon.png differ
diff --git a/installer/.gitignore b/installer/.gitignore
new file mode 100644
index 0000000..66727c9
--- /dev/null
+++ b/installer/.gitignore
@@ -0,0 +1,4 @@
+/tunsafe*.exe
+/x64/
+/x86/
+*.pyc
diff --git a/installer/ChangeLog.txt b/installer/ChangeLog.txt
new file mode 100644
index 0000000..3cb9202
--- /dev/null
+++ b/installer/ChangeLog.txt
@@ -0,0 +1,48 @@
+2018-06-20 - TunSafe v1.3-rc3
+
+Changes:
+1.Add option to block Internet traffic outside of TunSafe. Either
+  based on firewall rules, or by adding a null route, or both.
+  The firewall rule blocks all traffic except traffic from TunSafe,
+  loopback traffic, and DHCP traffic on the default NIC.
+  The route rule adds two /1 routes to 0.0.0.0.
+2.Convert LF to CRLF when importing config files
+3.Update some logging messages
+4.Delete the old routing rule pointing at the VPN server IP when
+  disconnecting
+5.Delete any conflicting old routing rule pointing at the VPN server
+  when connecting.
+6.Tray popup menu did not disappear when clicking outside of it.
+7.Show config file names also in tray popup menu.
+8.Make the menu item bold if connection is selected in popup menu.
+9.Don't show the .conf filename extension in the UI.
+10.Show also config file name when hovering on tray icon.
+11.Click on the connected server to toggle connection +12.Fix bug where internet blocking checkbox was not removed. +13.Change so bold is used for selected server, and checkbox + is used when connected. +14.Use WS_EX_COMPOSITED to reduce flicker +15.Now possible to enter a filename on command line to connect to. +16.Support /minimize and /minimize_on_connect command line opts. +17.Support PreUp,PostUp,PreDown,PostDown options on [Interface] + Note: For security reasons you need to first enable them, + so either Shift-Click on Options and select Allow Pre/Post Commands + or specify the /allow_pre_post command line option. + +2018-04-29 - TunSafe v1.2 + +Changes: +1.Use /24 instead of failing when a /32 Address is used +2.Use /120 instead of failing when a /128 Address is used +3.Add routes for all entries in AllowedIPs + +2018-04-29 - TunSafe v1.1 + +Changes: +1.Retry on failed DNS lookup. Helps when resuming from sleep. +2.Display a better message if the TAP adapter can't be found. +3.Retry connect when getting ERROR_FILE_NOT_FOUND. + +2018-03-06 - TunSafe v1.0 + +First public release. \ No newline at end of file diff --git a/installer/LICENSE.TXT b/installer/LICENSE.TXT new file mode 100644 index 0000000..06968fb --- /dev/null +++ b/installer/LICENSE.TXT @@ -0,0 +1,240 @@ +TunSafe © 2018 Ludvig Strigeus +============================== + +BY USING THE SOFTWARE, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT +THEM, DO NOT USE THE SOFTWARE. + +This software is provided "as is", without warranty of any kind, +express or implied, including but not limited to the warranties of +merchantability, fitness for a particular purpose and noninfringement. +In no event shall the authors or copyright holders be liable for any +claim, damages or other liability, whether in an action of contract, +tort or otherwise, arising from, out of or in connection with the +Software or the use or other dealings in the Software. + +We may not provide support services for this software in the future. + +You may install and use any number of copies of the software on your +devices. + +Please be aware that, similar to other networking tools that capture +network packets, the information processed by TunSafe or your VPN +provider may include personally identifiable or other sensitive +information (such as usernames, passwords, addresses of web sites +accessed). By using this software, you acknowledge that you are aware of +this and take sole responsibility for any personally identifiable or +other sensitive information provided to TunSafe or your VPN provider +through your use of the software. + +The software is licensed, not sold. This agreement only gives you some +rights to use the software. Unless applicable law gives you more rights +despite this limitation, you may use the software only as expressly +permitted in this agreement. In doing so, you must comply with any +technical limitations in the software that only allow you to use it in +certain ways. You may not + + * work around any technical limitations in the software; + + * reverse engineer, decompile or disassemble the software, except + and only to the extent that applicable law expressly permits, + despite this limitation; + + * publish the software for others to copy; + + * sell, rent, lease or lend the software; + + * transfer the software or this agreement to any third party; or + + * use the software for commercial software hosting services. + +All exceptions require prior written consent from info@tunsafe.com. 
+ +You can recover from us and our suppliers only direct damages up to +U.S. $0.10. You cannot recover any other damages, including consequential, +lost profits, special, indirect or incidental damages. + +This limitation applies to + * anything related to the software, services, content (including code) + on third party Internet sites, or third party programs; and + * claims for breach of contract, breach of warranty, guarantee or + condition, strict liability, negligence, or other tort to the extent + permitted by applicable law. + +It also applies even if we knew or should have known about the possibility +of the damages. + +This agreement describes certain legal rights. You may have other rights +under the laws of your country. You may also have rights with respect to the +party from whom you acquired the software. This agreement does not change +your rights under the laws of your country if the laws of your country do +not permit it to do so. + +This agreement is the entire agreement and is governed by the laws of Sweden. + +Several pieces of Open Source software were used in this product. +Here are their licenses. + +BLAKE2 License +-------------- + +Copyright 2012, Samuel Neves . You may use this under the +terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at +your option. The terms of these licenses can be found at: + +- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0 +- OpenSSL license : https://www.openssl.org/source/license.html +- Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0 + +More information about the BLAKE2 hash function can be found at +https://blake2.net. + + +Curve25519-Donna License +------------------------ + +Copyright 2008, Google Inc. +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + + * Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above +copyright notice, this list of conditions and the following disclaimer +in the documentation and/or other materials provided with the +distribution. + * Neither the name of Google Inc. nor the names of its +contributors may be used to endorse or promote products derived from +this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +OpenSSL License +--------------- + +==================================================================== +Copyright (c) 1998-2018 The OpenSSL Project. All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +1. 
Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the + distribution. + +3. All advertising materials mentioning features or use of this + software must display the following acknowledgment: + "This product includes software developed by the OpenSSL Project + for use in the OpenSSL Toolkit. (http://www.openssl.org/)" + +4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to + endorse or promote products derived from this software without + prior written permission. For written permission, please contact + openssl-core@openssl.org. + +5. Products derived from this software may not be called "OpenSSL" + nor may "OpenSSL" appear in their names without prior written + permission of the OpenSSL Project. + +6. Redistributions of any form whatsoever must retain the following + acknowledgment: + "This product includes software developed by the OpenSSL Project + for use in the OpenSSL Toolkit (http://www.openssl.org/)" + +THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY +EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR +ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, +STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED +OF THE POSSIBILITY OF SUCH DAMAGE. +==================================================================== + +This product includes cryptographic software written by Eric Young +(eay@cryptsoft.com). This product includes software written by Tim +Hudson (tjh@cryptsoft.com). + + + +Original SSLeay License +----------------------- + +Copyright (C) 1995-1998 Eric Young (eay@cryptsoft.com) +All rights reserved. + +This package is an SSL implementation written +by Eric Young (eay@cryptsoft.com). +The implementation was written so as to conform with Netscapes SSL. + +This library is free for commercial and non-commercial use as long as +the following conditions are aheared to. The following conditions +apply to all code found in this distribution, be it the RC4, RSA, +lhash, DES, etc., code; not just the SSL code. The SSL documentation +included with this distribution is covered by the same copyright terms +except that the holder is Tim Hudson (tjh@cryptsoft.com). + +Copyright remains Eric Young's, and as such any Copyright notices in +the code are not to be removed. +If this package is used in a product, Eric Young should be given attribution +as the author of the parts of the library used. +This can be in the form of a textual message at program startup or +in documentation (online or textual) provided with the package. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: +1. Redistributions of source code must retain the copyright + notice, this list of conditions and the following disclaimer. +2. 
Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. +3. All advertising materials mentioning features or use of this software + must display the following acknowledgement: + "This product includes cryptographic software written by + Eric Young (eay@cryptsoft.com)" + The word 'cryptographic' can be left out if the rouines from the library + being used are not cryptographic related :-). +4. If you include any Windows specific code (or a derivative thereof) from + the apps directory (application code) you must include an acknowledgement: + "This product includes software written by Tim Hudson (tjh@cryptsoft.com)" + +THIS SOFTWARE IS PROVIDED BY ERIC YOUNG ``AS IS'' AND +ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +SUCH DAMAGE. + +The licence and distribution terms for any publically available version or +derivative of this code cannot be changed. i.e. this code cannot simply be +copied and put under another distribution licence +[including the GNU Public Licence.] + + diff --git a/installer/TunSafe.conf b/installer/TunSafe.conf new file mode 100644 index 0000000..fce908a --- /dev/null +++ b/installer/TunSafe.conf @@ -0,0 +1,46 @@ +# This is a sample config file for TunSafe. It uses the same syntax as +# WireGuard's wg-quick tool + +[Interface] + +# The private key of this computer. This is a secret key, don't give it out. +# To convert it to a public key you can go to 'Generate Key Pair' in TunSafe. +PrivateKey = gIIBl0OHb3wZjYGqZtgzRml3wec0e5vqXtSvCTfa42w= + +# Whether we want to bind a port to allow others to initiate connections to us. +# Please ensure this port is mapped in your router. +# ListenPort = 51820 + +# Switch DNS server while connected +# DNS = 8.8.8.8 + +# The addresses to bind to. Either IPv4 or IPv6. /31 and /32 are not supported. +Address = 192.168.2.2/24 + +# Whether to block all access to Internet that doesn't go through tunsafe. +# Note that Internet will keep being blocked even after TunSafe is restarted. +# Possible values (comma separated): +# route - Blocks all traffic using null route entries +# firewall - Blocks all traffic except TunSafe through the Windows firewall +# on - Uses the default block mechanism +# off - Turns off blocking +# BlockInternet = route, firewall + +[Peer] +# The public key of the peer. Do not use the private key here. Use the 'Generate Key Pair' +# function in TunSafe to convert a private key to a public key. +PublicKey = hIA3ikjlSOAo0qqrI+rXaS3ZH04Yx7Q2YQ4m2Syz+XE= + +# It's also possible to use a preshared key for extra security +# PresharedKey = SNz4BYc61amtDhzxNCxgYgdV9rPU+WiC8woX47Xf/2Y= + +# The IP range that we may send packets to for this peer. 
+AllowedIPs = 192.168.2.0/24 + +# Address of the server +Endpoint = 192.168.1.4:8040 + +# Send periodic keepalives to ensure connection stays up behind NAT. +PersistentKeepalive = 25 + + diff --git a/installer/icon.ico b/installer/icon.ico new file mode 100644 index 0000000..06b583b Binary files /dev/null and b/installer/icon.ico differ diff --git a/installer/signplugin.dll b/installer/signplugin.dll new file mode 100644 index 0000000..ef19ae8 Binary files /dev/null and b/installer/signplugin.dll differ diff --git a/installer/signplugin/.gitignore b/installer/signplugin/.gitignore new file mode 100644 index 0000000..99e1bf3 --- /dev/null +++ b/installer/signplugin/.gitignore @@ -0,0 +1,3 @@ +/Debug/ +/Release/ +/.vs/ \ No newline at end of file diff --git a/installer/signplugin/chkstk.obj b/installer/signplugin/chkstk.obj new file mode 100644 index 0000000..e9956a6 Binary files /dev/null and b/installer/signplugin/chkstk.obj differ diff --git a/installer/signplugin/ed25519.py b/installer/signplugin/ed25519.py new file mode 100644 index 0000000..7f8613b --- /dev/null +++ b/installer/signplugin/ed25519.py @@ -0,0 +1,104 @@ +import hashlib + +b = 256 +q = 2**255 - 19 +l = 2**252 + 27742317777372353535851937790883648493 + +def H(m): + return hashlib.sha512(m).digest() + +def expmod(b,e,m): + if e == 0: return 1 + t = expmod(b,e/2,m)**2 % m + if e & 1: t = (t*b) % m + return t + +def inv(x): + return expmod(x,q-2,q) + +d = -121665 * inv(121666) +I = expmod(2,(q-1)/4,q) + +def xrecover(y): + xx = (y*y-1) * inv(d*y*y+1) + x = expmod(xx,(q+3)/8,q) + if (x*x - xx) % q != 0: x = (x*I) % q + if x % 2 != 0: x = q-x + return x + +By = 4 * inv(5) +Bx = xrecover(By) +B = [Bx % q,By % q] + +def edwards(P,Q): + x1 = P[0] + y1 = P[1] + x2 = Q[0] + y2 = Q[1] + x3 = (x1*y2+x2*y1) * inv(1+d*x1*x2*y1*y2) + y3 = (y1*y2+x1*x2) * inv(1-d*x1*x2*y1*y2) + return [x3 % q,y3 % q] + +def scalarmult(P,e): + if e == 0: return [0,1] + Q = scalarmult(P,e/2) + Q = edwards(Q,Q) + if e & 1: Q = edwards(Q,P) + return Q + +def encodeint(y): + bits = [(y >> i) & 1 for i in range(b)] + return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)]) + +def encodepoint(P): + x = P[0] + y = P[1] + bits = [(y >> i) & 1 for i in range(b - 1)] + [x & 1] + return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)]) + +def bit(h,i): + return (ord(h[i/8]) >> (i%8)) & 1 + +def publickey(sk): + h = H(sk) + a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2)) + A = scalarmult(B,a) + return encodepoint(A) + +def Hint(m): + h = H(m) + return sum(2**i * bit(h,i) for i in range(2*b)) + +def signature(m,sk,pk): + h = H(sk) + a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2)) + r = Hint(''.join([h[i] for i in range(b/8,b/4)]) + m) + R = scalarmult(B,r) + S = (r + Hint(encodepoint(R) + pk + m) * a) % l + return encodepoint(R) + encodeint(S) + +def isoncurve(P): + x = P[0] + y = P[1] + return (-x*x + y*y - 1 - d*x*x*y*y) % q == 0 + +def decodeint(s): + return sum(2**i * bit(s,i) for i in range(0,b)) + +def decodepoint(s): + y = sum(2**i * bit(s,i) for i in range(0,b-1)) + x = xrecover(y) + if x & 1 != bit(s,b-1): x = q-x + P = [x,y] + if not isoncurve(P): raise Exception("decoding point that is not on curve") + return P + +def checkvalid(s,m,pk): + if len(s) != b/4: raise Exception("signature length is wrong") + if len(pk) != b/8: raise Exception("public-key length is wrong") + R = decodepoint(s[0:b/8]) + A = decodepoint(pk) + S = decodeint(s[b/8:b/4]) + h = Hint(encodepoint(R) + pk + 
m) + if scalarmult(B,S) != edwards(R,scalarmult(A,h)): + raise Exception("signature does not pass verification") diff --git a/installer/signplugin/ed_signtool.py b/installer/signplugin/ed_signtool.py new file mode 100644 index 0000000..3f8d0dd --- /dev/null +++ b/installer/signplugin/ed_signtool.py @@ -0,0 +1,22 @@ +import hashlib + +def H(m): + return hashlib.sha512(m).digest() + +import ed25519 +import os + +sk = "".join(chr(c) for c in [4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235]) +pk = "".join(chr(c) for c in [79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99]) + +hash = H(file('../tap/TunSafe-TAP-9.21.2.exe', 'rb').read()) +print hash.encode('hex'), repr(hash) + +#sk = os.urandom(32) +#pk = ed25519.publickey(sk) +#print 'sk', [ord(c) for c in sk] +#print 'pk', [ord(c) for c in pk] + +#m = 'test' +s = ed25519.signature(hash,sk,pk) +file('../tap/TunSafe-TAP-9.21.2.exe.sig', 'wb').write(s.encode('hex')) diff --git a/installer/signplugin/main.cpp b/installer/signplugin/main.cpp new file mode 100644 index 0000000..db8c2f4 --- /dev/null +++ b/installer/signplugin/main.cpp @@ -0,0 +1,121 @@ +#include +extern "C" { +#include "tiny/edsign.h" +#include "nsis/pluginapi.h" +#include "tiny/sha512.h" +} + +// To work with Unicode version of NSIS, please use TCHAR-type +// functions for accessing the variables and the stack. + +unsigned char buffer[4096]; + +// sk[4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235] +// pk[79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99] +static const unsigned char pk[32] = {79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99}; + +int CheckFile(char *file) { + sha512_state ctx; + int ret; + HANDLE h; + unsigned char out[64]; + unsigned char signature[64]; + + h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); + if (h == INVALID_HANDLE_VALUE) + return 1; + DWORD n; + sha512_init(&ctx); + + size_t total_size = 0; + size_t p = 0; + while (ReadFile(h, buffer, sizeof(buffer), &n, NULL) && n) { + total_size += n; + p = 0; + while (p + 128 <= n) { + sha512_block(&ctx, buffer + p); + p += 128; + } + if (p != n) + break; + } + sha512_final(&ctx, buffer + p, total_size); + sha512_get(&ctx, out, 0, 64); + CloseHandle(h); + /* + for (size_t i = 0; i < 64; i++) { + buffer[i * 2 + 0] = "0123456789abcdef"[out[i] >> 4]; + buffer[i * 2 + 1] = "0123456789abcdef"[out[i] & 0xF]; + } + buffer[128] = 0; + MessageBoxA(0, (char*)buffer, "sha", 0); + */ + char *x = file; + while (*x)x++; + memcpy(x, ".sig", 5); + + h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); + if (h == INVALID_HANDLE_VALUE) + return 2; + n = 0; + ReadFile(h, buffer, sizeof(buffer), &n, NULL); + CloseHandle(h); + if (n < 128) + return 3; + + memset(signature, 0, sizeof(signature)); + + for (int i = 0; i < 128; i++) { + unsigned char c = buffer[i]; + if (c >= '0' && c <= '9') + c -= '0'; + else if ((c |= 32), c >= 'a' && c <= 'f') + c -= 'a' - 10; + else + return 4; + signature[i >> 1] = (signature[i >> 1] << 4) + c; + 
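+    // Each pass folds one hex digit (4 bits) into the output, so the
+    // 128 hex characters decode into the 64-byte signature buffer.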
} + + /* create a random seed, and a keypair out of that seed */ + //ed25519_create_seed(seed); + //ed25519_create_keypair(public_key, private_key, seed); + + /* create signature on the message with the keypair */ + //ed25519_sign(signature, message, message_len, public_key, private_key); + + /* verify the signature */ + return edsign_verify(signature, pk, out, sizeof(out)) ? 0 : 5; +} + +extern "C" void __declspec(dllexport) myFunction(HWND hwndParent, int string_size, + LPTSTR variables, stack_t **stacktop, + extra_parameters *extra, ...) { + EXDLL_INIT(); + + int rv = 10; + + // note if you want parameters from the stack, pop them off in order. + // i.e. if you are called via exdll::myFunction file.dat read.txt + // calling popstring() the first time would give you file.dat, + // and the second time would give you read.txt. + // you should empty the stack of your parameters, and ONLY your + // parameters. + + // do your stuff here + { + LPTSTR msgbuf = (LPTSTR)GlobalAlloc(GPTR, (string_size + 1 + 10) * sizeof(*msgbuf)); + if (msgbuf) { + if (!popstring(msgbuf)) { + rv = CheckFile(msgbuf); + } + GlobalFree(msgbuf); + } + } + + pushint(rv); +} + + +BOOL WINAPI DllMain(HINSTANCE hInst, ULONG ul_reason_for_call, LPVOID lpReserved) { + return TRUE; +} diff --git a/installer/signplugin/nsis/api.h b/installer/signplugin/nsis/api.h new file mode 100644 index 0000000..eebbbf0 --- /dev/null +++ b/installer/signplugin/nsis/api.h @@ -0,0 +1,85 @@ +/* + * apih + * + * This file is a part of NSIS. + * + * Copyright (C) 1999-2018 Nullsoft and Contributors + * + * Licensed under the zlib/libpng license (the "License"); + * you may not use this file except in compliance with the License. + * + * Licence details can be found in the file COPYING. + * + * This software is provided 'as-is', without any express or implied + * warranty. + */ + +#ifndef _NSIS_EXEHEAD_API_H_ +#define _NSIS_EXEHEAD_API_H_ + +// Starting with NSIS 2.42, you can check the version of the plugin API in exec_flags->plugin_api_version +// The format is 0xXXXXYYYY where X is the major version and Y is the minor version (MAKELONG(y,x)) +// When doing version checks, always remember to use >=, ex: if (pX->exec_flags->plugin_api_version >= NSISPIAPIVER_1_0) {} + +#define NSISPIAPIVER_1_0 0x00010000 +#define NSISPIAPIVER_CURR NSISPIAPIVER_1_0 + +// NSIS Plug-In Callback Messages +enum NSPIM +{ + NSPIM_UNLOAD, // This is the last message a plugin gets, do final cleanup + NSPIM_GUIUNLOAD, // Called after .onGUIEnd +}; + +// Prototype for callbacks registered with extra_parameters->RegisterPluginCallback() +// Return NULL for unknown messages +// Should always be __cdecl for future expansion possibilities +typedef UINT_PTR (*NSISPLUGINCALLBACK)(enum NSPIM); + +// extra_parameters data structure containing other interesting stuff +// besides the stack, variables and HWND passed on to plug-ins. +typedef struct +{ + int autoclose; // SetAutoClose + int all_user_var; // SetShellVarContext: User context = 0, Machine context = 1 + int exec_error; // IfErrors + int abort; // IfAbort + int exec_reboot; // IfRebootFlag (NSIS_SUPPORT_REBOOT) + int reboot_called; // NSIS_SUPPORT_REBOOT + int XXX_cur_insttype; // Deprecated + int plugin_api_version; // Plug-in ABI. 
See NSISPIAPIVER_CURR (Note: used to be XXX_insttype_changed) + int silent; // IfSilent (NSIS_CONFIG_SILENT_SUPPORT) + int instdir_error; // GetInstDirError + int rtl; // 1 if $LANGUAGE is a RTL language + int errlvl; // SetErrorLevel + int alter_reg_view; // SetRegView: Default View = 0, Alternative View = (sizeof(void*) > 4 ? KEY_WOW64_32KEY : KEY_WOW64_64KEY) + int status_update; // SetDetailsPrint +} exec_flags_t; + +#ifndef NSISCALL +# define NSISCALL __stdcall +#endif +#if !defined(_WIN32) && !defined(LPTSTR) +# define LPTSTR TCHAR* +#endif + +typedef struct { + exec_flags_t *exec_flags; + int (NSISCALL *ExecuteCodeSegment)(int, HWND); + void (NSISCALL *validate_filename)(LPTSTR); + int (NSISCALL *RegisterPluginCallback)(HMODULE, NSISPLUGINCALLBACK); // returns 0 on success, 1 if already registered and < 0 on errors +} extra_parameters; + +// Definitions for page showing plug-ins +// See Ui.c to understand better how they're used + +// sent to the outer window to tell it to go to the next inner window +#define WM_NOTIFY_OUTER_NEXT (WM_USER+0x8) + +// custom pages should send this message to let NSIS know they're ready +#define WM_NOTIFY_CUSTOM_READY (WM_USER+0xd) + +// sent as wParam with WM_NOTIFY_OUTER_NEXT when user cancels - heed its warning +#define NOTIFY_BYE_BYE 'x' + +#endif /* _NSIS_EXEHEAD_API_H_ */ diff --git a/installer/signplugin/nsis/nsis_tchar.h b/installer/signplugin/nsis/nsis_tchar.h new file mode 100644 index 0000000..3f105ba --- /dev/null +++ b/installer/signplugin/nsis/nsis_tchar.h @@ -0,0 +1,229 @@ +/* + * nsis_tchar.h + * + * This file is a part of NSIS. + * + * Copyright (C) 1999-2018 Nullsoft and Contributors + * + * This software is provided 'as-is', without any express or implied + * warranty. + * + * For Unicode support by Jim Park -- 08/30/2007 + */ + +// Jim Park: Only those we use are listed here. 
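+// For example, with _UNICODE defined, _tcslen below maps to wcslen and
+// TCHAR is a wide character; in an ANSI build the same source compiles
+// against strlen and plain char.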
+ +#pragma once + +#ifdef _UNICODE + +#ifndef _T +#define __T(x) L ## x +#define _T(x) __T(x) +#define _TEXT(x) __T(x) +#endif + +#ifndef _TCHAR_DEFINED +#define _TCHAR_DEFINED +#if !defined(_NATIVE_WCHAR_T_DEFINED) && !defined(_WCHAR_T_DEFINED) +typedef unsigned short TCHAR; +#else +typedef wchar_t TCHAR; +#endif +#endif + + +// program +#define _tenviron _wenviron +#define __targv __wargv + +// printfs +#define _ftprintf fwprintf +#define _sntprintf _snwprintf +#if (defined(_MSC_VER) && (_MSC_VER<=1310||_MSC_FULL_VER<=140040310)) || defined(__MINGW32__) +# define _stprintf swprintf +#else +# define _stprintf _swprintf +#endif +#define _tprintf wprintf +#define _vftprintf vfwprintf +#define _vsntprintf _vsnwprintf +#if defined(_MSC_VER) && (_MSC_VER<=1310) +# define _vstprintf vswprintf +#else +# define _vstprintf _vswprintf +#endif + +// scanfs +#define _tscanf wscanf +#define _stscanf swscanf + +// string manipulations +#define _tcscat wcscat +#define _tcschr wcschr +#define _tcsclen wcslen +#define _tcscpy wcscpy +#define _tcsdup _wcsdup +#define _tcslen wcslen +#define _tcsnccpy wcsncpy +#define _tcsncpy wcsncpy +#define _tcsrchr wcsrchr +#define _tcsstr wcsstr +#define _tcstok wcstok + +// string comparisons +#define _tcscmp wcscmp +#define _tcsicmp _wcsicmp +#define _tcsncicmp _wcsnicmp +#define _tcsncmp wcsncmp +#define _tcsnicmp _wcsnicmp + +// upper / lower +#define _tcslwr _wcslwr +#define _tcsupr _wcsupr +#define _totlower towlower +#define _totupper towupper + +// conversions to numbers +#define _tcstoi64 _wcstoi64 +#define _tcstol wcstol +#define _tcstoul wcstoul +#define _tstof _wtof +#define _tstoi _wtoi +#define _tstoi64 _wtoi64 +#define _ttoi _wtoi +#define _ttoi64 _wtoi64 +#define _ttol _wtol + +// conversion from numbers to strings +#define _itot _itow +#define _ltot _ltow +#define _i64tot _i64tow +#define _ui64tot _ui64tow + +// file manipulations +#define _tfopen _wfopen +#define _topen _wopen +#define _tremove _wremove +#define _tunlink _wunlink + +// reading and writing to i/o +#define _fgettc fgetwc +#define _fgetts fgetws +#define _fputts fputws +#define _gettchar getwchar + +// directory +#define _tchdir _wchdir + +// environment +#define _tgetenv _wgetenv +#define _tsystem _wsystem + +// time +#define _tcsftime wcsftime + +#else // ANSI + +#ifndef _T +#define _T(x) x +#define _TEXT(x) x +#endif + +#ifndef _TCHAR_DEFINED +#define _TCHAR_DEFINED +typedef char TCHAR; +#endif + +// program +#define _tenviron environ +#define __targv __argv + +// printfs +#define _ftprintf fprintf +#define _sntprintf _snprintf +#define _stprintf sprintf +#define _tprintf printf +#define _vftprintf vfprintf +#define _vsntprintf _vsnprintf +#define _vstprintf vsprintf + +// scanfs +#define _tscanf scanf +#define _stscanf sscanf + +// string manipulations +#define _tcscat strcat +#define _tcschr strchr +#define _tcsclen strlen +#define _tcscnlen strnlen +#define _tcscpy strcpy +#define _tcsdup _strdup +#define _tcslen strlen +#define _tcsnccpy strncpy +#define _tcsrchr strrchr +#define _tcsstr strstr +#define _tcstok strtok + +// string comparisons +#define _tcscmp strcmp +#define _tcsicmp _stricmp +#define _tcsncmp strncmp +#define _tcsncicmp _strnicmp +#define _tcsnicmp _strnicmp + +// upper / lower +#define _tcslwr _strlwr +#define _tcsupr _strupr + +#define _totupper toupper +#define _totlower tolower + +// conversions to numbers +#define _tcstol strtol +#define _tcstoul strtoul +#define _tstof atof +#define _tstoi atoi +#define _tstoi64 _atoi64 +#define _tstoi64 _atoi64 +#define 
_ttoi atoi +#define _ttoi64 _atoi64 +#define _ttol atol + +// conversion from numbers to strings +#define _i64tot _i64toa +#define _itot _itoa +#define _ltot _ltoa +#define _ui64tot _ui64toa + +// file manipulations +#define _tfopen fopen +#define _topen _open +#define _tremove remove +#define _tunlink _unlink + +// reading and writing to i/o +#define _fgettc fgetc +#define _fgetts fgets +#define _fputts fputs +#define _gettchar getchar + +// directory +#define _tchdir _chdir + +// environment +#define _tgetenv getenv +#define _tsystem system + +// time +#define _tcsftime strftime + +#endif + +// is functions (the same in Unicode / ANSI) +#define _istgraph isgraph +#define _istascii __isascii + +#define __TFILE__ _T(__FILE__) +#define __TDATE__ _T(__DATE__) +#define __TTIME__ _T(__TIME__) diff --git a/installer/signplugin/nsis/pluginapi-x86-ansi.lib b/installer/signplugin/nsis/pluginapi-x86-ansi.lib new file mode 100644 index 0000000..4921639 Binary files /dev/null and b/installer/signplugin/nsis/pluginapi-x86-ansi.lib differ diff --git a/installer/signplugin/nsis/pluginapi-x86-unicode.lib b/installer/signplugin/nsis/pluginapi-x86-unicode.lib new file mode 100644 index 0000000..400c488 Binary files /dev/null and b/installer/signplugin/nsis/pluginapi-x86-unicode.lib differ diff --git a/installer/signplugin/nsis/pluginapi.h b/installer/signplugin/nsis/pluginapi.h new file mode 100644 index 0000000..63fe790 --- /dev/null +++ b/installer/signplugin/nsis/pluginapi.h @@ -0,0 +1,108 @@ +#ifndef ___NSIS_PLUGIN__H___ +#define ___NSIS_PLUGIN__H___ + +#ifdef __cplusplus +extern "C" { +#endif + +#include "api.h" +#include "nsis_tchar.h" // BUGBUG: Why cannot our plugins use the compilers tchar.h? + +#ifndef NSISCALL +# define NSISCALL WINAPI +#endif + +#define EXDLL_INIT() { \ + g_stringsize=string_size; \ + g_stacktop=stacktop; \ + g_variables=variables; } + +typedef struct _stack_t { + struct _stack_t *next; +#ifdef UNICODE + WCHAR text[1]; // this should be the length of g_stringsize when allocating +#else + char text[1]; +#endif +} stack_t; + +enum +{ +INST_0, // $0 +INST_1, // $1 +INST_2, // $2 +INST_3, // $3 +INST_4, // $4 +INST_5, // $5 +INST_6, // $6 +INST_7, // $7 +INST_8, // $8 +INST_9, // $9 +INST_R0, // $R0 +INST_R1, // $R1 +INST_R2, // $R2 +INST_R3, // $R3 +INST_R4, // $R4 +INST_R5, // $R5 +INST_R6, // $R6 +INST_R7, // $R7 +INST_R8, // $R8 +INST_R9, // $R9 +INST_CMDLINE, // $CMDLINE +INST_INSTDIR, // $INSTDIR +INST_OUTDIR, // $OUTDIR +INST_EXEDIR, // $EXEDIR +INST_LANG, // $LANGUAGE +__INST_LAST +}; + +extern unsigned int g_stringsize; +extern stack_t **g_stacktop; +extern LPTSTR g_variables; + +void NSISCALL pushstring(LPCTSTR str); +void NSISCALL pushintptr(INT_PTR value); +#define pushint(v) pushintptr((INT_PTR)(v)) +int NSISCALL popstring(LPTSTR str); // 0 on success, 1 on empty stack +int NSISCALL popstringn(LPTSTR str, int maxlen); // with length limit, pass 0 for g_stringsize +INT_PTR NSISCALL popintptr(); +#define popint() ( (int) popintptr() ) +int NSISCALL popint_or(); // with support for or'ing (2|4|8) +INT_PTR NSISCALL nsishelper_str_to_ptr(LPCTSTR s); +#define myatoi(s) ( (int) nsishelper_str_to_ptr(s) ) // converts a string to an integer +unsigned int NSISCALL myatou(LPCTSTR s); // converts a string to an unsigned integer, decimal only +int NSISCALL myatoi_or(LPCTSTR s); // with support for or'ing (2|4|8) +LPTSTR NSISCALL getuservariable(const int varnum); +void NSISCALL setuservariable(const int varnum, LPCTSTR var); + +#ifdef UNICODE +#define PopStringW(x) popstring(x) 
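+// In Unicode builds the NSIS stack already holds wide strings, so these
+// W variants are simple aliases of the generic helpers; only the A
+// variants below need real conversion functions.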
+#define PushStringW(x) pushstring(x) +#define SetUserVariableW(x,y) setuservariable(x,y) + +int NSISCALL PopStringA(LPSTR ansiStr); +void NSISCALL PushStringA(LPCSTR ansiStr); +void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr); +void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr); +void NSISCALL SetUserVariableA(const int varnum, LPCSTR ansiStr); + +#else +// ANSI defs + +#define PopStringA(x) popstring(x) +#define PushStringA(x) pushstring(x) +#define SetUserVariableA(x,y) setuservariable(x,y) + +int NSISCALL PopStringW(LPWSTR wideStr); +void NSISCALL PushStringW(LPWSTR wideStr); +void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr); +void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr); +void NSISCALL SetUserVariableW(const int varnum, LPCWSTR wideStr); + +#endif + +#ifdef __cplusplus +} +#endif + +#endif//!___NSIS_PLUGIN__H___ diff --git a/installer/signplugin/signplugin.sln b/installer/signplugin/signplugin.sln new file mode 100644 index 0000000..fd263d8 --- /dev/null +++ b/installer/signplugin/signplugin.sln @@ -0,0 +1,28 @@ + +Microsoft Visual Studio Solution File, Format Version 12.00 +# Visual Studio 15 +VisualStudioVersion = 15.0.26403.7 +MinimumVisualStudioVersion = 10.0.40219.1 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "signplugin", "signplugin.vcxproj", "{C6E4A1D7-ECBC-466E-9183-30727EF81533}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|x64 = Debug|x64 + Debug|x86 = Debug|x86 + Release|x64 = Release|x64 + Release|x86 = Release|x86 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.ActiveCfg = Debug|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.Build.0 = Debug|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.ActiveCfg = Debug|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.Build.0 = Debug|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.ActiveCfg = Release|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.Build.0 = Release|x64 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.ActiveCfg = Release|Win32 + {C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.Build.0 = Release|Win32 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection +EndGlobal diff --git a/installer/signplugin/signplugin.vcxproj b/installer/signplugin/signplugin.vcxproj new file mode 100644 index 0000000..1104e33 --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj @@ -0,0 +1,166 @@ + + + + + Debug + Win32 + + + Release + Win32 + + + Debug + x64 + + + Release + x64 + + + + 15.0 + {C6E4A1D7-ECBC-466E-9183-30727EF81533} + Win32Proj + 10.0.15063.0 + + + + DynamicLibrary + true + v141 + + + DynamicLibrary + false + v141 + false + + + Application + true + v141 + + + Application + false + v141 + + + + + + + + + + + + + + + + + + + + + true + + + true + false + + + + WIN32;_DEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions) + MultiThreadedDebugDLL + Level3 + ProgramDatabase + Disabled + + + MachineX86 + true + Windows + + + false + false + + + + + WIN32;NDEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions) + MultiThreaded + Level3 + ProgramDatabase + false + false + MinSpace + true + true + + + MachineX86 + false + Windows + true + true + true + DllMain + false + UseLinkTimeCodeGeneration + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + true + + + + + + + 
+ + \ No newline at end of file diff --git a/installer/signplugin/signplugin.vcxproj.filters b/installer/signplugin/signplugin.vcxproj.filters new file mode 100644 index 0000000..57b82ec --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj.filters @@ -0,0 +1,132 @@ + + + + + {4FC737F1-C7A5-4376-A066-2A32D752A2FF} + cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx + + + {93995380-89BD-4b04-88EB-625FBE52EBFB} + h;hh;hpp;hxx;hm;inl;inc;xsd + + + {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} + rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav + + + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Header Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + Source Files + + + + + + + + + \ No newline at end of file diff --git a/installer/signplugin/signplugin.vcxproj.user b/installer/signplugin/signplugin.vcxproj.user new file mode 100644 index 0000000..be25078 --- /dev/null +++ b/installer/signplugin/signplugin.vcxproj.user @@ -0,0 +1,4 @@ + + + + \ No newline at end of file diff --git a/installer/signplugin/tiny/c25519.c b/installer/signplugin/tiny/c25519.c new file mode 100644 index 0000000..a9c9f08 --- /dev/null +++ b/installer/signplugin/tiny/c25519.c @@ -0,0 +1,124 @@ +/* Curve25519 (Montgomery form) + * Daniel Beer , 18 Apr 2014 + * + * This file is in the public domain. 
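+ *
+ * A minimal shared-secret sketch on top of this API ('secret' and
+ * 'peer_x' are hypothetical 32-byte buffers; 'secret' holds random
+ * bytes):
+ *
+ *   uint8_t shared[F25519_SIZE];
+ *   c25519_prepare(secret);
+ *   c25519_smult(shared, peer_x, secret);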
+ */ + +#include "c25519.h" + +const uint8_t c25519_base_x[F25519_SIZE] = {9}; + +/* Double an X-coordinate */ +static void xc_double(uint8_t *x3, uint8_t *z3, + const uint8_t *x1, const uint8_t *z1) +{ + /* Explicit formulas database: dbl-1987-m + * + * source 1987 Montgomery "Speeding the Pollard and elliptic + * curve methods of factorization", page 261, fourth display + * compute X3 = (X1^2-Z1^2)^2 + * compute Z3 = 4 X1 Z1 (X1^2 + a X1 Z1 + Z1^2) + */ + uint8_t x1sq[F25519_SIZE]; + uint8_t z1sq[F25519_SIZE]; + uint8_t x1z1[F25519_SIZE]; + uint8_t a[F25519_SIZE]; + + f25519_mul__distinct(x1sq, x1, x1); + f25519_mul__distinct(z1sq, z1, z1); + f25519_mul__distinct(x1z1, x1, z1); + + f25519_sub(a, x1sq, z1sq); + f25519_mul__distinct(x3, a, a); + + f25519_mul_c(a, x1z1, 486662); + f25519_add(a, x1sq, a); + f25519_add(a, z1sq, a); + f25519_mul__distinct(x1sq, x1z1, a); + f25519_mul_c(z3, x1sq, 4); +} + +/* Differential addition */ +static void xc_diffadd(uint8_t *x5, uint8_t *z5, + const uint8_t *x1, const uint8_t *z1, + const uint8_t *x2, const uint8_t *z2, + const uint8_t *x3, const uint8_t *z3) +{ + /* Explicit formulas database: dbl-1987-m3 + * + * source 1987 Montgomery "Speeding the Pollard and elliptic curve + * methods of factorization", page 261, fifth display, plus + * common-subexpression elimination + * compute A = X2+Z2 + * compute B = X2-Z2 + * compute C = X3+Z3 + * compute D = X3-Z3 + * compute DA = D A + * compute CB = C B + * compute X5 = Z1(DA+CB)^2 + * compute Z5 = X1(DA-CB)^2 + */ + uint8_t da[F25519_SIZE]; + uint8_t cb[F25519_SIZE]; + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + + f25519_add(a, x2, z2); + f25519_sub(b, x3, z3); /* D */ + f25519_mul__distinct(da, a, b); + + f25519_sub(b, x2, z2); + f25519_add(a, x3, z3); /* C */ + f25519_mul__distinct(cb, a, b); + + f25519_add(a, da, cb); + f25519_mul__distinct(b, a, a); + f25519_mul__distinct(x5, z1, b); + + f25519_sub(a, da, cb); + f25519_mul__distinct(b, a, a); + f25519_mul__distinct(z5, x1, b); +} + +void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e) +{ + /* Current point: P_m */ + uint8_t xm[F25519_SIZE]; + uint8_t zm[F25519_SIZE] = {1}; + + /* Predecessor: P_(m-1) */ + uint8_t xm1[F25519_SIZE] = {1}; + uint8_t zm1[F25519_SIZE] = {0}; + + int i; + + /* Note: bit 254 is assumed to be 1 */ + f25519_copy(xm, q); + + for (i = 253; i >= 0; i--) { + const int bit = (e[i >> 3] >> (i & 7)) & 1; + uint8_t xms[F25519_SIZE]; + uint8_t zms[F25519_SIZE]; + + /* From P_m and P_(m-1), compute P_(2m) and P_(2m-1) */ + xc_diffadd(xm1, zm1, q, f25519_one, xm, zm, xm1, zm1); + xc_double(xm, zm, xm, zm); + + /* Compute P_(2m+1) */ + xc_diffadd(xms, zms, xm1, zm1, xm, zm, q, f25519_one); + + /* Select: + * bit = 1 --> (P_(2m+1), P_(2m)) + * bit = 0 --> (P_(2m), P_(2m-1)) + */ + f25519_select(xm1, xm1, xm, bit); + f25519_select(zm1, zm1, zm, bit); + f25519_select(xm, xm, xms, bit); + f25519_select(zm, zm, zms, bit); + } + + /* Freeze out of projective coordinates */ + f25519_inv__distinct(zm1, zm); + f25519_mul__distinct(result, zm1, xm); + f25519_normalize(result); +} diff --git a/installer/signplugin/tiny/c25519.h b/installer/signplugin/tiny/c25519.h new file mode 100644 index 0000000..4596438 --- /dev/null +++ b/installer/signplugin/tiny/c25519.h @@ -0,0 +1,48 @@ +/* Curve25519 (Montgomery form) + * Daniel Beer , 18 Apr 2014 + * + * This file is in the public domain. 
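+ *
+ * Deriving the public value for a fresh secret is a scalar multiply by
+ * the base point (a sketch; 'secret' is a hypothetical buffer holding
+ * 32 random bytes):
+ *
+ *   uint8_t pub[F25519_SIZE];
+ *   c25519_prepare(secret);
+ *   c25519_smult(pub, c25519_base_x, secret);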
+ */ + +#ifndef C25519_H_ +#define C25519_H_ + +#include +#include "f25519.h" + +/* Curve25519 has the equation over F(p = 2^255-19): + * + * y^2 = x^3 + 486662x^2 + x + * + * 486662 = 4A+2, where A = 121665. This is a Montgomery curve. + * + * For more information, see: + * + * Bernstein, D.J. (2006) "Curve25519: New Diffie-Hellman speed + * records". Document ID: 4230efdfa673480fc079449d90f322c0. + */ + +/* This is the site of a Curve25519 exponent (private key) */ +#define C25519_EXPONENT_SIZE 32 + +/* Having generated 32 random bytes, you should call this function to + * finalize the generated key. + */ +static inline void c25519_prepare(uint8_t *key) +{ + key[0] &= 0xf8; + key[31] &= 0x7f; + key[31] |= 0x40; +} + +/* X-coordinate of the base point */ +extern const uint8_t c25519_base_x[F25519_SIZE]; + +/* X-coordinate scalar multiply: given the X-coordinate of q, return the + * X-coordinate of e*q. + * + * result and q are field elements. e is an exponent. + */ +void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e); + +#endif diff --git a/installer/signplugin/tiny/ed25519.c b/installer/signplugin/tiny/ed25519.c new file mode 100644 index 0000000..51ac462 --- /dev/null +++ b/installer/signplugin/tiny/ed25519.c @@ -0,0 +1,320 @@ +/* Edwards curve operations + * Daniel Beer , 9 Jan 2014 + * + * This file is in the public domain. + */ + +#include "ed25519.h" + +/* Base point is (numbers wrapped): + * + * x = 151122213495354007725011514095885315114 + * 54012693041857206046113283949847762202 + * y = 463168356949264781694283940034751631413 + * 07993866256225615783033603165251855960 + * + * y is derived by transforming the original Montgomery base (u=9). x + * is the corresponding positive coordinate for the new curve equation. + * t is x*y. + */ +const struct ed25519_pt ed25519_base = { + .x = { + 0x1a, 0xd5, 0x25, 0x8f, 0x60, 0x2d, 0x56, 0xc9, + 0xb2, 0xa7, 0x25, 0x95, 0x60, 0xc7, 0x2c, 0x69, + 0x5c, 0xdc, 0xd6, 0xfd, 0x31, 0xe2, 0xa4, 0xc0, + 0xfe, 0x53, 0x6e, 0xcd, 0xd3, 0x36, 0x69, 0x21 + }, + .y = { + 0x58, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, + 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66 + }, + .t = { + 0xa3, 0xdd, 0xb7, 0xa5, 0xb3, 0x8a, 0xde, 0x6d, + 0xf5, 0x52, 0x51, 0x77, 0x80, 0x9f, 0xf0, 0x20, + 0x7d, 0xe3, 0xab, 0x64, 0x8e, 0x4e, 0xea, 0x66, + 0x65, 0x76, 0x8b, 0xd7, 0x0f, 0x5f, 0x87, 0x67 + }, + .z = {1, 0} +}; + +const struct ed25519_pt ed25519_neutral = { + .x = {0}, + .y = {1, 0}, + .t = {0}, + .z = {1, 0} +}; + +/* Conversion to and from projective coordinates */ +void ed25519_project(struct ed25519_pt *p, + const uint8_t *x, const uint8_t *y) +{ + f25519_copy(p->x, x); + f25519_copy(p->y, y); + f25519_load(p->z, 1); + f25519_mul__distinct(p->t, x, y); +} + +void ed25519_unproject(uint8_t *x, uint8_t *y, + const struct ed25519_pt *p) +{ + uint8_t z1[F25519_SIZE]; + + f25519_inv__distinct(z1, p->z); + f25519_mul__distinct(x, p->x, z1); + f25519_mul__distinct(y, p->y, z1); + + f25519_normalize(x); + f25519_normalize(y); +} + +/* Compress/uncompress points. We compress points by storing the x + * coordinate and the parity of the y coordinate. 
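+ * (As ed25519_pack() below shows, it is in fact the y coordinate that
+ * is stored, with the parity of x folded into the top bit of the last
+ * byte.)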
+ * + * Rearranging the curve equation, we obtain explicit formulae for the + * coordinates: + * + * x = sqrt((y^2-1) / (1+dy^2)) + * y = sqrt((x^2+1) / (1-dx^2)) + * + * Where d = (-121665/121666), or: + * + * d = 370957059346694393431380835087545651895 + * 42113879843219016388785533085940283555 + */ + +static const uint8_t ed25519_d[F25519_SIZE] = { + 0xa3, 0x78, 0x59, 0x13, 0xca, 0x4d, 0xeb, 0x75, + 0xab, 0xd8, 0x41, 0x41, 0x4d, 0x0a, 0x70, 0x00, + 0x98, 0xe8, 0x79, 0x77, 0x79, 0x40, 0xc7, 0x8c, + 0x73, 0xfe, 0x6f, 0x2b, 0xee, 0x6c, 0x03, 0x52 +}; + +void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y) +{ + uint8_t tmp[F25519_SIZE]; + uint8_t parity; + + f25519_copy(tmp, x); + f25519_normalize(tmp); + parity = (tmp[0] & 1) << 7; + + f25519_copy(c, y); + f25519_normalize(c); + c[31] |= parity; +} + +uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *comp) +{ + const int parity = comp[31] >> 7; + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + + /* Unpack y */ + f25519_copy(y, comp); + y[31] &= 127; + + /* Compute c = y^2 */ + f25519_mul__distinct(c, y, y); + + /* Compute b = (1+dy^2)^-1 */ + f25519_mul__distinct(b, c, ed25519_d); + f25519_add(a, b, f25519_one); + f25519_inv__distinct(b, a); + + /* Compute a = y^2-1 */ + f25519_sub(a, c, f25519_one); + + /* Compute c = a*b = (y^2-1)/(1-dy^2) */ + f25519_mul__distinct(c, a, b); + + /* Compute a, b = +/-sqrt(c), if c is square */ + f25519_sqrt(a, c); + f25519_neg(b, a); + + /* Select one of them, based on the compressed parity bit */ + f25519_select(x, a, b, (a[0] ^ parity) & 1); + + /* Verify that x^2 = c */ + f25519_mul__distinct(a, x, x); + f25519_normalize(a); + f25519_normalize(c); + + return f25519_eq(a, c); +} + +/* k = 2d */ +static const uint8_t ed25519_k[F25519_SIZE] = { + 0x59, 0xf1, 0xb2, 0x26, 0x94, 0x9b, 0xd6, 0xeb, + 0x56, 0xb1, 0x83, 0x82, 0x9a, 0x14, 0xe0, 0x00, + 0x30, 0xd1, 0xf3, 0xee, 0xf2, 0x80, 0x8e, 0x19, + 0xe7, 0xfc, 0xdf, 0x56, 0xdc, 0xd9, 0x06, 0x24 +}; + +void ed25519_add(struct ed25519_pt *r, + const struct ed25519_pt *p1, const struct ed25519_pt *p2) +{ + /* Explicit formulas database: add-2008-hwcd-3 + * + * source 2008 Hisil--Wong--Carter--Dawson, + * http://eprint.iacr.org/2008/522, Section 3.1 + * appliesto extended-1 + * parameter k + * assume k = 2 d + * compute A = (Y1-X1)(Y2-X2) + * compute B = (Y1+X1)(Y2+X2) + * compute C = T1 k T2 + * compute D = Z1 2 Z2 + * compute E = B - A + * compute F = D - C + * compute G = D + C + * compute H = B + A + * compute X3 = E F + * compute Y3 = G H + * compute T3 = E H + * compute Z3 = F G + */ + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + uint8_t d[F25519_SIZE]; + uint8_t e[F25519_SIZE]; + uint8_t f[F25519_SIZE]; + uint8_t g[F25519_SIZE]; + uint8_t h[F25519_SIZE]; + + /* A = (Y1-X1)(Y2-X2) */ + f25519_sub(c, p1->y, p1->x); + f25519_sub(d, p2->y, p2->x); + f25519_mul__distinct(a, c, d); + + /* B = (Y1+X1)(Y2+X2) */ + f25519_add(c, p1->y, p1->x); + f25519_add(d, p2->y, p2->x); + f25519_mul__distinct(b, c, d); + + /* C = T1 k T2 */ + f25519_mul__distinct(d, p1->t, p2->t); + f25519_mul__distinct(c, d, ed25519_k); + + /* D = Z1 2 Z2 */ + f25519_mul__distinct(d, p1->z, p2->z); + f25519_add(d, d, d); + + /* E = B - A */ + f25519_sub(e, b, a); + + /* F = D - C */ + f25519_sub(f, d, c); + + /* G = D + C */ + f25519_add(g, d, c); + + /* H = B + A */ + f25519_add(h, b, a); + + /* X3 = E F */ + f25519_mul__distinct(r->x, e, f); + + /* Y3 = G H */ + f25519_mul__distinct(r->y, g, h); + + /* 
T3 = E H */ + f25519_mul__distinct(r->t, e, h); + + /* Z3 = F G */ + f25519_mul__distinct(r->z, f, g); +} + +void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *p) +{ + /* Explicit formulas database: dbl-2008-hwcd + * + * source 2008 Hisil--Wong--Carter--Dawson, + * http://eprint.iacr.org/2008/522, Section 3.3 + * compute A = X1^2 + * compute B = Y1^2 + * compute C = 2 Z1^2 + * compute D = a A + * compute E = (X1+Y1)^2-A-B + * compute G = D + B + * compute F = G - C + * compute H = D - B + * compute X3 = E F + * compute Y3 = G H + * compute T3 = E H + * compute Z3 = F G + */ + uint8_t a[F25519_SIZE]; + uint8_t b[F25519_SIZE]; + uint8_t c[F25519_SIZE]; + uint8_t e[F25519_SIZE]; + uint8_t f[F25519_SIZE]; + uint8_t g[F25519_SIZE]; + uint8_t h[F25519_SIZE]; + + /* A = X1^2 */ + f25519_mul__distinct(a, p->x, p->x); + + /* B = Y1^2 */ + f25519_mul__distinct(b, p->y, p->y); + + /* C = 2 Z1^2 */ + f25519_mul__distinct(c, p->z, p->z); + f25519_add(c, c, c); + + /* D = a A (alter sign) */ + /* E = (X1+Y1)^2-A-B */ + f25519_add(f, p->x, p->y); + f25519_mul__distinct(e, f, f); + f25519_sub(e, e, a); + f25519_sub(e, e, b); + + /* G = D + B */ + f25519_sub(g, b, a); + + /* F = G - C */ + f25519_sub(f, g, c); + + /* H = D - B */ + f25519_neg(h, b); + f25519_sub(h, h, a); + + /* X3 = E F */ + f25519_mul__distinct(r->x, e, f); + + /* Y3 = G H */ + f25519_mul__distinct(r->y, g, h); + + /* T3 = E H */ + f25519_mul__distinct(r->t, e, h); + + /* Z3 = F G */ + f25519_mul__distinct(r->z, f, g); +} + +void ed25519_smult(struct ed25519_pt *r_out, const struct ed25519_pt *p, + const uint8_t *e) +{ + struct ed25519_pt r; + int i; + + ed25519_copy(&r, &ed25519_neutral); + + for (i = 255; i >= 0; i--) { + const uint8_t bit = (e[i >> 3] >> (i & 7)) & 1; + struct ed25519_pt s; + + ed25519_double(&r, &r); + ed25519_add(&s, &r, p); + + f25519_select(r.x, r.x, s.x, bit); + f25519_select(r.y, r.y, s.y, bit); + f25519_select(r.z, r.z, s.z, bit); + f25519_select(r.t, r.t, s.t, bit); + } + + ed25519_copy(r_out, &r); +} diff --git a/installer/signplugin/tiny/ed25519.h b/installer/signplugin/tiny/ed25519.h new file mode 100644 index 0000000..62f0120 --- /dev/null +++ b/installer/signplugin/tiny/ed25519.h @@ -0,0 +1,82 @@ +/* Edwards curve operations + * Daniel Beer , 9 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef ED25519_H_ +#define ED25519_H_ + +#include "f25519.h" + +/* This is not the Ed25519 signature system. Rather, we're implementing + * basic operations on the twisted Edwards curve over (Z mod 2^255-19): + * + * -x^2 + y^2 = 1 - (121665/121666)x^2y^2 + * + * With the positive-x base point y = 4/5. + * + * These functions will not leak secret data through timing. + * + * For more information, see: + * + * Bernstein, D.J. & Lange, T. (2007) "Faster addition and doubling on + * elliptic curves". Document ID: 95616567a6ba20f575c5f25e7cebaf83. + * + * Hisil, H. & Wong, K K. & Carter, G. & Dawson, E. (2008) "Twisted + * Edwards curves revisited". Advances in Cryptology, ASIACRYPT 2008, + * Vol. 5350, pp. 326-343. 
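+ *
+ * A typical fixed-base scalar multiplication (a sketch; 'e' is a
+ * hypothetical 32-byte exponent already clamped with ed25519_prepare()):
+ *
+ *   struct ed25519_pt p;
+ *   uint8_t x[F25519_SIZE], y[F25519_SIZE], c[ED25519_PACK_SIZE];
+ *   ed25519_smult(&p, &ed25519_base, e);
+ *   ed25519_unproject(x, y, &p);
+ *   ed25519_pack(c, x, y);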
+ */ + +/* Projective coordinates */ +struct ed25519_pt { + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + uint8_t t[F25519_SIZE]; + uint8_t z[F25519_SIZE]; +}; + +extern const struct ed25519_pt ed25519_base; +extern const struct ed25519_pt ed25519_neutral; + +/* Convert between projective and affine coordinates (x/y in F25519) */ +void ed25519_project(struct ed25519_pt *p, + const uint8_t *x, const uint8_t *y); + +void ed25519_unproject(uint8_t *x, uint8_t *y, + const struct ed25519_pt *p); + +/* Compress/uncompress points. try_unpack() will check that the + * compressed point is on the curve, returning 1 if the unpacked point + * is valid, and 0 otherwise. + */ +#define ED25519_PACK_SIZE F25519_SIZE + +void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y); +uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *c); + +/* Add, double and scalar multiply */ +#define ED25519_EXPONENT_SIZE 32 + +/* Prepare an exponent by clamping appropriate bits */ +static inline void ed25519_prepare(uint8_t *e) +{ + e[0] &= 0xf8; + e[31] &= 0x7f; + e[31] |= 0x40; +} + +/* Order of the group generated by the base point */ +static inline void ed25519_copy(struct ed25519_pt *dst, + const struct ed25519_pt *src) +{ + memcpy(dst, src, sizeof(*dst)); +} + +void ed25519_add(struct ed25519_pt *r, + const struct ed25519_pt *a, const struct ed25519_pt *b); +void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *a); +void ed25519_smult(struct ed25519_pt *r, const struct ed25519_pt *a, + const uint8_t *e); + +#endif diff --git a/installer/signplugin/tiny/edsign.c b/installer/signplugin/tiny/edsign.c new file mode 100644 index 0000000..bf131a5 --- /dev/null +++ b/installer/signplugin/tiny/edsign.c @@ -0,0 +1,168 @@ +/* Edwards curve signature system + * Daniel Beer , 22 Apr 2014 + * + * This file is in the public domain. 
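+ *
+ * Typical round trip through the functions below (a sketch; 'secret',
+ * 'msg' and 'len' are hypothetical):
+ *
+ *   uint8_t pub[EDSIGN_PUBLIC_KEY_SIZE], sig[EDSIGN_SIGNATURE_SIZE];
+ *   edsign_sec_to_pub(pub, secret);
+ *   edsign_sign(sig, pub, secret, msg, len);
+ *   uint8_t ok = edsign_verify(sig, pub, msg, len);  (non-zero if valid)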
+ */ + +#include "ed25519.h" +#include "sha512.h" +#include "fprime.h" +#include "edsign.h" + +#define EXPANDED_SIZE 64 + +static const uint8_t ed25519_order[FPRIME_SIZE] = { + 0xed, 0xd3, 0xf5, 0x5c, 0x1a, 0x63, 0x12, 0x58, + 0xd6, 0x9c, 0xf7, 0xa2, 0xde, 0xf9, 0xde, 0x14, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10 +}; + +static void expand_key(uint8_t *expanded, const uint8_t *secret) +{ + struct sha512_state s; + + sha512_init(&s); + sha512_final(&s, secret, EDSIGN_SECRET_KEY_SIZE); + sha512_get(&s, expanded, 0, EXPANDED_SIZE); + ed25519_prepare(expanded); +} + +static uint8_t upp(struct ed25519_pt *p, const uint8_t *packed) +{ + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + uint8_t ok = ed25519_try_unpack(x, y, packed); + + ed25519_project(p, x, y); + return ok; +} + +static void pp(uint8_t *packed, const struct ed25519_pt *p) +{ + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + + ed25519_unproject(x, y, p); + ed25519_pack(packed, x, y); +} + +static void sm_pack(uint8_t *r, const uint8_t *k) +{ + struct ed25519_pt p; + + ed25519_smult(&p, &ed25519_base, k); + pp(r, &p); +} + +void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret) +{ + uint8_t expanded[EXPANDED_SIZE]; + + expand_key(expanded, secret); + sm_pack(pub, expanded); +} + +static void hash_with_prefix(uint8_t *out_fp, + uint8_t *init_block, unsigned int prefix_size, + const uint8_t *message, size_t len) +{ + struct sha512_state s; + + sha512_init(&s); + + if (len < SHA512_BLOCK_SIZE && len + prefix_size < SHA512_BLOCK_SIZE) { + memcpy(init_block + prefix_size, message, len); + sha512_final(&s, init_block, len + prefix_size); + } else { + size_t i; + + memcpy(init_block + prefix_size, message, + SHA512_BLOCK_SIZE - prefix_size); + sha512_block(&s, init_block); + + for (i = SHA512_BLOCK_SIZE - prefix_size; + i + SHA512_BLOCK_SIZE <= len; + i += SHA512_BLOCK_SIZE) + sha512_block(&s, message + i); + + sha512_final(&s, message + i, len + prefix_size); + } + + sha512_get(&s, init_block, 0, SHA512_HASH_SIZE); + fprime_from_bytes(out_fp, init_block, SHA512_HASH_SIZE, ed25519_order); +} + +static void generate_k(uint8_t *k, const uint8_t *kgen_key, + const uint8_t *message, size_t len) +{ + uint8_t block[SHA512_BLOCK_SIZE]; + + memcpy(block, kgen_key, 32); + hash_with_prefix(k, block, 32, message, len); +} + +static void hash_message(uint8_t *z, const uint8_t *r, const uint8_t *a, + const uint8_t *m, size_t len) +{ + uint8_t block[SHA512_BLOCK_SIZE]; + + memcpy(block, r, 32); + memcpy(block + 32, a, 32); + hash_with_prefix(z, block, 64, m, len); +} + +void edsign_sign(uint8_t *signature, const uint8_t *pub, + const uint8_t *secret, + const uint8_t *message, size_t len) +{ + uint8_t expanded[EXPANDED_SIZE]; + uint8_t e[FPRIME_SIZE]; + uint8_t s[FPRIME_SIZE]; + uint8_t k[FPRIME_SIZE]; + uint8_t z[FPRIME_SIZE]; + + expand_key(expanded, secret); + + /* Generate k and R = kB */ + generate_k(k, expanded + 32, message, len); + sm_pack(signature, k); + + /* Compute z = H(R, A, M) */ + hash_message(z, signature, pub, message, len); + + /* Obtain e */ + fprime_from_bytes(e, expanded, 32, ed25519_order); + + /* Compute s = ze + k */ + fprime_mul(s, z, e, ed25519_order); + fprime_add(s, k, ed25519_order); + memcpy(signature + 32, s, 32); +} + +uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub, + const uint8_t *message, size_t len) +{ + struct ed25519_pt p; + struct ed25519_pt q; + uint8_t lhs[F25519_SIZE]; + uint8_t rhs[F25519_SIZE]; + uint8_t z[FPRIME_SIZE]; 
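+	/* ok is cleared if either compressed point fails to decode; the
+	   final result additionally requires lhs == rhs below. */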
+ uint8_t ok = 1; + + /* Compute z = H(R, A, M) */ + hash_message(z, signature, pub, message, len); + + /* sB = (ze + k)B = ... */ + sm_pack(lhs, signature + 32); + + /* ... = zA + R */ + ok &= upp(&p, pub); + ed25519_smult(&p, &p, z); + ok &= upp(&q, signature); + ed25519_add(&p, &p, &q); + pp(rhs, &p); + + /* Equal? */ + return ok & f25519_eq(lhs, rhs); +} diff --git a/installer/signplugin/tiny/edsign.h b/installer/signplugin/tiny/edsign.h new file mode 100644 index 0000000..85e2208 --- /dev/null +++ b/installer/signplugin/tiny/edsign.h @@ -0,0 +1,51 @@ +/* Edwards curve signature system + * Daniel Beer , 22 Apr 2014 + * + * This file is in the public domain. + */ + +#ifndef EDSIGN_H_ +#define EDSIGN_H_ + +#include +#include + +/* This is the Ed25519 signature system, as described in: + * + * Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin + * Yang. High-speed high-security signatures. Journal of Cryptographic + * Engineering 2 (2012), 77-89. Document ID: + * a1a62a2f76d23f65d622484ddd09caf8. URL: + * http://cr.yp.to/papers.html#ed25519. Date: 2011.09.26. + * + * The format and calculation of signatures is compatible with the + * Ed25519 implementation in SUPERCOP. Note, however, that our secret + * keys are half the size: we don't store a copy of the public key in + * the secret key (we generate it on demand). + */ + +/* Any string of 32 random bytes is a valid secret key. There is no + * clamping of bits, because we don't use the key directly as an + * exponent (the exponent is derived from part of a key expansion). + */ +#define EDSIGN_SECRET_KEY_SIZE 32 + +/* Given a secret key, produce the public key (a packed Edwards-curve + * point). + */ +#define EDSIGN_PUBLIC_KEY_SIZE 32 + +void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret); + +/* Produce a signature for a message. */ +#define EDSIGN_SIGNATURE_SIZE 64 + +void edsign_sign(uint8_t *signature, const uint8_t *pub, + const uint8_t *secret, + const uint8_t *message, size_t len); + +/* Verify a message signature. Returns non-zero if ok. */ +uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub, + const uint8_t *message, size_t len); + +#endif diff --git a/installer/signplugin/tiny/f25519.c b/installer/signplugin/tiny/f25519.c new file mode 100644 index 0000000..3b06fa6 --- /dev/null +++ b/installer/signplugin/tiny/f25519.c @@ -0,0 +1,324 @@ +/* Arithmetic mod p = 2^255-19 + * Daniel Beer , 5 Jan 2014 + * + * This file is in the public domain. + */ + +#include "f25519.h" + +const uint8_t f25519_zero[F25519_SIZE] = {0}; +const uint8_t f25519_one[F25519_SIZE] = {1}; + +void f25519_load(uint8_t *x, uint32_t c) +{ + unsigned int i; + + for (i = 0; i < sizeof(c); i++) { + x[i] = c; + c >>= 8; + } + + for (; i < F25519_SIZE; i++) + x[i] = 0; +} + +void f25519_normalize(uint8_t *x) +{ + uint8_t minusp[F25519_SIZE]; + uint16_t c; + int i; + + /* Reduce using 2^255 = 19 mod p */ + c = (x[31] >> 7) * 19; + x[31] &= 127; + + for (i = 0; i < F25519_SIZE; i++) { + c += x[i]; + x[i] = c; + c >>= 8; + } + + /* The number is now less than 2^255 + 18, and therefore less than + * 2p. Try subtracting p, and conditionally load the subtracted + * value if underflow did not occur. 
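+ * Both candidates are always computed and f25519_select() picks one,
+ * so timing does not depend on the value being reduced.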
+ */ + c = 19; + + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += x[i]; + minusp[i] = c; + c >>= 8; + } + + c += ((uint16_t)x[i]) - 128; + minusp[31] = c; + + /* Load x-p if no underflow */ + f25519_select(x, minusp, x, (c >> 15) & 1); +} + +uint8_t f25519_eq(const uint8_t *x, const uint8_t *y) +{ + uint8_t sum = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) + sum |= x[i] ^ y[i]; + + sum |= (sum >> 4); + sum |= (sum >> 2); + sum |= (sum >> 1); + + return (sum ^ 1) & 1; +} + +void f25519_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition) +{ + const uint8_t mask = -condition; + int i; + + for (i = 0; i < F25519_SIZE; i++) + dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i])); +} + +void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint16_t c = 0; + int i; + + /* Add */ + for (i = 0; i < F25519_SIZE; i++) { + c >>= 8; + c += ((uint16_t)a[i]) + ((uint16_t)b[i]); + r[i] = c; + } + + /* Reduce with 2^255 = 19 mod p */ + r[31] &= 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint32_t c = 0; + int i; + + /* Calculate a + 2p - b, to avoid underflow */ + c = 218; + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += 65280 + ((uint32_t)a[i]) - ((uint32_t)b[i]); + r[i] = c; + c >>= 8; + } + + c += ((uint32_t)a[31]) - ((uint32_t)b[31]); + r[31] = c & 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_neg(uint8_t *r, const uint8_t *a) +{ + uint32_t c = 0; + int i; + + /* Calculate 2p - a, to avoid underflow */ + c = 218; + for (i = 0; i + 1 < F25519_SIZE; i++) { + c += 65280 - ((uint32_t)a[i]); + r[i] = c; + c >>= 8; + } + + c -= ((uint32_t)a[31]); + r[31] = c & 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint32_t c = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) { + int j; + + c >>= 8; + for (j = 0; j <= i; j++) + c += ((uint32_t)a[j]) * ((uint32_t)b[i - j]); + + for (; j < F25519_SIZE; j++) + c += ((uint32_t)a[j]) * + ((uint32_t)b[i + F25519_SIZE - j]) * 38; + + r[i] = c; + } + + r[31] &= 127; + c = (c >> 7) * 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b) +{ + uint8_t tmp[F25519_SIZE]; + + f25519_mul__distinct(tmp, a, b); + f25519_copy(r, tmp); +} + +void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b) +{ + uint32_t c = 0; + int i; + + for (i = 0; i < F25519_SIZE; i++) { + c >>= 8; + c += b * ((uint32_t)a[i]); + r[i] = c; + } + + r[31] &= 127; + c >>= 7; + c *= 19; + + for (i = 0; i < F25519_SIZE; i++) { + c += r[i]; + r[i] = c; + c >>= 8; + } +} + +void f25519_inv__distinct(uint8_t *r, const uint8_t *x) +{ + uint8_t s[F25519_SIZE]; + int i; + + /* This is a prime field, so by Fermat's little theorem: + * + * x^(p-1) = 1 mod p + * + * Therefore, raise to (p-2) = 2^255-21 to get a multiplicative + * inverse. + * + * This is a 255-bit binary number with the digits: + * + * 11111111... 01011 + * + * We compute the result by the usual binary chain, but + * alternate between keeping the accumulator in r and s, so as + * to avoid copying temporaries. 
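+ *
+ * (That is 250 leading one-bits followed by 01011; the digit comments
+ * below track this expansion.)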
+ */ + + /* 1 1 */ + f25519_mul__distinct(s, x, x); + f25519_mul__distinct(r, s, x); + + /* 1 x 248 */ + for (i = 0; i < 248; i++) { + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); + } + + /* 0 */ + f25519_mul__distinct(s, r, r); + + /* 1 */ + f25519_mul__distinct(r, s, s); + f25519_mul__distinct(s, r, x); + + /* 0 */ + f25519_mul__distinct(r, s, s); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); +} + +void f25519_inv(uint8_t *r, const uint8_t *x) +{ + uint8_t tmp[F25519_SIZE]; + + f25519_inv__distinct(tmp, x); + f25519_copy(r, tmp); +} + +/* Raise x to the power of (p-5)/8 = 2^252-3, using s for temporary + * storage. + */ +static void exp2523(uint8_t *r, const uint8_t *x, uint8_t *s) +{ + int i; + + /* This number is a 252-bit number with the binary expansion: + * + * 111111... 01 + */ + + /* 1 1 */ + f25519_mul__distinct(r, x, x); + f25519_mul__distinct(s, r, x); + + /* 1 x 248 */ + for (i = 0; i < 248; i++) { + f25519_mul__distinct(r, s, s); + f25519_mul__distinct(s, r, x); + } + + /* 0 */ + f25519_mul__distinct(r, s, s); + + /* 1 */ + f25519_mul__distinct(s, r, r); + f25519_mul__distinct(r, s, x); +} + +void f25519_sqrt(uint8_t *r, const uint8_t *a) +{ + uint8_t v[F25519_SIZE]; + uint8_t i[F25519_SIZE]; + uint8_t x[F25519_SIZE]; + uint8_t y[F25519_SIZE]; + + /* v = (2a)^((p-5)/8) [x = 2a] */ + f25519_mul_c(x, a, 2); + exp2523(v, x, y); + + /* i = 2av^2 - 1 */ + f25519_mul__distinct(y, v, v); + f25519_mul__distinct(i, x, y); + f25519_load(y, 1); + f25519_sub(i, i, y); + + /* r = avi */ + f25519_mul__distinct(x, v, a); + f25519_mul__distinct(r, x, i); +} diff --git a/installer/signplugin/tiny/f25519.h b/installer/signplugin/tiny/f25519.h new file mode 100644 index 0000000..4cfa5ec --- /dev/null +++ b/installer/signplugin/tiny/f25519.h @@ -0,0 +1,92 @@ +/* Arithmetic mod p = 2^255-19 + * Daniel Beer , 8 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef F25519_H_ +#define F25519_H_ + +#include +#include + +/* Field elements are represented as little-endian byte strings. All + * operations have timings which are independent of input data, so they + * can be safely used for cryptography. + * + * Computation is performed on un-normalized elements. These are byte + * strings which fall into the range 0 <= x < 2p. Use f25519_normalize() + * to convert to a value 0 <= x < p. + * + * Elements received from the outside may greater even than 2p. + * f25519_normalize() will correctly deal with these numbers too. + */ +#define F25519_SIZE 32 + +/* Identity constants */ +extern const uint8_t f25519_zero[F25519_SIZE]; +extern const uint8_t f25519_one[F25519_SIZE]; + +/* Load a small constant */ +void f25519_load(uint8_t *x, uint32_t c); + +/* Copy two points */ +static inline void f25519_copy(uint8_t *x, const uint8_t *a) +{ + memcpy(x, a, F25519_SIZE); +} + +/* Normalize a field point x < 2*p by subtracting p if necessary */ +void f25519_normalize(uint8_t *x); + +/* Compare two field points in constant time. Return one if equal, zero + * otherwise. This should be performed only on normalized values. + */ +uint8_t f25519_eq(const uint8_t *x, const uint8_t *y); + +/* Conditional copy. If condition == 0, then zero is copied to dst. If + * condition == 1, then one is copied to dst. Any other value results in + * undefined behaviour. 
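+ * The selection is branch-free (an XOR mask derived from condition),
+ * so it may safely be driven by secret data.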
+ */ +void f25519_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition); + +/* Add/subtract two field points. The three pointers are not required to + * be distinct. + */ +void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b); +void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b); + +/* Unary negation */ +void f25519_neg(uint8_t *r, const uint8_t *a); + +/* Multiply two field points. The __distinct variant is used when r is + * known to be in a different location to a and b. + */ +void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b); +void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b); + +/* Multiply a point by a small constant. The two pointers are not + * required to be distinct. + * + * The constant must be less than 2^24. + */ +void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b); + +/* Take the reciprocal of a field point. The __distinct variant is used + * when r is known to be in a different location to x. + */ +void f25519_inv(uint8_t *r, const uint8_t *x); +void f25519_inv__distinct(uint8_t *r, const uint8_t *x); + +/* Compute one of the square roots of the field element, if the element + * is square. The other square is -r. + * + * If the input is not square, the returned value is a valid field + * element, but not the correct answer. If you don't already know that + * your element is square, you should square the return value and test. + */ +void f25519_sqrt(uint8_t *r, const uint8_t *x); + +#endif diff --git a/installer/signplugin/tiny/fprime.c b/installer/signplugin/tiny/fprime.c new file mode 100644 index 0000000..25f2197 --- /dev/null +++ b/installer/signplugin/tiny/fprime.c @@ -0,0 +1,215 @@ +/* Arithmetic in prime fields + * Daniel Beer , 10 Jan 2014 + * + * This file is in the public domain. + */ + +#include "fprime.h" + +const uint8_t fprime_zero[FPRIME_SIZE] = {0}; +const uint8_t fprime_one[FPRIME_SIZE] = {1}; + +static void raw_add(uint8_t *x, const uint8_t *p) +{ + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c += ((uint16_t)x[i]) + ((uint16_t)p[i]); + x[i] = c; + c >>= 8; + } +} + +static void raw_try_sub(uint8_t *x, const uint8_t *p) +{ + uint8_t minusp[FPRIME_SIZE]; + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c = ((uint16_t)x[i]) - ((uint16_t)p[i]) - c; + minusp[i] = c; + c = (c >> 8) & 1; + } + + fprime_select(x, minusp, x, c); +} + +/* Warning: this function is variable-time */ +static int prime_msb(const uint8_t *p) +{ + int i; + uint8_t x; + + for (i = FPRIME_SIZE - 1; i >= 0; i--) + if (p[i]) + break; + + x = p[i]; + i <<= 3; + + while (x) { + x >>= 1; + i++; + } + + return i - 1; +} + +/* Warning: this function may be variable-time in the argument n */ +static void shift_n_bits(uint8_t *x, int n) +{ + uint16_t c = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) { + c |= ((uint16_t)x[i]) << n; + x[i] = c; + c >>= 8; + } +} + +void fprime_load(uint8_t *x, uint32_t c) +{ + unsigned int i; + + for (i = 0; i < sizeof(c); i++) { + x[i] = c; + c >>= 8; + } + + for (; i < FPRIME_SIZE; i++) + x[i] = 0; +} + +static inline int min_int(int a, int b) +{ + return a < b ? 
a : b; +} + +void fprime_from_bytes(uint8_t *n, + const uint8_t *x, size_t len, + const uint8_t *modulus) +{ + const int preload_total = min_int(prime_msb(modulus) - 1, len << 3); + const int preload_bytes = preload_total >> 3; + const int preload_bits = preload_total & 7; + const int rbits = (len << 3) - preload_total; + int i; + + memset(n, 0, FPRIME_SIZE); + + for (i = 0; i < preload_bytes; i++) + n[i] = x[len - preload_bytes + i]; + + if (preload_bits) { + shift_n_bits(n, preload_bits); + n[0] |= x[len - preload_bytes - 1] >> (8 - preload_bits); + } + + for (i = rbits - 1; i >= 0; i--) { + const uint8_t bit = (x[i >> 3] >> (i & 7)) & 1; + + shift_n_bits(n, 1); + n[0] |= bit; + raw_try_sub(n, modulus); + } +} + +void fprime_normalize(uint8_t *x, const uint8_t *modulus) +{ + uint8_t n[FPRIME_SIZE]; + + fprime_from_bytes(n, x, FPRIME_SIZE, modulus); + fprime_copy(x, n); +} + +uint8_t fprime_eq(const uint8_t *x, const uint8_t *y) +{ + uint8_t sum = 0; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) + sum |= x[i] ^ y[i]; + + sum |= (sum >> 4); + sum |= (sum >> 2); + sum |= (sum >> 1); + + return (sum ^ 1) & 1; +} + +void fprime_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition) +{ + const uint8_t mask = -condition; + int i; + + for (i = 0; i < FPRIME_SIZE; i++) + dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i])); +} + +void fprime_add(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + raw_add(r, a); + raw_try_sub(r, modulus); +} + +void fprime_sub(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + raw_add(r, modulus); + raw_try_sub(r, a); + raw_try_sub(r, modulus); +} + +void fprime_mul(uint8_t *r, const uint8_t *a, const uint8_t *b, + const uint8_t *modulus) +{ + int i; + + memset(r, 0, FPRIME_SIZE); + + for (i = prime_msb(modulus); i >= 0; i--) { + const uint8_t bit = (b[i >> 3] >> (i & 7)) & 1; + uint8_t plusa[FPRIME_SIZE]; + + shift_n_bits(r, 1); + raw_try_sub(r, modulus); + + fprime_copy(plusa, r); + fprime_add(plusa, a, modulus); + + fprime_select(r, r, plusa, bit); + } +} + +void fprime_inv(uint8_t *r, const uint8_t *a, const uint8_t *modulus) +{ + uint8_t pm2[FPRIME_SIZE]; + uint16_t c = 2; + int i; + + /* Compute (p-2) */ + fprime_copy(pm2, modulus); + for (i = 0; i < FPRIME_SIZE; i++) { + c = modulus[i] - c; + pm2[i] = c; + c >>= 8; + } + + /* Binary exponentiation */ + fprime_load(r, 1); + + for (i = prime_msb(modulus); i >= 0; i--) { + uint8_t r2[FPRIME_SIZE]; + + fprime_mul(r2, r, r, modulus); + + if ((pm2[i >> 3] >> (i & 7)) & 1) + fprime_mul(r, r2, a, modulus); + else + fprime_copy(r, r2); + } +} diff --git a/installer/signplugin/tiny/fprime.h b/installer/signplugin/tiny/fprime.h new file mode 100644 index 0000000..4a5486c --- /dev/null +++ b/installer/signplugin/tiny/fprime.h @@ -0,0 +1,70 @@ +/* Arithmetic in prime fields + * Daniel Beer , 10 Jan 2014 + * + * This file is in the public domain. + */ + +#ifndef FPRIME_H_ +#define FPRIME_H_ + +#include +#include + +/* Maximum size of a field element (or a prime). Field elements are + * always manipulated and stored in normalized form, with 0 <= x < p. + * You can use normalize() to convert a denormalized bitstring to normal + * form. + * + * Operations are constant with respect to the value of field elements, + * but not with respect to the modulus. + * + * The modulus is a number p, such that 2p-1 fits in FPRIME_SIZE bytes. 
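+ *
+ * In this package the modulus of interest is the Ed25519 group order
+ * l = 2^252 + 27742317777372353535851937790883648493 (ed25519_order in
+ * edsign.c).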
+ */ +#define FPRIME_SIZE 32 + +/* Useful constants */ +extern const uint8_t fprime_zero[FPRIME_SIZE]; +extern const uint8_t fprime_one[FPRIME_SIZE]; + +/* Load a small constant */ +void fprime_load(uint8_t *x, uint32_t c); + +/* Load a large constant */ +void fprime_from_bytes(uint8_t *x, + const uint8_t *in, size_t len, + const uint8_t *modulus); + +/* Copy an element */ +static inline void fprime_copy(uint8_t *x, const uint8_t *a) +{ + memcpy(x, a, FPRIME_SIZE); +} + +/* Normalize a field element */ +void fprime_normalize(uint8_t *x, const uint8_t *modulus); + +/* Compare two field points in constant time. Return one if equal, zero + * otherwise. This should be performed only on normalized values. + */ +uint8_t fprime_eq(const uint8_t *x, const uint8_t *y); + +/* Conditional copy. If condition == 0, then zero is copied to dst. If + * condition == 1, then one is copied to dst. Any other value results in + * undefined behaviour. + */ +void fprime_select(uint8_t *dst, + const uint8_t *zero, const uint8_t *one, + uint8_t condition); + +/* Add one value to another. The two pointers must be distinct. */ +void fprime_add(uint8_t *r, const uint8_t *a, const uint8_t *modulus); +void fprime_sub(uint8_t *r, const uint8_t *a, const uint8_t *modulus); + +/* Multiply two values to get a third. r must be distinct from a and b */ +void fprime_mul(uint8_t *r, const uint8_t *a, const uint8_t *b, + const uint8_t *modulus); + +/* Compute multiplicative inverse. r must be distinct from a */ +void fprime_inv(uint8_t *r, const uint8_t *a, const uint8_t *modulus); + +#endif diff --git a/installer/signplugin/tiny/morph25519.c b/installer/signplugin/tiny/morph25519.c new file mode 100644 index 0000000..3d64022 --- /dev/null +++ b/installer/signplugin/tiny/morph25519.c @@ -0,0 +1,87 @@ +/* Montgomery <-> Edwards isomorphism + * Daniel Beer , 18 Jan 2014 + * + * This file is in the public domain. 
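+ *
+ * Background: the conversions below are the standard birational maps
+ * between the Montgomery curve used by X25519 and the twisted Edwards
+ * curve used by Ed25519:
+ *
+ *   mx = (1 + ey) / (1 - ey)     (Edwards to Montgomery)
+ *   ey = (mx - 1) / (mx + 1)     (Montgomery to Edwards)
+ *
+ * The Edwards x coordinate is determined by ey only up to sign, which
+ * is why a parity bit is carried alongside mx to pick the right root.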
+ */
+
+#include "morph25519.h"
+#include "f25519.h"
+
+void morph25519_e2m(uint8_t *montgomery, const uint8_t *y)
+{
+        uint8_t yplus[F25519_SIZE];
+        uint8_t yminus[F25519_SIZE];
+
+        f25519_sub(yplus, f25519_one, y);
+        f25519_inv__distinct(yminus, yplus);
+        f25519_add(yplus, f25519_one, y);
+        f25519_mul__distinct(montgomery, yplus, yminus);
+        f25519_normalize(montgomery);
+}
+
+static void mx2ey(uint8_t *ey, const uint8_t *mx)
+{
+        uint8_t n[F25519_SIZE];
+        uint8_t d[F25519_SIZE];
+
+        f25519_add(n, mx, f25519_one);
+        f25519_inv__distinct(d, n);
+        f25519_sub(n, mx, f25519_one);
+        f25519_mul__distinct(ey, n, d);
+}
+
+static uint8_t ey2ex(uint8_t *x, const uint8_t *y, int parity)
+{
+        static const uint8_t d[F25519_SIZE] = {
+                0xa3, 0x78, 0x59, 0x13, 0xca, 0x4d, 0xeb, 0x75,
+                0xab, 0xd8, 0x41, 0x41, 0x4d, 0x0a, 0x70, 0x00,
+                0x98, 0xe8, 0x79, 0x77, 0x79, 0x40, 0xc7, 0x8c,
+                0x73, 0xfe, 0x6f, 0x2b, 0xee, 0x6c, 0x03, 0x52
+        };
+
+        uint8_t a[F25519_SIZE];
+        uint8_t b[F25519_SIZE];
+        uint8_t c[F25519_SIZE];
+
+        /* Compute c = y^2 */
+        f25519_mul__distinct(c, y, y);
+
+        /* Compute b = (1+dy^2)^-1 */
+        f25519_mul__distinct(b, c, d);
+        f25519_add(a, b, f25519_one);
+        f25519_inv__distinct(b, a);
+
+        /* Compute a = y^2-1 */
+        f25519_sub(a, c, f25519_one);
+
+        /* Compute c = a*b = (y^2-1)/(1+dy^2) */
+        f25519_mul__distinct(c, a, b);
+
+        /* Compute a, b = +/-sqrt(c), if c is square */
+        f25519_sqrt(a, c);
+        f25519_neg(b, a);
+
+        /* Select one of them, based on the parity bit */
+        f25519_select(x, a, b, (a[0] ^ parity) & 1);
+
+        /* Verify that x^2 = c */
+        f25519_mul__distinct(a, x, x);
+        f25519_normalize(a);
+        f25519_normalize(c);
+
+        return f25519_eq(a, c);
+}
+
+uint8_t morph25519_m2e(uint8_t *ex, uint8_t *ey,
+                       const uint8_t *mx, int parity)
+{
+        uint8_t ok;
+
+        mx2ey(ey, mx);
+        ok = ey2ex(ex, ey, parity);
+
+        f25519_normalize(ex);
+        f25519_normalize(ey);
+
+        return ok;
+}
diff --git a/installer/signplugin/tiny/morph25519.h b/installer/signplugin/tiny/morph25519.h
new file mode 100644
index 0000000..ead91f4
--- /dev/null
+++ b/installer/signplugin/tiny/morph25519.h
@@ -0,0 +1,29 @@
+/* Montgomery <-> Edwards isomorphism
+ * Daniel Beer , 18 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef MORPH25519_H_
+#define MORPH25519_H_
+
+#include <stdint.h>
+
+/* Convert an Edwards Y to a Montgomery X (Edwards X is not used).
+ * Resulting coordinate is normalized.
+ */
+void morph25519_e2m(uint8_t *montgomery_x, const uint8_t *edwards_y);
+
+/* Return a parity bit for the Edwards X coordinate */
+static inline int morph25519_eparity(const uint8_t *edwards_x)
+{
+        return edwards_x[0] & 1;
+}
+
+/* Convert a Montgomery X and a parity bit to an Edwards X/Y. Returns
+ * non-zero if successful.
+ */
+uint8_t morph25519_m2e(uint8_t *ex, uint8_t *ey,
+                       const uint8_t *mx, int parity);
+
+#endif
diff --git a/installer/signplugin/tiny/sha512.c b/installer/signplugin/tiny/sha512.c
new file mode 100644
index 0000000..d90d22d
--- /dev/null
+++ b/installer/signplugin/tiny/sha512.c
@@ -0,0 +1,228 @@
+/* SHA512
+ * Daniel Beer , 22 Apr 2014
+ *
+ * This file is in the public domain.
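+ *
+ * A hedged usage sketch (msg/len are the caller's buffer, not part of
+ * this API): feed whole 128-byte blocks, then the tail together with
+ * the total length:
+ *
+ *   struct sha512_state s;
+ *   uint8_t hash[SHA512_HASH_SIZE];
+ *   size_t i = 0;
+ *
+ *   sha512_init(&s);
+ *   while (len - i >= SHA512_BLOCK_SIZE) {
+ *           sha512_block(&s, msg + i);
+ *           i += SHA512_BLOCK_SIZE;
+ *   }
+ *   sha512_final(&s, msg + i, len);
+ *   sha512_get(&s, hash, 0, SHA512_HASH_SIZE);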
+ */ + +#include "sha512.h" + +const struct sha512_state sha512_initial_state = { { + 0x6a09e667f3bcc908LL, 0xbb67ae8584caa73bLL, + 0x3c6ef372fe94f82bLL, 0xa54ff53a5f1d36f1LL, + 0x510e527fade682d1LL, 0x9b05688c2b3e6c1fLL, + 0x1f83d9abfb41bd6bLL, 0x5be0cd19137e2179LL, +} }; + +static const uint64_t round_k[80] = { + 0x428a2f98d728ae22LL, 0x7137449123ef65cdLL, + 0xb5c0fbcfec4d3b2fLL, 0xe9b5dba58189dbbcLL, + 0x3956c25bf348b538LL, 0x59f111f1b605d019LL, + 0x923f82a4af194f9bLL, 0xab1c5ed5da6d8118LL, + 0xd807aa98a3030242LL, 0x12835b0145706fbeLL, + 0x243185be4ee4b28cLL, 0x550c7dc3d5ffb4e2LL, + 0x72be5d74f27b896fLL, 0x80deb1fe3b1696b1LL, + 0x9bdc06a725c71235LL, 0xc19bf174cf692694LL, + 0xe49b69c19ef14ad2LL, 0xefbe4786384f25e3LL, + 0x0fc19dc68b8cd5b5LL, 0x240ca1cc77ac9c65LL, + 0x2de92c6f592b0275LL, 0x4a7484aa6ea6e483LL, + 0x5cb0a9dcbd41fbd4LL, 0x76f988da831153b5LL, + 0x983e5152ee66dfabLL, 0xa831c66d2db43210LL, + 0xb00327c898fb213fLL, 0xbf597fc7beef0ee4LL, + 0xc6e00bf33da88fc2LL, 0xd5a79147930aa725LL, + 0x06ca6351e003826fLL, 0x142929670a0e6e70LL, + 0x27b70a8546d22ffcLL, 0x2e1b21385c26c926LL, + 0x4d2c6dfc5ac42aedLL, 0x53380d139d95b3dfLL, + 0x650a73548baf63deLL, 0x766a0abb3c77b2a8LL, + 0x81c2c92e47edaee6LL, 0x92722c851482353bLL, + 0xa2bfe8a14cf10364LL, 0xa81a664bbc423001LL, + 0xc24b8b70d0f89791LL, 0xc76c51a30654be30LL, + 0xd192e819d6ef5218LL, 0xd69906245565a910LL, + 0xf40e35855771202aLL, 0x106aa07032bbd1b8LL, + 0x19a4c116b8d2d0c8LL, 0x1e376c085141ab53LL, + 0x2748774cdf8eeb99LL, 0x34b0bcb5e19b48a8LL, + 0x391c0cb3c5c95a63LL, 0x4ed8aa4ae3418acbLL, + 0x5b9cca4f7763e373LL, 0x682e6ff3d6b2b8a3LL, + 0x748f82ee5defb2fcLL, 0x78a5636f43172f60LL, + 0x84c87814a1f0ab72LL, 0x8cc702081a6439ecLL, + 0x90befffa23631e28LL, 0xa4506cebde82bde9LL, + 0xbef9a3f7b2c67915LL, 0xc67178f2e372532bLL, + 0xca273eceea26619cLL, 0xd186b8c721c0c207LL, + 0xeada7dd6cde0eb1eLL, 0xf57d4f7fee6ed178LL, + 0x06f067aa72176fbaLL, 0x0a637dc5a2c898a6LL, + 0x113f9804bef90daeLL, 0x1b710b35131c471bLL, + 0x28db77f523047d84LL, 0x32caab7b40c72493LL, + 0x3c9ebe0a15c9bebcLL, 0x431d67c49c100d4cLL, + 0x4cc5d4becb3e42b6LL, 0x597f299cfc657e2aLL, + 0x5fcb6fab3ad6faecLL, 0x6c44198c4a475817LL, +}; + +static inline uint64_t load64(const uint8_t *x) +{ + uint64_t r; + + r = *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + r = (r << 8) | *(x++); + + return r; +} + +static inline void store64(uint8_t *x, uint64_t v) +{ + x += 7; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; + v >>= 8; + *(x--) = v; +} + +static inline uint64_t rot64(uint64_t x, int bits) +{ + return (x >> bits) | (x << (64 - bits)); +} + +void sha512_block(struct sha512_state *s, const uint8_t *blk) +{ + uint64_t w[16]; + uint64_t a, b, c, d, e, f, g, h; + int i; + + for (i = 0; i < 16; i++) { + w[i] = load64(blk); + blk += 8; + } + + /* Load state */ + a = s->h[0]; + b = s->h[1]; + c = s->h[2]; + d = s->h[3]; + e = s->h[4]; + f = s->h[5]; + g = s->h[6]; + h = s->h[7]; + + for (i = 0; i < 80; i++) { + /* Compute value of w[i + 16]. 
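(The 16-entry w[] array is a rolling window over the 80-entry message schedule; "wrap(i)" here means i & 15.)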
w[wrap(i)] is currently w[i] */
+                const uint64_t wi = w[i & 15];
+                const uint64_t wi15 = w[(i + 1) & 15];
+                const uint64_t wi2 = w[(i + 14) & 15];
+                const uint64_t wi7 = w[(i + 9) & 15];
+                const uint64_t s0 =
+                        rot64(wi15, 1) ^ rot64(wi15, 8) ^ (wi15 >> 7);
+                const uint64_t s1 =
+                        rot64(wi2, 19) ^ rot64(wi2, 61) ^ (wi2 >> 6);
+
+                /* Round calculations */
+                const uint64_t S0 = rot64(a, 28) ^ rot64(a, 34) ^ rot64(a, 39);
+                const uint64_t S1 = rot64(e, 14) ^ rot64(e, 18) ^ rot64(e, 41);
+                const uint64_t ch = (e & f) ^ ((~e) & g);
+                const uint64_t temp1 = h + S1 + ch + round_k[i] + wi;
+                const uint64_t maj = (a & b) ^ (a & c) ^ (b & c);
+                const uint64_t temp2 = S0 + maj;
+
+                /* Update round state */
+                h = g;
+                g = f;
+                f = e;
+                e = d + temp1;
+                d = c;
+                c = b;
+                b = a;
+                a = temp1 + temp2;
+
+                /* w[wrap(i)] becomes w[i + 16] */
+                w[i & 15] = wi + s0 + wi7 + s1;
+        }
+
+        /* Store state */
+        s->h[0] += a;
+        s->h[1] += b;
+        s->h[2] += c;
+        s->h[3] += d;
+        s->h[4] += e;
+        s->h[5] += f;
+        s->h[6] += g;
+        s->h[7] += h;
+}
+
+void sha512_final(struct sha512_state *s, const uint8_t *blk,
+                  size_t total_size)
+{
+        uint8_t temp[SHA512_BLOCK_SIZE] = {0};
+        const size_t last_size = total_size & (SHA512_BLOCK_SIZE - 1);
+
+        if (last_size)
+                memcpy(temp, blk, last_size);
+        temp[last_size] = 0x80;
+
+        if (last_size > 111) {
+                sha512_block(s, temp);
+                memset(temp, 0, sizeof(temp));
+        }
+
+        /* Note: we assume total_size fits in 61 bits */
+        store64(temp + SHA512_BLOCK_SIZE - 8, total_size << 3);
+        sha512_block(s, temp);
+}
+
+void sha512_get(const struct sha512_state *s, uint8_t *hash,
+                unsigned int offset, unsigned int len)
+{
+        int i;
+
+        if (offset > SHA512_HASH_SIZE)
+                return;
+
+        if (len > SHA512_HASH_SIZE - offset)
+                len = SHA512_HASH_SIZE - offset;
+
+        /* Skip whole words */
+        i = offset >> 3;
+        offset &= 7;
+
+        /* Skip/read out bytes */
+        if (offset) {
+                uint8_t tmp[8];
+                unsigned int c = 8 - offset;
+
+                if (c > len)
+                        c = len;
+
+                store64(tmp, s->h[i++]);
+                memcpy(hash, tmp + offset, c);
+                len -= c;
+                hash += c;
+        }
+
+        /* Read out whole words */
+        while (len >= 8) {
+                store64(hash, s->h[i++]);
+                hash += 8;
+                len -= 8;
+        }
+
+        /* Read out bytes */
+        if (len) {
+                uint8_t tmp[8];
+
+                store64(tmp, s->h[i]);
+                memcpy(hash, tmp, len);
+        }
+}
diff --git a/installer/signplugin/tiny/sha512.h b/installer/signplugin/tiny/sha512.h
new file mode 100644
index 0000000..1391745
--- /dev/null
+++ b/installer/signplugin/tiny/sha512.h
@@ -0,0 +1,52 @@
+/* SHA512
+ * Daniel Beer , 22 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef SHA512_H_
+#define SHA512_H_
+
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+
+/* SHA512 state. State is updated as data is fed in, and then the final
+ * hash can be read out in slices.
+ *
+ * Data is fed in as a sequence of full blocks terminated by a single
+ * partial block.
+ */
+struct sha512_state {
+        uint64_t h[8];
+};
+
+/* Initial state */
+extern const struct sha512_state sha512_initial_state;
+
+/* Set up a new context */
+static inline void sha512_init(struct sha512_state *s)
+{
+        memcpy(s, &sha512_initial_state, sizeof(*s));
+}
+
+/* Feed a full block in */
+#define SHA512_BLOCK_SIZE 128
+
+void sha512_block(struct sha512_state *s, const uint8_t *blk);
+
+/* Feed the last partial block in. The total stream size must be
+ * specified. The size of the block given is assumed to be (total_size %
+ * SHA512_BLOCK_SIZE). This might be zero, but you still need to call
+ * this function to terminate the stream.
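+ *
+ * For example, hashing a 200-byte message is one sha512_block() call
+ * for bytes 0..127, followed by sha512_final() with the remaining
+ * 72 bytes (200 % SHA512_BLOCK_SIZE) and total_size = 200.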
+ */ +void sha512_final(struct sha512_state *s, const uint8_t *blk, + size_t total_size); + +/* Fetch a slice of the hash result. */ +#define SHA512_HASH_SIZE 64 + +void sha512_get(const struct sha512_state *s, uint8_t *hash, + unsigned int offset, unsigned int len); + +#endif diff --git a/installer/signplugin/win32_crt_float.cpp b/installer/signplugin/win32_crt_float.cpp new file mode 100644 index 0000000..172fe7e --- /dev/null +++ b/installer/signplugin/win32_crt_float.cpp @@ -0,0 +1,95 @@ +extern "C" +{ + int _fltused; + +#ifdef _M_IX86 // following functions are needed only for 32-bit architecture + + __declspec(naked) void _ftol2() + { + __asm + { + fistp qword ptr [esp-8] + mov edx,[esp-4] + mov eax,[esp-8] + ret + } + } + + __declspec(naked) void _ftol2_sse() + { + __asm + { + fistp dword ptr [esp-4] + mov eax,[esp-4] + ret + } + } + +#if 0 // these functions are needed for SSE code for 32-bit arch, TODO: implement them + __declspec(naked) void _dtol3() + { + __asm + { + } + } + + + __declspec(naked) void _dtoui3() + { + __asm + { + } + } + + + __declspec(naked) void _dtoul3() + { + __asm + { + } + } + + + __declspec(naked) void _ftol3() + { + __asm + { + } + } + + + __declspec(naked) void _ftoui3() + { + __asm + { + } + } + + + __declspec(naked) void _ftoul3() + { + __asm + { + } + } + + + __declspec(naked) void _ltod3() + { + __asm + { + } + } + + + __declspec(naked) void _ultod3() + { + __asm + { + } + } +#endif + +#endif + +} \ No newline at end of file diff --git a/installer/signplugin/win32_crt_math.cpp b/installer/signplugin/win32_crt_math.cpp new file mode 100644 index 0000000..de61c7f --- /dev/null +++ b/installer/signplugin/win32_crt_math.cpp @@ -0,0 +1,947 @@ +#ifdef _M_IX86 // use this file only for 32-bit architecture + +#define CRT_LOWORD(x) dword ptr [x+0] +#define CRT_HIWORD(x) dword ptr [x+4] + +extern "C" +{ + __declspec(naked) void _alldiv() + { + #define DVND esp + 16 // stack address of dividend (a) + #define DVSR esp + 24 // stack address of divisor (b) + + __asm + { + push edi + push esi + push ebx + +; Determine sign of the result (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. + + xor edi,edi ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + inc edi ; complement the result sign flag + mov edx,CRT_LOWORD(DVSR) ; lo word of a + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. 
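+; (4194304K = 4194304 * 1024 = 2^32, so this test really asks "does the
+; divisor fit in 32 bits"; if it does, two chained 32-bit divides give
+; the 64-bit quotient directly.)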
+; +; NOTE - eax currently contains the high order word of DVSR +; + + or eax,eax ; check to see if divisor < 4194304K + jnz short L3 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; eax <- high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; eax <- low order bits of quotient + mov edx,ebx ; edx:eax <- quotient + jmp short L4 ; set sign, restore stack and return + +; +; Here we do it the hard way. Remember, eax contains the high word of DVSR +; + +L3: + mov ebx,eax ; ebx:ecx <- divisor + mov ecx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L5: + shr ebx,1 ; shift divisor right one bit + rcr ecx,1 + shr edx,1 ; shift dividend right one bit + rcr eax,1 + or ebx,ebx + jnz short L5 ; loop until divisor < 4194304K + div ecx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L6 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L6 ; if result > original, do subtract + jb short L7 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L7 ; if less or equal we are ok, else subtract +L6: + dec esi ; subtract 1 from quotient +L7: + xor edx,edx ; edx:eax <- quotient + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. Set the sign +; according to the save value, cleanup the stack, and return. +; + +L4: + dec edi ; check to see if result is negative + jnz short L8 ; if EDI == 0, result should be negative + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Restore the saved registers and return. +; + +L8: + pop ebx + pop esi + pop edi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _alldvrm() + { + #define DVND esp + 16 // stack address of dividend (a) + #define DVSR esp + 24 // stack address of divisor (b) + + __asm + { + push edi + push esi + push ebp + +; Determine sign of the quotient (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. +; Sign of the remainder is kept in ebp. 
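+;
+; (_alldvrm returns two results at once: the quotient in edx:eax and the
+; remainder in ebx:ecx; see the register shuffle at label L9 below.)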
+ + xor edi,edi ; result sign assumed positive + xor ebp,ebp ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag + inc ebp ; complement result sign flag + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + inc edi ; complement the result sign flag + mov edx,CRT_LOWORD(DVSR) ; lo word of a + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; +; NOTE - eax currently contains the high order word of DVSR +; + + or eax,eax ; check to see if divisor < 4194304K + jnz short L3 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; eax <- high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; eax <- low order bits of quotient + mov esi,eax ; ebx:esi <- quotient +; +; Now we need to do a multiply so that we can compute the remainder. +; + mov eax,ebx ; set up high word of quotient + mul CRT_LOWORD(DVSR) ; CRT_HIWORD(QUOT) * DVSR + mov ecx,eax ; save the result in ecx + mov eax,esi ; set up low word of quotient + mul CRT_LOWORD(DVSR) ; CRT_LOWORD(QUOT) * DVSR + add edx,ecx ; EDX:EAX = QUOT * DVSR + jmp short L4 ; complete remainder calculation + +; +; Here we do it the hard way. Remember, eax contains the high word of DVSR +; + +L3: + mov ebx,eax ; ebx:ecx <- divisor + mov ecx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L5: + shr ebx,1 ; shift divisor right one bit + rcr ecx,1 + shr edx,1 ; shift dividend right one bit + rcr eax,1 + or ebx,ebx + jnz short L5 ; loop until divisor < 4194304K + div ecx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L6 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. 
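+; (The pre-shifted divide computes floor((a >> n) / (b >> n)), which can
+; be too large by at most one and is never too small, so the single
+; decrement below is the only correction ever needed.)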
+; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L6 ; if result > original, do subtract + jb short L7 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L7 ; if less or equal we are ok, else subtract +L6: + dec esi ; subtract 1 from quotient + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L7: + xor ebx,ebx ; ebx:esi <- quotient + +L4: +; +; Calculate remainder by subtracting the result from the original dividend. +; Since the result is already in a register, we will do the subtract in the +; opposite direction and negate the result if necessary. +; + + sub eax,CRT_LOWORD(DVND) ; subtract dividend from result + sbb edx,CRT_HIWORD(DVND) + +; +; Now check the result sign flag to see if the result is supposed to be positive +; or negative. It is currently negated (because we subtracted in the 'wrong' +; direction), so if the sign flag is set we are done, otherwise we must negate +; the result to make it positive again. +; + + dec ebp ; check result sign flag + jns short L9 ; result is ok, set up the quotient + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx. +; +L9: + mov ecx,edx + mov edx,ebx + mov ebx,ecx + mov ecx,eax + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. Set the sign +; according to the save value, cleanup the stack, and return. +; + + dec edi ; check to see if result is negative + jnz short L8 ; if EDI == 0, result should be negative + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Restore the saved registers and return. +; + +L8: + pop ebp + pop esi + pop edi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _allmul() + { + #define A esp + 8 // stack address of a + #define B esp + 16 // stack address of b + + __asm + { + push ebx + + mov eax,CRT_HIWORD(A) + mov ecx,CRT_LOWORD(B) + mul ecx ;eax has AHI, ecx has BLO, so AHI * BLO + mov ebx,eax ;save result + + mov eax,CRT_LOWORD(A) + mul CRT_HIWORD(B) ;ALO * BHI + add ebx,eax ;ebx = ((ALO * BHI) + (AHI * BLO)) + + mov eax,CRT_LOWORD(A) ;ecx = BLO + mul ecx ;so edx:eax = ALO*BLO + add edx,ebx ;now edx has all the LO*HI stuff + + pop ebx + + ret 16 ; callee restores the stack + } + + #undef A + #undef B + } + + __declspec(naked) void _allrem() + { + #define DVND esp + 12 // stack address of dividend (a) + #define DVSR esp + 20 // stack address of divisor (b) + + __asm + { + push ebx + push edi + + +; Determine sign of the result (edi = 0 if result is positive, non-zero +; otherwise) and make operands positive. + + xor edi,edi ; result sign assumed positive + + mov eax,CRT_HIWORD(DVND) ; hi word of a + or eax,eax ; test to see if signed + jge short L1 ; skip rest if a is already positive + inc edi ; complement result sign flag bit + mov edx,CRT_LOWORD(DVND) ; lo word of a + neg eax ; make a positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVND),eax ; save positive value + mov CRT_LOWORD(DVND),edx +L1: + mov eax,CRT_HIWORD(DVSR) ; hi word of b + or eax,eax ; test to see if signed + jge short L2 ; skip rest if b is already positive + mov edx,CRT_LOWORD(DVSR) ; lo word of b + neg eax ; make b positive + neg edx + sbb eax,0 + mov CRT_HIWORD(DVSR),eax ; save positive value + mov CRT_LOWORD(DVSR),edx +L2: + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. 
; If so, then we can use a simple algorithm with word divides, otherwise
+; things get a little more complex.
+;
+; NOTE - eax currently contains the high order word of DVSR
+;
+
+        or      eax,eax         ; check to see if divisor < 4194304K
+        jnz     short L3        ; nope, gotta do this the hard way
+        mov     ecx,CRT_LOWORD(DVSR) ; load divisor
+        mov     eax,CRT_HIWORD(DVND) ; load high word of dividend
+        xor     edx,edx
+        div     ecx             ; edx <- remainder
+        mov     eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
+        div     ecx             ; edx <- final remainder
+        mov     eax,edx         ; edx:eax <- remainder
+        xor     edx,edx
+        dec     edi             ; check result sign flag
+        jns     short L4        ; negate result, restore stack and return
+        jmp     short L8        ; result sign ok, restore stack and return
+
+;
+; Here we do it the hard way. Remember, eax contains the high word of DVSR
+;
+
+L3:
+        mov     ebx,eax         ; ebx:ecx <- divisor
+        mov     ecx,CRT_LOWORD(DVSR)
+        mov     edx,CRT_HIWORD(DVND) ; edx:eax <- dividend
+        mov     eax,CRT_LOWORD(DVND)
+L5:
+        shr     ebx,1           ; shift divisor right one bit
+        rcr     ecx,1
+        shr     edx,1           ; shift dividend right one bit
+        rcr     eax,1
+        or      ebx,ebx
+        jnz     short L5        ; loop until divisor < 4194304K
+        div     ecx             ; now divide, ignore remainder
+
+;
+; We may be off by one, so to check, we will multiply the quotient
+; by the divisor and check the result against the original dividend
+; Note that we must also check for overflow, which can occur if the
+; dividend is close to 2**64 and the quotient is off by 1.
+;
+
+        mov     ecx,eax         ; save a copy of quotient in ECX
+        mul     CRT_HIWORD(DVSR)
+        xchg    ecx,eax         ; save product, get quotient in EAX
+        mul     CRT_LOWORD(DVSR)
+        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
+        jc      short L6        ; carry means Quotient is off by 1
+
+;
+; do long compare here between original dividend and the result of the
+; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
+; subtract the original divisor from the result.
+;
+
+        cmp     edx,CRT_HIWORD(DVND) ; compare hi words of result and original
+        ja      short L6        ; if result > original, do subtract
+        jb      short L7        ; if result < original, we are ok
+        cmp     eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words
+        jbe     short L7        ; if less or equal we are ok, else subtract
+L6:
+        sub     eax,CRT_LOWORD(DVSR) ; subtract divisor from result
+        sbb     edx,CRT_HIWORD(DVSR)
+L7:
+
+;
+; Calculate remainder by subtracting the result from the original dividend.
+; Since the result is already in a register, we will do the subtract in the
+; opposite direction and negate the result if necessary.
+;
+
+        sub     eax,CRT_LOWORD(DVND) ; subtract dividend from result
+        sbb     edx,CRT_HIWORD(DVND)
+
+;
+; Now check the result sign flag to see if the result is supposed to be positive
+; or negative. It is currently negated (because we subtracted in the 'wrong'
+; direction), so if the sign flag is set we are done, otherwise we must negate
+; the result to make it positive again.
+;
+
+        dec     edi             ; check result sign flag
+        jns     short L8        ; result is ok, restore stack and return
+L4:
+        neg     edx             ; otherwise, negate the result
+        neg     eax
+        sbb     edx,0
+
+;
+; Just the cleanup left to do. edx:eax contains the remainder.
+; Restore the saved registers and return.
+; + +L8: + pop edi + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _allshl() + { + __asm + { +; +; Handle shifts of 64 or more bits (all get 0) +; + cmp cl, 64 + jae short RETZERO + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shld edx,eax,cl + shl eax,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov edx,eax + xor eax,eax + and cl,31 + shl edx,cl + ret + +; +; return 0 in edx:eax +; +RETZERO: + xor eax,eax + xor edx,edx + ret + } + } + + __declspec(naked) void _allshr() + { + __asm + { +; +; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result +; depends only on the high order bit of edx). +; + cmp cl,64 + jae short RETSIGN + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shrd eax,edx,cl + sar edx,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov eax,edx + sar edx,31 + and cl,31 + sar eax,cl + ret + +; +; Return double precision 0 or -1, depending on the sign of edx +; +RETSIGN: + sar edx,31 + mov eax,edx + ret + } + } + + __declspec(naked) void _aulldiv() + { + #define DVND esp + 12 // stack address of dividend (a) + #define DVSR esp + 20 // stack address of divisor (b) + + __asm + { + push ebx + push esi + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; get high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; get low order bits of quotient + mov edx,ebx ; edx:eax <- quotient hi:quotient lo + jmp short L2 ; restore stack and return + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. 
+; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we are ok, else subtract +L4: + dec esi ; subtract 1 from quotient +L5: + xor edx,edx ; edx:eax <- quotient + mov eax,esi + +; +; Just the cleanup left to do. edx:eax contains the quotient. +; Restore the saved registers and return. +; + +L2: + + pop esi + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aulldvrm() + { + #define DVND esp + 8 // stack address of dividend (a) + #define DVSR esp + 16 // stack address of divisor (b) + + __asm + { + push esi + +; +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; get high order bits of quotient + mov ebx,eax ; save high bits of quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; get low order bits of quotient + mov esi,eax ; ebx:esi <- quotient + +; +; Now we need to do a multiply so that we can compute the remainder. +; + mov eax,ebx ; set up high word of quotient + mul CRT_LOWORD(DVSR) ; CRT_HIWORD(QUOT) * DVSR + mov ecx,eax ; save the result in ecx + mov eax,esi ; set up low word of quotient + mul CRT_LOWORD(DVSR) ; CRT_LOWORD(QUOT) * DVSR + add edx,ecx ; EDX:EAX = QUOT * DVSR + jmp short L2 ; complete remainder calculation + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + mov esi,eax ; save quotient + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mul CRT_HIWORD(DVSR) ; QUOT * CRT_HIWORD(DVSR) + mov ecx,eax + mov eax,CRT_LOWORD(DVSR) + mul esi ; QUOT * CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we are ok, otherwise +; subtract one (1) from the quotient. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we are ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we are ok, else subtract +L4: + dec esi ; subtract 1 from quotient + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L5: + xor ebx,ebx ; ebx:esi <- quotient + +L2: +; +; Calculate remainder by subtracting the result from the original dividend. 
+; Since the result is already in a register, we will do the subtract in the +; opposite direction and negate the result. +; + + sub eax,CRT_LOWORD(DVND) ; subtract dividend from result + sbb edx,CRT_HIWORD(DVND) + neg edx ; otherwise, negate the result + neg eax + sbb edx,0 + +; +; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx. +; + mov ecx,edx + mov edx,ebx + mov ebx,ecx + mov ecx,eax + mov eax,esi +; +; Just the cleanup left to do. edx:eax contains the quotient. +; Restore the saved registers and return. +; + + pop esi + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aullrem() + { + #define DVND esp + 8 // stack address of dividend (a) + #define DVSR esp + 16 // stack address of divisor (b) + + __asm + { + push ebx + +; Now do the divide. First look to see if the divisor is less than 4194304K. +; If so, then we can use a simple algorithm with word divides, otherwise +; things get a little more complex. +; + + mov eax,CRT_HIWORD(DVSR) ; check to see if divisor < 4194304K + or eax,eax + jnz short L1 ; nope, gotta do this the hard way + mov ecx,CRT_LOWORD(DVSR) ; load divisor + mov eax,CRT_HIWORD(DVND) ; load high word of dividend + xor edx,edx + div ecx ; edx <- remainder, eax <- quotient + mov eax,CRT_LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend + div ecx ; edx <- final remainder + mov eax,edx ; edx:eax <- remainder + xor edx,edx + jmp short L2 ; restore stack and return + +; +; Here we do it the hard way. Remember, eax contains DVSRHI +; + +L1: + mov ecx,eax ; ecx:ebx <- divisor + mov ebx,CRT_LOWORD(DVSR) + mov edx,CRT_HIWORD(DVND) ; edx:eax <- dividend + mov eax,CRT_LOWORD(DVND) +L3: + shr ecx,1 ; shift divisor right one bit; hi bit <- 0 + rcr ebx,1 + shr edx,1 ; shift dividend right one bit; hi bit <- 0 + rcr eax,1 + or ecx,ecx + jnz short L3 ; loop until divisor < 4194304K + div ebx ; now divide, ignore remainder + +; +; We may be off by one, so to check, we will multiply the quotient +; by the divisor and check the result against the orignal dividend +; Note that we must also check for overflow, which can occur if the +; dividend is close to 2**64 and the quotient is off by 1. +; + + mov ecx,eax ; save a copy of quotient in ECX + mul CRT_HIWORD(DVSR) + xchg ecx,eax ; put partial product in ECX, get quotient in EAX + mul CRT_LOWORD(DVSR) + add edx,ecx ; EDX:EAX = QUOT * DVSR + jc short L4 ; carry means Quotient is off by 1 + +; +; do long compare here between original dividend and the result of the +; multiply in edx:eax. If original is larger or equal, we're ok, otherwise +; subtract the original divisor from the result. +; + + cmp edx,CRT_HIWORD(DVND) ; compare hi words of result and original + ja short L4 ; if result > original, do subtract + jb short L5 ; if result < original, we're ok + cmp eax,CRT_LOWORD(DVND) ; hi words are equal, compare lo words + jbe short L5 ; if less or equal we're ok, else subtract +L4: + sub eax,CRT_LOWORD(DVSR) ; subtract divisor from result + sbb edx,CRT_HIWORD(DVSR) +L5: + +; +; Calculate remainder by subtracting the result from the original dividend. +; Since the result is already in a register, we will perform the subtract in +; the opposite direction and negate the result to make it positive. +; + + sub eax,CRT_LOWORD(DVND) ; subtract original dividend from result + sbb edx,CRT_HIWORD(DVND) + neg edx ; and negate it + neg eax + sbb edx,0 + +; +; Just the cleanup left to do. dx:ax contains the remainder. +; Restore the saved registers and return. 
+; + +L2: + + pop ebx + + ret 16 + } + + #undef DVND + #undef DVSR + } + + __declspec(naked) void _aullshr() + { + __asm + { + cmp cl,64 + jae short RETZERO + +; +; Handle shifts of between 0 and 31 bits +; + cmp cl, 32 + jae short MORE32 + shrd eax,edx,cl + shr edx,cl + ret + +; +; Handle shifts of between 32 and 63 bits +; +MORE32: + mov eax,edx + xor edx,edx + and cl,31 + shr eax,cl + ret + +; +; return 0 in edx:eax +; +RETZERO: + xor eax,eax + xor edx,edx + ret + } + } +} + +#undef CRT_LOWORD +#undef CRT_HIWORD + +#endif diff --git a/installer/signplugin/win32_crt_memory.cpp b/installer/signplugin/win32_crt_memory.cpp new file mode 100644 index 0000000..b6bd6b6 --- /dev/null +++ b/installer/signplugin/win32_crt_memory.cpp @@ -0,0 +1,26 @@ +#include +extern "C" +{ + #pragma function(memset) + void *memset(void *dest, int c, size_t count) + { + char *bytes = (char *)dest; + while (count--) + { + *bytes++ = (char)c; + } + return dest; + } + + #pragma function(memcpy) + void *memcpy(void *dest, const void *src, size_t count) + { + char *dest8 = (char *)dest; + const char *src8 = (const char *)src; + while (count--) + { + *dest8++ = *src8++; + } + return dest; + } +} diff --git a/installer/signplugin/win32_crt_seh.cpp b/installer/signplugin/win32_crt_seh.cpp new file mode 100644 index 0000000..51feb8e --- /dev/null +++ b/installer/signplugin/win32_crt_seh.cpp @@ -0,0 +1,99 @@ +extern "C" +{ +#if _M_IX86 + +EXCEPTION_DISPOSITION +_except_handler3( + struct _EXCEPTION_RECORD* ExceptionRecord, + void* EstablisherFrame, + struct _CONTEXT* ContextRecord, + void* DispatcherContext) +{ + typedef EXCEPTION_DISPOSITION Function(struct _EXCEPTION_RECORD*, void*, struct _CONTEXT*, void*); + static Function* FunctionPtr; + + if (!FunctionPtr) + { + HMODULE Library = LoadLibraryA("msvcrt.dll"); + FunctionPtr = (Function*)GetProcAddress(Library, "_except_handler3"); + } + + return FunctionPtr(ExceptionRecord, EstablisherFrame, ContextRecord, DispatcherContext); +} + +UINT_PTR __security_cookie = 0xBB40E64E; + +extern PVOID __safe_se_handler_table[]; +extern BYTE __safe_se_handler_count; + +typedef struct { + DWORD Size; + DWORD TimeDateStamp; + WORD MajorVersion; + WORD MinorVersion; + DWORD GlobalFlagsClear; + DWORD GlobalFlagsSet; + DWORD CriticalSectionDefaultTimeout; + DWORD DeCommitFreeBlockThreshold; + DWORD DeCommitTotalFreeThreshold; + DWORD LockPrefixTable; + DWORD MaximumAllocationSize; + DWORD VirtualMemoryThreshold; + DWORD ProcessHeapFlags; + DWORD ProcessAffinityMask; + WORD CSDVersion; + WORD Reserved1; + DWORD EditList; + PUINT_PTR SecurityCookie; + PVOID *SEHandlerTable; + DWORD SEHandlerCount; +} IMAGE_LOAD_CONFIG_DIRECTORY32_2; + +const +IMAGE_LOAD_CONFIG_DIRECTORY32_2 _load_config_used = { + sizeof(IMAGE_LOAD_CONFIG_DIRECTORY32_2), + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + &__security_cookie, + __safe_se_handler_table, + (DWORD)(DWORD_PTR) &__safe_se_handler_count +}; + +#elif _M_AMD64 + +EXCEPTION_DISPOSITION +__C_specific_handler( + struct _EXCEPTION_RECORD* ExceptionRecord, + void* EstablisherFrame, + struct _CONTEXT* ContextRecord, + struct _DISPATCHER_CONTEXT* DispatcherContext) +{ + typedef EXCEPTION_DISPOSITION Function(struct _EXCEPTION_RECORD*, void*, struct _CONTEXT*, _DISPATCHER_CONTEXT*); + static Function* FunctionPtr; + + if (!FunctionPtr) + { + HMODULE Library = LoadLibraryA("msvcrt.dll"); + FunctionPtr = (Function*)GetProcAddress(Library, "__C_specific_handler"); + } + + return FunctionPtr(ExceptionRecord, EstablisherFrame, 
ContextRecord, DispatcherContext); +} + +#endif + +} diff --git a/installer/tap/.gitignore b/installer/tap/.gitignore new file mode 100644 index 0000000..fee6563 --- /dev/null +++ b/installer/tap/.gitignore @@ -0,0 +1,2 @@ +/TunSafe-TAP-auto.exe.sig +/TunSafe-TAP-auto.exe \ No newline at end of file diff --git a/installer/tap/COPYING b/installer/tap/COPYING new file mode 100644 index 0000000..d8f3e59 --- /dev/null +++ b/installer/tap/COPYING @@ -0,0 +1,365 @@ +You can find and download the source code for this +TunSafe-TAP Network Adapter at: https://tunsafe.com/open-source + +The source and object code of the tap-windows6 project +is Copyright (C) 2002-2014 OpenVPN Technologies, Inc. The +NSIS installer is Copyright (C) 2018 TunSafe, Copyright (C) +2014 OpenVPN Technologies, Inc. and (C) 2012 Alon Bar-Lev. +Both are released under the GPL version 2. See COPYING +for the full GPL license. The licensors also make the following +statement borrowed from the SPICE project: + +With respect to binaries built using the Microsoft(R) +Windows Driver Kit (WDK), GPLv2 does not extend to any code +contained in or derived from the WDK ("WDK Code"). As to +WDK Code, by using or distributing such binaries you agree +to be bound by the Microsoft Software License Terms for the +WDK. All WDK Code is considered by the GPLv2 licensors to +qualify for the special exception stated in section 3 of +GPLv2 (commonly known as the system library exception). + +The tap-windows.h file has been released under the MIT +license (see COPYRIGHT.MIT) as well as under GPLv2 (see +COPYRIGHT.GPL). This has been done to allow the use of the +header file in non-GPLv2 compatible projects. + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. 
+ + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. 
+ + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
+
diff --git a/installer/tap/ShellLink.dll b/installer/tap/ShellLink.dll
new file mode 100644
index 0000000..f57ded3
Binary files /dev/null and b/installer/tap/ShellLink.dll differ
diff --git a/installer/tap/build.bat b/installer/tap/build.bat
new file mode 100644
index 0000000..45251d3
--- /dev/null
+++ b/installer/tap/build.bat
@@ -0,0 +1 @@
+"C:\Dev\NSIS\makensis.exe" tap-windows6.nsi
\ No newline at end of file
diff --git a/installer/tap/icon.ico b/installer/tap/icon.ico
new file mode 100644
index 0000000..06b583b
Binary files /dev/null and b/installer/tap/icon.ico differ
diff --git a/installer/tap/install-whirl.bmp b/installer/tap/install-whirl.bmp
new file mode 100644
index 0000000..e1186bd
Binary files /dev/null and b/installer/tap/install-whirl.bmp differ
diff --git a/installer/tap/prebuilt/x64/OemVista.inf b/installer/tap/prebuilt/x64/OemVista.inf
new file mode 100644
index 0000000..d92e255
--- /dev/null
+++ b/installer/tap/prebuilt/x64/OemVista.inf
@@ -0,0 +1,191 @@
+; ****************************************************************************
+; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc.                       *
+; * This program is free software; you can redistribute it and/or modify     *
+; * it under the terms of the GNU General Public License version 2           *
+; * as published by the Free Software Foundation.
* +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. +; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = tap0901.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. 
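+; (For reference: the standard INF syntax for the directive below is
+;  DriverVer = mm/dd/yyyy[,w.x.y.z], so this line declares driver
+;  version 9.00.00.21 with a release date of April 21, 2016.)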
+ DriverVer=04/21/2016,9.00.00.21 + +[Strings] + DeviceDescription = "TAP-Windows Adapter V9" + Provider = "TAP-Windows Provider V9" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = tap0901, NTamd64 + +[tap0901.NTamd64] + %DeviceDescription% = tap0901.ndi, root\tap0901 ; Root enumerated + %DeviceDescription% = tap0901.ndi, tap0901 ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[tap0901.ndi] + CopyFiles = tap0901.driver, tap0901.files + AddReg = tap0901.reg + AddReg = tap0901.params.reg + Characteristics = + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[tap0901.ndi.Services] + AddService = tap0901, 2, tap0901.service + +[tap0901.reg] + HKR, Ndi, Service, 0, "tap0901" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[tap0901.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, "1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + +[tap0901.service] + DisplayName = %DeviceDescription% + ServiceType = 1 + StartType = 3 + ErrorControl = 1 + LoadOrderGroup = NDIS + ServiceBinary = %12%\tap0901.sys + +;----------------------------------------------------------------- +; File Installation +;----------------------------------------------------------------- + +;----------------- Copy Flags ------------ +; COPYFLG_NOSKIP = 0x02 +; COPYFLG_NOVERSIONCHECK = 0x04 +;----------------- Copy 
Flags ------------ + +; SourceDisksNames +; diskid = description[, [tagfile] [, , subdir]] +; 1 = "Intel Driver Disk 1",e100bex.sys,, + +[SourceDisksNames] + 1 = %DeviceDescription%, tap0901.sys + +; SourceDisksFiles +; filename_on_source = diskID[, [subdir][, size]] +; e100bex.sys = 1,, ; on distribution disk 1 + +[SourceDisksFiles] +tap0901.sys = 1 + +[DestinationDirs] + tap0901.files = 11 + tap0901.driver = 12 + +[tap0901.files] +; TapPanel.cpl,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK +; cipsrvr.exe,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +[tap0901.driver] + tap0901.sys,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +;--------------------------------------------------------------- +; End +;--------------------------------------------------------------- diff --git a/installer/tap/prebuilt/x64/tap0901.cat b/installer/tap/prebuilt/x64/tap0901.cat new file mode 100644 index 0000000..70ddd2c Binary files /dev/null and b/installer/tap/prebuilt/x64/tap0901.cat differ diff --git a/installer/tap/prebuilt/x64/tap0901.sys b/installer/tap/prebuilt/x64/tap0901.sys new file mode 100644 index 0000000..c662820 Binary files /dev/null and b/installer/tap/prebuilt/x64/tap0901.sys differ diff --git a/installer/tap/prebuilt/x64/tapinstall.exe b/installer/tap/prebuilt/x64/tapinstall.exe new file mode 100644 index 0000000..a1ebb9f Binary files /dev/null and b/installer/tap/prebuilt/x64/tapinstall.exe differ diff --git a/installer/tap/prebuilt/x86/OemVista.inf b/installer/tap/prebuilt/x86/OemVista.inf new file mode 100644 index 0000000..6cd6791 --- /dev/null +++ b/installer/tap/prebuilt/x86/OemVista.inf @@ -0,0 +1,191 @@ +; **************************************************************************** +; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc. * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. 
+; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = tap0901.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. + DriverVer=04/21/2016,9.00.00.21 + +[Strings] + DeviceDescription = "TAP-Windows Adapter V9" + Provider = "TAP-Windows Provider V9" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = tap0901 + +[tap0901] + %DeviceDescription% = tap0901.ndi, root\tap0901 ; Root enumerated + %DeviceDescription% = tap0901.ndi, tap0901 ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[tap0901.ndi] + CopyFiles = tap0901.driver, tap0901.files + AddReg = tap0901.reg + AddReg = tap0901.params.reg + Characteristics = + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[tap0901.ndi.Services] + AddService = tap0901, 2, tap0901.service + +[tap0901.reg] + HKR, Ndi, Service, 0, "tap0901" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[tap0901.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, "1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + 
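+; For reference, the service section below combines one value from each
+; table above: ServiceType = 1 (SERVICE_KERNEL_DRIVER) and StartType = 3
+; (SERVICE_DEMAND_START), i.e. the driver is loaded on demand rather
+; than at boot.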
+[tap0901.service]
+    DisplayName = %DeviceDescription%
+    ServiceType = 1
+    StartType = 3
+    ErrorControl = 1
+    LoadOrderGroup = NDIS
+    ServiceBinary = %12%\tap0901.sys
+
+;-----------------------------------------------------------------
+; File Installation
+;-----------------------------------------------------------------
+
+;----------------- Copy Flags ------------
+; COPYFLG_NOSKIP = 0x02
+; COPYFLG_NOVERSIONCHECK = 0x04
+;----------------- Copy Flags ------------
+
+; SourceDisksNames
+; diskid = description[, [tagfile] [, , subdir]]
+; 1 = "Intel Driver Disk 1",e100bex.sys,,
+
+[SourceDisksNames]
+    1 = %DeviceDescription%, tap0901.sys
+
+; SourceDisksFiles
+; filename_on_source = diskID[, [subdir][, size]]
+; e100bex.sys = 1,, ; on distribution disk 1
+
+[SourceDisksFiles]
+tap0901.sys = 1
+
+[DestinationDirs]
+    tap0901.files = 11
+    tap0901.driver = 12
+
+[tap0901.files]
+; TapPanel.cpl,,,6   ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+; cipsrvr.exe,,,6    ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+
+[tap0901.driver]
+    tap0901.sys,,,6  ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK
+
+;---------------------------------------------------------------
+; End
+;---------------------------------------------------------------
diff --git a/installer/tap/prebuilt/x86/tap0901.cat b/installer/tap/prebuilt/x86/tap0901.cat
new file mode 100644
index 0000000..d845310
Binary files /dev/null and b/installer/tap/prebuilt/x86/tap0901.cat differ
diff --git a/installer/tap/prebuilt/x86/tap0901.sys b/installer/tap/prebuilt/x86/tap0901.sys
new file mode 100644
index 0000000..fcba857
Binary files /dev/null and b/installer/tap/prebuilt/x86/tap0901.sys differ
diff --git a/installer/tap/prebuilt/x86/tapinstall.exe b/installer/tap/prebuilt/x86/tapinstall.exe
new file mode 100644
index 0000000..bc351c3
Binary files /dev/null and b/installer/tap/prebuilt/x86/tapinstall.exe differ
diff --git a/installer/tap/src/.appveyor.yml b/installer/tap/src/.appveyor.yml
new file mode 100644
index 0000000..09f2094
--- /dev/null
+++ b/installer/tap/src/.appveyor.yml
@@ -0,0 +1,5 @@
+version: 1.0.{build}
+build_script:
+- cmd: python buildtap.py -b
+artifacts:
+- path: '*'
diff --git a/installer/tap/src/.gitattributes b/installer/tap/src/.gitattributes
new file mode 100644
index 0000000..14d9eb5
--- /dev/null
+++ b/installer/tap/src/.gitattributes
@@ -0,0 +1 @@
+*.yml text=auto
diff --git a/installer/tap/src/.gitignore b/installer/tap/src/.gitignore
new file mode 100644
index 0000000..c84d745
--- /dev/null
+++ b/installer/tap/src/.gitignore
@@ -0,0 +1,10 @@
+dist/**
+*.pyc
+*.tar.gz
+src/config.h
+src/SOURCES
+src/build*.log
+src/obj*
+src/i386
+src/amd64
+tap-windows-*.exe
diff --git a/installer/tap/src/CONTRIBUTING.rst b/installer/tap/src/CONTRIBUTING.rst
new file mode 100644
index 0000000..6ee5908
--- /dev/null
+++ b/installer/tap/src/CONTRIBUTING.rst
@@ -0,0 +1,26 @@
+Contributing to tap-windows6
+============================
+
+To contribute to tap-windows6, please send your patches to the openvpn-devel
+mailing list:
+
+- https://lists.sourceforge.net/lists/listinfo/openvpn-devel
+
+The subject line should look like this:
+
+    [PATCH: tap-windows6] summary of the patch
+
+To avoid merging issues, patches should be created with git-format-patch or
+sent using git-send-email. The easiest way to add the subject line prefix is
+to use this option:
+
+    --subject-prefix='PATCH: tap-windows6'
+
+Patches that do not modify the actual driver code can be sent as GitHub pull
+requests.
Try to split large patches into small, atomic pieces to make reviews +and merging easier. + +If you want quick feedback on a patch, you can visit the #openvpn-devel channel +on Freenode. Note that you need to be logged in to join the channel: + +- http://freenode.net/faq.shtml#nicksetup diff --git a/installer/tap/src/COPYING b/installer/tap/src/COPYING new file mode 100644 index 0000000..a2dbdb8 --- /dev/null +++ b/installer/tap/src/COPYING @@ -0,0 +1,24 @@ +tap-windows6 license +-------------------- + +The source and object code of the tap-windows6 project +is Copyright (C) 2002-2014 OpenVPN Technologies, Inc. The +NSIS installer is Copyright (C) 2014 OpenVPN Technologies, +Inc. and (C) 2012 Alon Bar-Lev. Both are released under the +GPL version 2. See COPYRIGHT.GPL for the full GPL license. +The licensors also make the following statement borrowed +from the SPICE project: + +With respect to binaries built using the Microsoft(R) +Windows Driver Kit (WDK), GPLv2 does not extend to any code +contained in or derived from the WDK ("WDK Code"). As to +WDK Code, by using or distributing such binaries you agree +to be bound by the Microsoft Software License Terms for the +WDK. All WDK Code is considered by the GPLv2 licensors to +qualify for the special exception stated in section 3 of +GPLv2 (commonly known as the system library exception). + +The tap-windows.h file has been released under the MIT +license (see COPYRIGHT.MIT) as well as under GPLv2 (see +COPYRIGHT.GPL). This has been done to allow the use of the +header file in non-GPLv2 compatible projects. diff --git a/installer/tap/src/COPYRIGHT.GPL b/installer/tap/src/COPYRIGHT.GPL new file mode 100644 index 0000000..d159169 --- /dev/null +++ b/installer/tap/src/COPYRIGHT.GPL @@ -0,0 +1,339 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. 
You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. 
+ + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
diff --git a/installer/tap/src/COPYRIGHT.MIT b/installer/tap/src/COPYRIGHT.MIT
new file mode 100644
index 0000000..bfbb900
--- /dev/null
+++ b/installer/tap/src/COPYRIGHT.MIT
@@ -0,0 +1,20 @@
+The MIT License (MIT)
+Copyright © 2014 OpenVPN Technologies, Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the “Software”), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/installer/tap/src/MSCV-VSClass3.cer b/installer/tap/src/MSCV-VSClass3.cer
new file mode 100644
index 0000000..831757d
Binary files /dev/null and b/installer/tap/src/MSCV-VSClass3.cer differ
diff --git a/installer/tap/src/README.rst b/installer/tap/src/README.rst
new file mode 100644
index 0000000..c7039a9
--- /dev/null
+++ b/installer/tap/src/README.rst
@@ -0,0 +1,142 @@
+TAP-Windows driver (NDIS 6)
+===========================
+
+This is an NDIS 6 implementation of the TAP-Windows driver, used by OpenVPN and
+other apps. NDIS 6 drivers can run on Windows Vista or higher.
+
+Build
+-----
+
+To build, the following prerequisites are required:
+
+- Python 2.7
+- Microsoft Windows 7 WDK (Windows Driver Kit)
+- Windows code signing certificate
+- Git (not strictly required, but useful for running commands using the
+  bundled bash shell)
+- MakeNSIS (optional)
+- Patched source code directory of **devcon** sample from WDK (optional)
+- Prebuilt tapinstall.exe binaries (optional)
+
+Make sure you add Python's install directory (usually c:\\python27) to the PATH
+environment variable.
+
+These instructions have been tested on Windows 7 using Git Bash, as well as on
+Windows 2012 Server using Git Bash and Windows Powershell.
+
+View build script options::
+
+  $ python buildtap.py
+  Usage: buildtap.py [options]
+
+  Options:
+    -h, --help          show this help message and exit
+    -s SRC, --src=SRC   TAP-Windows top-level directory, default=
+    --ti=TAPINSTALL     tapinstall (i.e. devcon) directory (optional)
+    -d, --debug         enable debug build
+    -c, --clean         do an nmake clean before build
+    -b, --build         build TAP-Windows and possibly tapinstall (add -c to
+                        clean before build)
+    --sign              sign the driver files (disabled by default)
+    -p, --package       generate an NSIS installer from the compiled files
+    --cert=CERT         Common name of code signing certificate, default=openvpn
+    --crosscert=CERT    The cross-certificate file to use, default=MSCV-
+                        VSClass3.cer
+    --timestamp=URL     Timestamp URL to use, default=http://timestamp.verisign.c
+                        om/scripts/timstamp.dll
+    -a, --oas           Build for OpenVPN Access Server clients
+
+Edit **version.m4** and **paths.py** as necessary, then build::
+
+  $ python buildtap.py -b
+
+On successful completion, all build products will be placed in the "dist"
+directory as well as tap6.tar.gz. The NSIS installer package will be placed
+in the build root directory.
+
+Note that due to the strict driver signing requirements in Windows 10, you need
+an EV certificate to sign the driver files. These EV certificates may be
+stored inside a hardware device, which makes a fully automated signing process
+difficult, dangerous or impossible. Eventually the signing process will become
+even more involved, with drivers having to be submitted to the Windows
+Hardware Developer Center Dashboard portal. Therefore, by default, this
+buildsystem no longer signs any files. You can revert to the old behavior
+by using the --sign parameter.
+
+Building tapinstall (optional)
+------------------------------
+
+The build system supports building tapinstall.exe (a.k.a. devcon.exe). However,
+the devcon source code in WinDDK does not build without modifications, which
+cannot be made public due to licensing restrictions. For these reasons the
+default behavior is to reuse pre-built executables.
+To make sure the buildsystem finds the executables, create the following
+directory structure under the tap-windows6 directory:
+
+::
+
+    tapinstall
+    └── 7600
+        ├── objfre_wlh_amd64
+        │   └── amd64
+        │       └── tapinstall.exe
+        └── objfre_wlh_x86
+            └── i386
+                └── tapinstall.exe
+
+This structure is identical to what building tapinstall would create. Replace
+7600 with the major number of your WinDDK version. Finally, call buildtap.py
+with "--ti=tapinstall".
+
+Please note that the NSIS packaging (-p) step will fail if you don't have
+tapinstall.exe available. Also, don't use the "-c" flag, or the above
+directories will be wiped before MakeNSIS is able to find them.
+
+Install/Update/Remove
+---------------------
+
+The driver can be installed using a command-line tool, tapinstall.exe, which is
+bundled with OpenVPN and tap-windows installers. Note that in some versions of
+OpenVPN, tapinstall.exe is called devcon.exe. To install, update or remove the
+tap-windows NDIS 6 driver, follow these steps:
+
+- place tapinstall.exe/devcon.exe in your PATH
+- open an Administrator shell
+- cd to **dist**
+- cd to **amd64** or **i386** depending on your system's processor architecture.
+
+Install::
+
+  $ tapinstall install OemVista.inf TAP0901
+
+Update::
+
+  $ tapinstall update OemVista.inf TAP0901
+
+Remove::
+
+  $ tapinstall remove TAP0901
+
+Notes on proxies
+----------------
+
+It is possible to build tap-windows6 without connectivity to the Internet, but
+any attempt to timestamp the driver will fail. For this reason, configure your
+outbound proxy server before starting the build. Note that the command prompt
+also needs to be restarted to make use of new proxy settings.
+
+Notes on Authenticode signatures
+--------------------------------
+
+Recent Windows versions such as Windows 10 are fairly picky about the
+Authenticode signatures of kernel-mode drivers. In addition, making older
+Windows versions such as Vista play along with signatures that Windows 10
+accepts can be rather challenging. A good starting point on this topic is the
+`building tap-windows6 `_
+page on the OpenVPN community wiki. As that page points out, having two
+completely separate Authenticode signatures may be the only reasonable option.
+Fortunately there is a tool, `Sign-Tap6 `_,
+which can be used to append secondary signatures to the tap-windows6 driver or
+to handle the entire signing process if necessary.
+
+License
+-------
+
+See the file `COPYING `_.
diff --git a/installer/tap/src/buildtap.py b/installer/tap/src/buildtap.py new file mode 100644 index 0000000..6f506cd --- /dev/null +++ b/installer/tap/src/buildtap.py @@ -0,0 +1,513 @@ +# build TAP-Windows NDIS 6.0 driver + +import sys, os, re, shutil, tarfile + +import paths + +class BuildTAPWindows(object): + # regex for doing search replace on @MACRO@ style macros + macro_amper = re.compile(r"@(\w+)@") + + def __init__(self, opt): + self.opt = opt # command line options + if not opt.src: + raise ValueError("source directory undefined") + self.top = os.path.realpath(opt.src) # top-level dir + self.src = os.path.join(self.top, 'src') # src/openvpn dir + if opt.tapinstall: + self.top_tapinstall = os.path.realpath(opt.tapinstall) # tapinstall dir + else: + self.top_tapinstall = None + if opt.package: + raise ValueError("parameter -p must be used with --ti") + + # path to DDK + self.ddk_path = paths.DDK + + # path to makensis + self.makensis = os.path.join(paths.NSIS, 'makensis.exe') + + # driver signing options + self.codesign = opt.codesign + self.sign_cn = opt.cert + self.sign_cert = opt.certfile + self.cert_pw = opt.certpw + self.crosscert = os.path.join(self.top, opt.crosscert) + + self.inf2cat_cmd = os.path.join(self.ddk_path, 'bin', 'selfsign', 'Inf2Cat') + self.signtool_cmd = os.path.join(self.ddk_path, 'bin', 'x86', 'SignTool') + + self.timestamp_server = opt.timestamp + + # split a path into a list of components + @staticmethod + def path_split(path): + folders = [] + while True: + path, folder = os.path.split(path) + if folder: + folders.append(folder) + else: + if path: + folders.append(path) + break + folders.reverse() + return folders + + # run a command + def system(self, cmd): + print "RUN:", cmd + os.system(cmd) + + # make a directory + def mkdir(self, dir): + try: + os.mkdir(dir) + except: + pass + else: + print "MKDIR", dir + + # make a directory including parents + def makedirs(self, dir): + try: + os.makedirs(dir) + except: + pass + else: + print "MAKEDIRS", dir + + # copy a file + def cp(self, src, dest): + print "COPY %s %s" % (src, dest) + shutil.copy2(src, dest) + + # make a tarball + @staticmethod + def make_tarball(output_filename, source_dir, arcname=None): + if arcname is None: + arcname = os.path.basename(source_dir) + tar = tarfile.open(output_filename, "w:gz") + tar.add(source_dir, arcname=arcname) + tar.close() + print "***** Generated tarball:", output_filename + + # remove a file + def rm(self, file): + print "RM", file + os.remove(file) + + # remove whole directory tree, like rm -rf + def rmtree(self, dir): + print "RMTREE", dir + shutil.rmtree(dir, ignore_errors=True) + + # return path of dist directory + def dist_path(self): + return os.path.join(self.top, 'dist') + + # return path of dist include directory + def dist_include_path(self): + return os.path.join(self.dist_path(), 'include') + + # make a distribution directory (if absent) and return its path + def mkdir_dist(self, x64): + dir = self.drvdir(self.dist_path(), x64) + self.makedirs(dir) + return dir + + # run an MSVC command + def build_vc(self, cmd): + self.system('cmd /c "vcvarsall.bat x86 && %s"' % (cmd,)) + + # parse version.m4 file + def parse_version_m4(self): + kv = {} + r = re.compile(r'^define\(\[?(\w+)\]?,\s*\[(.*)\]\)') + with open(os.path.join(self.top, 'version.m4')) as f: + for line in f: + line = line.rstrip() + m = re.match(r, line) + if m: + g = m.groups() + kv[g[0]] = g[1] + return kv + + # our tap-windows version.m4 settings + def gen_version_m4(self, x64): + kv = 
self.parse_version_m4() + if self.opt.oas: # for OpenVPN Connect (i.e. OpenVPN Access Server) + kv['PRODUCT_NAME'] = "OpenVPNAS" + kv['PRODUCT_TAP_WIN_DEVICE_DESCRIPTION'] = "TAP Adapter OAS NDIS 6.0" + kv['PRODUCT_TAP_WIN_PROVIDER'] = "TAP-Win32 Provider OAS" + kv['PRODUCT_TAP_WIN_COMPONENT_ID'] = "tapoas" + + if (x64): + kv['INF_PROVIDER_SUFFIX'] = ", NTamd64" + kv['INF_SECTION_SUFFIX'] = ".NTamd64" + else: + kv['INF_PROVIDER_SUFFIX'] = "" + kv['INF_SECTION_SUFFIX'] = "" + return kv + + # DDK major version number (as a string) + def ddk_major(self): + ddk_ver = os.path.basename(self.ddk_path) + ddk_ver_major = re.match(r'^(\d+)\.', ddk_ver).groups()[0] + return ddk_ver_major + + # return tapinstall source directory + def tapinstall_src(self): + if self.top_tapinstall: + d = os.path.join(self.top_tapinstall, self.ddk_major()) + if os.path.exists(d): + return d + else: + return self.top_tapinstall + + # preprocess a file, doing macro substitution on @MACRO@ + def preprocess(self, kv, in_path, out_path=None): + def repfn(m): + var, = m.groups() + return kv.get(var, '') + if out_path is None: + out_path = in_path + with open(in_path+'.in') as f: + modtxt = re.sub(self.macro_amper, repfn, f.read()) + with open(out_path, "w") as f: + f.write(modtxt) + + # set up configuration files for building tap driver + def config_tap(self, x64): + kv = self.gen_version_m4(x64) + drvdir = self.drvdir(self.src, x64) + self.mkdir(drvdir) + self.preprocess(kv, os.path.join(self.src, "OemVista.inf"), os.path.join(drvdir, "OemVista.inf")) + self.preprocess(kv, os.path.join(self.src, "SOURCES")) + self.preprocess(kv, os.path.join(self.src, "config.h")) + + # set up configuration files for building tapinstall + def config_tapinstall(self, x64): + kv = {} + tisrc = self.tapinstall_src() + self.preprocess(kv, os.path.join(tisrc, "sources")) + + # build a "build" file using DDK + def build_ddk(self, dir, x64, debug): + setenv_bat = os.path.join(self.ddk_path, 'bin', 'setenv.bat') + target = 'chk' if debug else 'fre' + if x64: + target += ' x64' + else: + target += ' x86' + + target += ' wlh' # vista + + self.system('cmd /c "%s %s %s no_oacr && cd %s && build -cef"' % ( + setenv_bat, + self.ddk_path, + target, + dir + )) + + # copy tap driver files to dist + def copy_tap_to_dist(self, x64): + dist = self.mkdir_dist(x64) + drvdir = self.drvdir(self.src, x64) + for dirpath, dirnames, filenames in os.walk(drvdir): + for f in filenames: + path = os.path.join(dirpath, f) + if f.endswith('.inf') or f.endswith('.cat') or f.endswith('.sys'): + destfn = os.path.join(dist, f) + self.cp(path, destfn) + + # copy tap-windows.h to dist/include + def copy_include(self): + incdir = os.path.join(self.dist_path(), 'include') + self.makedirs(incdir) + self.cp(os.path.join(self.src, 'tap-windows.h'), incdir) + + # copy tapinstall to dist + def copy_tapinstall_to_dist(self, x64): + dist = self.mkdir_dist(x64) + t = os.path.basename(dist) + tisrc = self.tapinstall_src() + for dirpath, dirnames, filenames in os.walk(tisrc): + if os.path.basename(dirpath) == t: + for f in filenames: + path = os.path.join(dirpath, f) + if f == 'tapinstall.exe': + destfn = os.path.join(dist, f) + self.cp(path, destfn) + + # copy dist-src to dist; dist-src contains prebuilt files + # for some old platforms (such as win2k) + def copy_dist_src_to_dist(self): + dist_path = self.path_split(self.dist_path()) + dist_src = os.path.join(self.top, "dist-src") + baselen = len(self.path_split(dist_src)) + for dirpath, dirnames, filenames in os.walk(dist_src): + 
dirpath_split = self.path_split(dirpath) + depth = len(dirpath_split) - baselen + dircomp = () + if depth > 0: + dircomp = dirpath_split[-depth:] + for exclude_dir in ('.svn', '.git'): + if exclude_dir in dirnames: + dirnames.remove(exclude_dir) + for f in filenames: + path = os.path.join(dirpath, f) + destdir = os.path.join(*(dist_path + dircomp)) + destfn = os.path.join(destdir, f) + self.makedirs(destdir) + self.cp(path, destfn) + + # build, sign, and verify tap driver + def build_tap(self): + for x64 in (False, True): + print "***** BUILD TAP x64=%s" % (x64,) + self.config_tap(x64=x64) + self.build_ddk(dir=self.src, x64=x64, debug=opt.debug) + if self.codesign: + self.sign_verify(x64=x64) + self.copy_tap_to_dist(x64=x64) + + # build tapinstall + def build_tapinstall(self): + for x64 in (False, True): + print "***** BUILD TAPINSTALL x64=%s" % (x64,) + tisrc = self.tapinstall_src() + # Only build if we have a chance of succeeding + sources_in = os.path.join(tisrc, "sources.in") + if os.path.isfile(sources_in): + self.config_tapinstall(x64=x64) + self.build_ddk(tisrc, x64=x64, debug=opt.debug) + if self.codesign: + self.sign_verify_ti(x64=x64) + self.copy_tapinstall_to_dist(x64) + + # build tap driver and tapinstall + def build(self): + self.build_tap() + self.copy_include() + if self.top_tapinstall: + self.build_tapinstall() + self.copy_dist_src_to_dist() + + print "***** Generated files" + self.dump_dist() + + tapbase = "tapoas6" if self.opt.oas else "tap6" + self.make_tarball(os.path.join(self.top, tapbase+".tar.gz"), + self.dist_path(), + tapbase) + + # package the produced files into an NSIS installer + def package(self): + + # Generate license.txt and converting LF -> CRLF as we go. Apparently + # this type of conversion will stop working in Python 3.x. + dst = open(os.path.join(self.dist_path(), 'license.txt'), mode='wb') + + for f in (os.path.join(self.top, 'COPYING'), os.path.join(self.top, 'COPYRIGHT.GPL')): + src=open(f, mode='rb') + dst.write(src.read()+'\r\n') + src.close() + + dst.close() + + # Copy tap-windows.h to dist include directory + self.mkdir(self.dist_include_path()) + self.cp(os.path.join(self.src, 'tap-windows.h'), self.dist_include_path()) + + # Get variables from version.m4 + kv = self.gen_version_m4(True) + + installer_type = "" + if self.opt.oas: + installer_type = "-oas" + installer_file=os.path.join(self.top, 'tap-windows'+installer_type+'-'+kv['PRODUCT_VERSION']+'-I'+kv['PRODUCT_TAP_WIN_BUILD']+'.exe') + + installer_cmd = "\"%s\" -DDEVCON32=%s -DDEVCON64=%s -DDEVCON_BASENAME=%s -DPRODUCT_TAP_WIN_COMPONENT_ID=%s -DPRODUCT_NAME=%s -DPRODUCT_VERSION=%s -DPRODUCT_TAP_WIN_BUILD=%s -DOUTPUT=%s -DIMAGE=%s %s" % \ + (self.makensis, + self.tifile(x64=False), + self.tifile(x64=True), + 'tapinstall.exe', + kv['PRODUCT_TAP_WIN_COMPONENT_ID'], + kv['PRODUCT_NAME'], + kv['PRODUCT_VERSION'], + kv['PRODUCT_TAP_WIN_BUILD'], + installer_file, + self.dist_path(), + os.path.join(self.top, 'installer', 'tap-windows6.nsi') + ) + + self.system(installer_cmd) + self.sign(installer_file) + + # like find . 
| sort + def enum_tree(self, dir): + data = [] + for dirpath, dirnames, filenames in os.walk(dir): + data.append(dirpath) + for f in filenames: + data.append(os.path.join(dirpath, f)) + data.sort() + return data + + # show files in dist + def dump_dist(self): + for f in self.enum_tree(self.dist_path()): + print f + + # remove generated files from given directory tree + def clean_tree(self, top): + for dirpath, dirnames, filenames in os.walk(top): + for d in list(dirnames): + if d in ('.svn', '.git'): + dirnames.remove(d) + else: + path = os.path.join(dirpath, d) + deldir = False + if d in ('amd64', 'i386', 'dist'): + deldir = True + if d.endswith('_amd64') or d.endswith('_x86'): + deldir = True + if deldir: + self.rmtree(path) + dirnames.remove(d) + for f in filenames: + path = os.path.join(dirpath, f) + if f in ('SOURCES', 'sources', 'config.h'): + self.rm(path) + if f.endswith('.log') or f.endswith('.wrn') or f.endswith('.cod'): + self.rm(path) + + # remove generated files for both tap-windows and tapinstall + def clean(self): + self.clean_tree(self.top) + if self.top_tapinstall: + self.clean_tree(self.top_tapinstall) + + # BEGIN Driver signing + + def drvdir(self, dir, x64): + if x64: + return os.path.join(dir, "amd64") + else: + return os.path.join(dir, "i386") + + def drvfile(self, x64, ext): + dd = self.drvdir(self.src, x64) + for dirpath, dirnames, filenames in os.walk(dd): + catlist = [ f for f in filenames if f.endswith(ext) ] + assert(len(catlist)==1) + return os.path.join(dd, catlist[0]) + + def tifile(self, x64): + if x64: + return os.path.join(self.tapinstall_src(), 'objfre_wlh_amd64', 'amd64', 'tapinstall.exe') + else: + return os.path.join(self.tapinstall_src(), 'objfre_wlh_x86', 'i386', 'tapinstall.exe') + + def inf2cat(self, x64): + if x64: + oslist = "Vista_X64,Server2008_X64,Server2008R2_X64,7_X64" + else: + oslist = "Vista_X86,Server2008_X86,7_X86" + self.system("%s /driver:%s /os:%s" % (self.inf2cat_cmd, self.drvdir(self.src, x64), oslist)) + + def sign(self, file): + certspec = "" + if self.sign_cert: + certspec += "/f '%s' " % self.sign_cert + if self.cert_pw: + certspec += "/p '%s' " % self.cert_pw + else: + certspec += "/s my /n '%s' " % self.sign_cn + + self.system("%s sign /v /ac %s %s /t %s %s" % ( + self.signtool_cmd, + self.crosscert, + certspec, + self.timestamp_server, + file, + )) + + def sign_driver(self, x64): + self.sign(self.drvfile(x64, '.cat')) + + def verify(self, x64): + self.system("%s verify /kp /v /c %s %s" % ( + self.signtool_cmd, + self.drvfile(x64, '.cat'), + self.drvfile(x64, '.sys'), + )) + + def sign_verify(self, x64): + self.inf2cat(x64) + self.sign_driver(x64) + self.verify(x64) + + def sign_verify_ti(self, x64): + self.sign(self.tifile(x64)) + self.system("%s verify /pa %s" % (self.signtool_cmd, self.tifile(x64))) + + # END Driver signing + +if __name__ == '__main__': + # parse options + import optparse, codecs + codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None) # windows UTF-8 hack + op = optparse.OptionParser() + + # defaults + src = os.path.dirname(os.path.realpath(__file__)) + cert = "openvpn" + crosscert = "MSCV-VSClass3.cer" # cross certs available here: http://msdn.microsoft.com/en-us/library/windows/hardware/dn170454(v=vs.85).aspx + timestamp = "http://timestamp.verisign.com/scripts/timstamp.dll" + + op.add_option("-s", "--src", dest="src", metavar="SRC", + + default=src, + help="TAP-Windows top-level directory, default=%s" % (src,)) + op.add_option("--ti", dest="tapinstall", 
metavar="TAPINSTALL", + help="tapinstall (i.e. devcon) directory (optional)") + op.add_option("-d", "--debug", action="store_true", dest="debug", + help="enable debug build") + op.add_option("-c", "--clean", action="store_true", dest="clean", + help="do an nmake clean before build") + op.add_option("-b", "--build", action="store_true", dest="build", + help="build TAP-Windows and possibly tapinstall (add -c to clean before build)") + op.add_option("--sign", action="store_true", dest="codesign", + default=False, help="sign the driver files") + op.add_option("-p", "--package", action="store_true", dest="package", + help="generate an NSIS installer from the compiled files") + op.add_option("--cert", dest="cert", metavar="CERT", + default=cert, + help="Common name of code signing certificate, default=%s" % (cert,)) + op.add_option("--certfile", dest="certfile", metavar="CERTFILE", + help="Path to the code signing certificate") + op.add_option("--certpw", dest="certpw", metavar="CERTPW", + help="Password for the code signing certificate/key (optional)") + op.add_option("--crosscert", dest="crosscert", metavar="CERT", + default=crosscert, + help="The cross-certificate file to use, default=%s" % (crosscert,)) + op.add_option("--timestamp", dest="timestamp", metavar="URL", + default=timestamp, + help="Timestamp URL to use, default=%s" % (timestamp,)) + op.add_option("-a", "--oas", action="store_true", dest="oas", + help="Build for OpenVPN Access Server clients") + (opt, args) = op.parse_args() + + if len(sys.argv) <= 1: + op.print_help() + sys.exit(1) + + btw = BuildTAPWindows(opt) + if opt.clean: + btw.clean() + if opt.build: + btw.build() + if opt.package: + btw.package() diff --git a/installer/tap/src/installer/ShellLink.dll b/installer/tap/src/installer/ShellLink.dll new file mode 100644 index 0000000..f57ded3 Binary files /dev/null and b/installer/tap/src/installer/ShellLink.dll differ diff --git a/installer/tap/src/installer/icon.ico b/installer/tap/src/installer/icon.ico new file mode 100644 index 0000000..03ea0b1 Binary files /dev/null and b/installer/tap/src/installer/icon.ico differ diff --git a/installer/tap/src/installer/install-whirl.bmp b/installer/tap/src/installer/install-whirl.bmp new file mode 100644 index 0000000..03f33fc Binary files /dev/null and b/installer/tap/src/installer/install-whirl.bmp differ diff --git a/installer/tap/src/installer/tap-windows6.nsi b/installer/tap/src/installer/tap-windows6.nsi new file mode 100644 index 0000000..2d03566 --- /dev/null +++ b/installer/tap/src/installer/tap-windows6.nsi @@ -0,0 +1,340 @@ +; **************************************************************************** +; * Copyright (C) 2002-2010 OpenVPN Technologies, Inc. * +; * Copyright (C) 2012 Alon Bar-Lev * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; TAP-Windows install script for Windows, using NSIS + +SetCompressor /SOLID lzma + +!addplugindir . 
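+
+; This script is driven by buildtap.py, which passes OUTPUT, IMAGE,
+; DEVCON32/DEVCON64, DEVCON_BASENAME and the PRODUCT_* values from
+; version.m4 as -D defines. Illustrative invocation only -- the paths and
+; values below are placeholders, wrapped here for readability:
+;
+;   makensis -DDEVCON32=<x86 tapinstall.exe> -DDEVCON64=<x64 tapinstall.exe>
+;     -DDEVCON_BASENAME=tapinstall.exe -DPRODUCT_TAP_WIN_COMPONENT_ID=tapoas
+;     -DPRODUCT_NAME=... -DPRODUCT_VERSION=... -DPRODUCT_TAP_WIN_BUILD=...
+;     -DOUTPUT=<installer .exe> -DIMAGE=<dist directory> tap-windows6.nsi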
+!include "MUI.nsh" +!include "StrFunc.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!include FileFunc.nsh +!insertmacro GetParameters +!insertmacro GetOptions + +${StrLoc} + +;-------------------------------- +;Configuration + +;General + +OutFile "${OUTPUT}" + +ShowInstDetails show +ShowUninstDetails show + +;Remember install folder +InstallDirRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" "" + +;-------------------------------- +;Modern UI Configuration + +Name "${PRODUCT_NAME} ${PRODUCT_VERSION}-I${PRODUCT_TAP_WIN_BUILD}" + +!define MUI_WELCOMEPAGE_TEXT "This wizard will guide you through the installation of ${PRODUCT_NAME}, a kernel driver to provide virtual tap device functionality on Windows originally written by James Yonan.\r\n\r\nNote that ${PRODUCT_NAME} will only run on Windows Vista or later.\r\n\r\n\r\n" + +!define MUI_COMPONENTSPAGE_TEXT_TOP "Select the components to install/upgrade. Stop any ${PRODUCT_NAME} processes or the ${PRODUCT_NAME} service if it is running. All DLLs are installed locally." + +!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE + +!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "${IMAGE}\license.txt" +!insertmacro MUI_PAGE_COMPONENTS +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +!insertmacro MUI_UNPAGE_FINISH + +;-------------------------------- +;Languages + +!insertmacro MUI_LANGUAGE "English" + +;-------------------------------- +;Language Strings + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install/Upgrade the TAP virtual device driver. Will not interfere with CIPE." +LangString DESC_SecTAPUtilities ${LANG_ENGLISH} "Install the TAP Utilities." +LangString DESC_SecTAPSDK ${LANG_ENGLISH} "Install the TAP SDK." + +;-------------------------------- +;Reserve Files + +;Things that need to be extracted on first (keep these lines before any File command!) +;Only useful for BZIP2 compression + +ReserveFile "install-whirl.bmp" + +;-------------------------------- +;Macros + +!macro SelectByParameter SECT PARAMETER DEFAULT + ${GetOptions} $R0 "/${PARAMETER}=" $0 + ${If} ${DEFAULT} == 0 + ${If} $0 == 1 + !insertmacro SelectSection ${SECT} + ${EndIf} + ${Else} + ${If} $0 != 0 + !insertmacro SelectSection ${SECT} + ${EndIf} + ${EndIf} +!macroend + +;-------------------------------- +;Installer Sections + +Section /o "TAP Virtual Ethernet Adapter" SecTAP + + SetOverwrite on + + ${If} ${RunningX64} + DetailPrint "We are running on a 64-bit system." + + SetOutPath "$INSTDIR\bin" + File "${DEVCON64}" + + SetOutPath "$INSTDIR\driver" + File "${IMAGE}\amd64\OemVista.inf" + File "${IMAGE}\amd64\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "${IMAGE}\amd64\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${Else} + DetailPrint "We are running on a 32-bit system." 
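+    ; As in the 64-bit branch above: $INSTDIR\bin receives the matching
+    ; tapinstall/devcon binary and $INSTDIR\driver the architecture-specific
+    ; inf/cat/sys files, here from the i386 build output.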
+ + SetOutPath "$INSTDIR\bin" + File "${DEVCON32}" + + SetOutPath "$INSTDIR\driver" + File "${IMAGE}\i386\OemVista.inf" + File "${IMAGE}\i386\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "${IMAGE}\i386\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${EndIf} +SectionEnd + +Section /o "TAP Utilities" SecTAPUtilities + SetOverwrite on + + # Delete previous start menu + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + FileOpen $R0 "$INSTDIR\bin\addtap.bat" w + FileWrite $R0 "rem Add a new TAP virtual ethernet adapter$\r$\n" + FileWrite $R0 '"$INSTDIR\bin\${DEVCON_BASENAME}" install "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + FileOpen $R0 "$INSTDIR\bin\deltapall.bat" w + FileWrite $R0 "echo WARNING: this script will delete ALL TAP virtual adapters (use the device manager to delete adapters one at a time)$\r$\n" + FileWrite $R0 "pause$\r$\n" + FileWrite $R0 '"$INSTDIR\bin\${DEVCON_BASENAME}" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + ; Create shortcuts + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}\Utilities" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" "$INSTDIR\bin\addtap.bat" "" + ; set runas admin flag on the addtap link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on addtap failed: status = $0" + ${Endif} + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" "$INSTDIR\bin\deltapall.bat" "" + ; set runas admin flag on the deltapall link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on deltapall failed: status = $0" + ${Endif} +SectionEnd + +Section /o "TAP SDK" SecTAPSDK + SetOverwrite on + SetOutPath "$INSTDIR\include" + File "${IMAGE}\include\tap-windows.h" +SectionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors + +${IfNot} ${AtLeastWinVista} + MessageBox MB_OK "This package requires at least Windows Vista" + SetErrorLevel 1 + Quit +${EndIf} + + !insertmacro SelectByParameter ${SecTAP} SELECT_TAP 1 + !insertmacro SelectByParameter ${SecTAPUtilities} SELECT_UTILITIES 0 + !insertmacro SelectByParameter ${SecTAPSDK} SELECT_SDK 0 + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} ${RunningX64} + SetRegView 64 + StrCpy $INSTDIR "$PROGRAMFILES64\${PRODUCT_NAME}" + ${Else} + StrCpy $INSTDIR "$PROGRAMFILES\${PRODUCT_NAME}" + ${EndIf} +FunctionEnd + +;-------------------------------- +;Dependencies + +Function .onSelChange + ${If} ${SectionIsSelected} ${SecTAPUtilities} + !insertmacro SelectSection ${SecTAP} + ${EndIf} +FunctionEnd + +;-------------------- +;Post-install section + +Section -post + + ; Store README, license, icon + SetOverwrite on + SetOutPath $INSTDIR + File "${IMAGE}\license.txt" + File "icon.ico" + + ${If} ${SectionIsSelected} ${SecTAP} + ; + ; install/upgrade TAP driver if selected, using devcon + ; + ; TAP install/update was selected. + ; Should we install or update? + ; If tapinstall error occurred, $R5 will + ; be nonzero. 
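+    ;
+    ; Sketch of the detection flow below: "IntOp $R5 0 & 0" zeroes the
+    ; cumulative status, the "devcon hwids" output is captured with
+    ; nsExec::ExecToStack, and StrLoc searches it for the component id.
+    ; An empty StrLoc result selects "install"; a hit selects "update".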
+ IntOp $R5 0 & 0 + nsExec::ExecToStack '"$INSTDIR\bin\${DEVCON_BASENAME}" hwids ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + IntOp $R5 $R5 | $R0 + DetailPrint "${DEVCON_BASENAME} hwids returned: $R0" + + ; If tapinstall output string contains "${PRODUCT_TAP_WIN_COMPONENT_ID}" we assume + ; that TAP device has been previously installed, + ; therefore we will update, not install. + Push "${PRODUCT_TAP_WIN_COMPONENT_ID}" + Push ">" + Call StrLoc + Pop $R0 + + ${If} $R5 == 0 + ${If} $R0 == "" + StrCpy $R1 "install" + ${Else} + StrCpy $R1 "update" + ${EndIf} + DetailPrint "TAP $R1 (${PRODUCT_TAP_WIN_COMPONENT_ID}) (May require confirmation)" + nsExec::ExecToLog '"$INSTDIR\bin\${DEVCON_BASENAME}" $R1 "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + ${If} $R0 == "" + IntOp $R0 0 & 0 + SetRebootFlag true + DetailPrint "REBOOT flag set" + ${EndIf} + IntOp $R5 $R5 | $R0 + DetailPrint "${DEVCON_BASENAME} returned: $R0" + ${EndIf} + + DetailPrint "${DEVCON_BASENAME} cumulative status: $R5" + ${If} $R5 != 0 + MessageBox MB_OK "An error occurred installing the TAP device driver." + ${EndIf} + + ; Store install folder in registry + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + ${EndIf} + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\icon.ico" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://openvpn.net/index.php/open-source.html" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://openvpn.net" + + ${GetSize} "$INSTDIR" "/S=0K" $0 $1 $2 + IntFmt $0 "0x%08X" $0 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "EstimatedSize" "$0" + +SectionEnd + +;-------------------------------- +;Descriptions + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAP} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPUtilities} $(DESC_SecTAPUtilities) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPSDK} $(DESC_SecTAPSDK) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +;-------------------------------- +;Uninstaller Section + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + DetailPrint "TAP REMOVE" + nsExec::ExecToLog '"$INSTDIR\bin\${DEVCON_BASENAME}" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + DetailPrint "${DEVCON_BASENAME} remove returned: $R0" + + Delete 
"$INSTDIR\bin\${DEVCON_BASENAME}" + Delete "$INSTDIR\bin\addtap.bat" + Delete "$INSTDIR\bin\deltapall.bat" + + Delete "$INSTDIR\driver\OemVista.inf" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + + Delete "$INSTDIR\include\tap-windows.h" + + Delete "$INSTDIR\icon.ico" + Delete "$INSTDIR\license.txt" + Delete "$INSTDIR\Uninstall.exe" + + RMDir "$INSTDIR\bin" + RMDir "$INSTDIR\driver" + RMDir "$INSTDIR\include" + RMDir "$INSTDIR" + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" + +SectionEnd diff --git a/installer/tap/src/paths.py b/installer/tap/src/paths.py new file mode 100644 index 0000000..8598446 --- /dev/null +++ b/installer/tap/src/paths.py @@ -0,0 +1,3 @@ +# Windows 7 DDK +DDK = "C:\\winddk\\7600.16385.1" +NSIS = "C:\\Program Files (x86)\\NSIS" diff --git a/installer/tap/src/src/MAKEFILE b/installer/tap/src/src/MAKEFILE new file mode 100644 index 0000000..d5bedee --- /dev/null +++ b/installer/tap/src/src/MAKEFILE @@ -0,0 +1,8 @@ +# +# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source +# file to this component. This file merely indirects to the real make file +# that is shared by all the driver components of the Windows NT DDK +# + +!INCLUDE $(NTMAKEENV)\makefile.def + diff --git a/installer/tap/src/src/OemVista.inf.in b/installer/tap/src/src/OemVista.inf.in new file mode 100644 index 0000000..004ed62 --- /dev/null +++ b/installer/tap/src/src/OemVista.inf.in @@ -0,0 +1,191 @@ +; **************************************************************************** +; * Copyright (C) 2002-2014 OpenVPN Technologies, Inc. * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; SYNTAX CHECKER +; cd \WINDDK\3790\tools\chkinf +; chkinf c:\src\openvpn\tap-win32\i386\oemvista.inf +; OUTPUT -> file:///c:/WINDDK/3790/tools/chkinf/htm/c%23+src+openvpn+tap-win32+i386+__OemWin2k.htm + +; INSTALL/REMOVE DRIVER +; tapinstall install OemVista.inf tapoas +; tapinstall update OemVista.inf tapoas +; tapinstall remove tapoas + +;********************************************************* +; Note to Developers: +; +; If you are bundling the TAP-Windows driver with your app, +; you should try to rename it in such a way that it will +; not collide with other instances of TAP-Windows defined +; by other apps. Multiple versions of the TAP-Windows +; driver, each installed by different apps, can coexist +; on the same machine if you follow these guidelines. +; NOTE: these instructions assume you are editing the +; generated OemWin2k.inf file, not the source +; OemWin2k.inf.in file which is preprocessed by winconfig +; and uses macro definitions from settings.in. +; +; (1) Rename all tapXXXX instances in this file to +; something different (use at least 5 characters +; for this name!) +; (2) Change the "!define TAP" definition in openvpn.nsi +; to match what you changed tapXXXX to. +; (3) Change TARGETNAME in SOURCES to match what you +; changed tapXXXX to. +; (4) Change TAP_COMPONENT_ID in common.h to match what +; you changed tapXXXX to. +; (5) Change SZDEPENDENCIES in service.h to match what +; you changed tapXXXX to. 
+; (6) Change DeviceDescription and Provider strings. +; (7) Change PRODUCT_TAP_WIN_DEVICE_DESCRIPTION in constants.h to what you +; set DeviceDescription to. +; +;********************************************************* + +[Version] + Signature = "$Windows NT$" + CatalogFile = @PRODUCT_TAP_WIN_COMPONENT_ID@.cat + ClassGUID = {4d36e972-e325-11ce-bfc1-08002be10318} + Provider = %Provider% + Class = Net + +; This version number should match the version +; number given in SOURCES. + DriverVer=@PRODUCT_TAP_WIN_RELDATE@,@PRODUCT_TAP_WIN_MAJOR@.00.00.@PRODUCT_TAP_WIN_MINOR@ + +[Strings] + DeviceDescription = "@PRODUCT_TAP_WIN_DEVICE_DESCRIPTION@" + Provider = "@PRODUCT_TAP_WIN_PROVIDER@" + +;---------------------------------------------------------------- +; Manufacturer + Product Section (Done) +;---------------------------------------------------------------- +[Manufacturer] + %Provider% = @PRODUCT_TAP_WIN_COMPONENT_ID@@INF_PROVIDER_SUFFIX@ + +[@PRODUCT_TAP_WIN_COMPONENT_ID@@INF_SECTION_SUFFIX@] + %DeviceDescription% = @PRODUCT_TAP_WIN_COMPONENT_ID@.ndi, root\@PRODUCT_TAP_WIN_COMPONENT_ID@ ; Root enumerated + %DeviceDescription% = @PRODUCT_TAP_WIN_COMPONENT_ID@.ndi, @PRODUCT_TAP_WIN_COMPONENT_ID@ ; Legacy + +;--------------------------------------------------------------- +; Driver Section (Done) +;--------------------------------------------------------------- + +;----------------- Characteristics ------------ +; NCF_PHYSICAL = 0x04 +; NCF_VIRTUAL = 0x01 +; NCF_SOFTWARE_ENUMERATED = 0x02 +; NCF_HIDDEN = 0x08 +; NCF_NO_SERVICE = 0x10 +; NCF_HAS_UI = 0x80 +;----------------- Characteristics ------------ + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.ndi] + CopyFiles = @PRODUCT_TAP_WIN_COMPONENT_ID@.driver, @PRODUCT_TAP_WIN_COMPONENT_ID@.files + AddReg = @PRODUCT_TAP_WIN_COMPONENT_ID@.reg + AddReg = @PRODUCT_TAP_WIN_COMPONENT_ID@.params.reg + Characteristics = @PRODUCT_TAP_WIN_CHARACTERISTICS@ + *IfType = 0x6 ; IF_TYPE_ETHERNET_CSMACD + *MediaType = 0x0 ; NdisMedium802_3 + *PhysicalMediaType = 14 ; NdisPhysicalMedium802_3 + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.ndi.Services] + AddService = @PRODUCT_TAP_WIN_COMPONENT_ID@, 2, @PRODUCT_TAP_WIN_COMPONENT_ID@.service + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.reg] + HKR, Ndi, Service, 0, "@PRODUCT_TAP_WIN_COMPONENT_ID@" + HKR, Ndi\Interfaces, UpperRange, 0, "ndis5" + HKR, Ndi\Interfaces, LowerRange, 0, "ethernet" + HKR, , Manufacturer, 0, "%Provider%" + HKR, , ProductName, 0, "%DeviceDescription%" + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.params.reg] + HKR, Ndi\params\MTU, ParamDesc, 0, "MTU" + HKR, Ndi\params\MTU, Type, 0, "int" + HKR, Ndi\params\MTU, Default, 0, "1500" + HKR, Ndi\params\MTU, Optional, 0, "0" + HKR, Ndi\params\MTU, Min, 0, "100" + HKR, Ndi\params\MTU, Max, 0, "1500" + HKR, Ndi\params\MTU, Step, 0, "1" + HKR, Ndi\params\MediaStatus, ParamDesc, 0, "Media Status" + HKR, Ndi\params\MediaStatus, Type, 0, "enum" + HKR, Ndi\params\MediaStatus, Default, 0, "0" + HKR, Ndi\params\MediaStatus, Optional, 0, "0" + HKR, Ndi\params\MediaStatus\enum, "0", 0, "Application Controlled" + HKR, Ndi\params\MediaStatus\enum, "1", 0, "Always Connected" + HKR, Ndi\params\MAC, ParamDesc, 0, "MAC Address" + HKR, Ndi\params\MAC, Type, 0, "edit" + HKR, Ndi\params\MAC, Optional, 0, "1" + HKR, Ndi\params\AllowNonAdmin, ParamDesc, 0, "Non-Admin Access" + HKR, Ndi\params\AllowNonAdmin, Type, 0, "enum" + HKR, Ndi\params\AllowNonAdmin, Default, 0, "1" + HKR, Ndi\params\AllowNonAdmin, Optional, 0, "0" + HKR, Ndi\params\AllowNonAdmin\enum, "0", 0, "Not Allowed" + HKR, Ndi\params\AllowNonAdmin\enum, 
"1", 0, "Allowed" + +;---------------------------------------------------------------- +; Service Section +;---------------------------------------------------------------- + +;---------- Service Type ------------- +; SERVICE_KERNEL_DRIVER = 0x01 +; SERVICE_WIN32_OWN_PROCESS = 0x10 +;---------- Service Type ------------- + +;---------- Start Mode --------------- +; SERVICE_BOOT_START = 0x0 +; SERVICE_SYSTEM_START = 0x1 +; SERVICE_AUTO_START = 0x2 +; SERVICE_DEMAND_START = 0x3 +; SERVICE_DISABLED = 0x4 +;---------- Start Mode --------------- + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.service] + DisplayName = %DeviceDescription% + ServiceType = 1 + StartType = 3 + ErrorControl = 1 + LoadOrderGroup = NDIS + ServiceBinary = %12%\@PRODUCT_TAP_WIN_COMPONENT_ID@.sys + +;----------------------------------------------------------------- +; File Installation +;----------------------------------------------------------------- + +;----------------- Copy Flags ------------ +; COPYFLG_NOSKIP = 0x02 +; COPYFLG_NOVERSIONCHECK = 0x04 +;----------------- Copy Flags ------------ + +; SourceDisksNames +; diskid = description[, [tagfile] [, , subdir]] +; 1 = "Intel Driver Disk 1",e100bex.sys,, + +[SourceDisksNames] + 1 = %DeviceDescription%, @PRODUCT_TAP_WIN_COMPONENT_ID@.sys + +; SourceDisksFiles +; filename_on_source = diskID[, [subdir][, size]] +; e100bex.sys = 1,, ; on distribution disk 1 + +[SourceDisksFiles] +@PRODUCT_TAP_WIN_COMPONENT_ID@.sys = 1 + +[DestinationDirs] + @PRODUCT_TAP_WIN_COMPONENT_ID@.files = 11 + @PRODUCT_TAP_WIN_COMPONENT_ID@.driver = 12 + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.files] +; TapPanel.cpl,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK +; cipsrvr.exe,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +[@PRODUCT_TAP_WIN_COMPONENT_ID@.driver] + @PRODUCT_TAP_WIN_COMPONENT_ID@.sys,,,6 ; COPYFLG_NOSKIP | COPYFLG_NOVERSIONCHECK + +;--------------------------------------------------------------- +; End +;--------------------------------------------------------------- diff --git a/installer/tap/src/src/SOURCES.in b/installer/tap/src/src/SOURCES.in new file mode 100644 index 0000000..cf98d5f --- /dev/null +++ b/installer/tap/src/src/SOURCES.in @@ -0,0 +1,62 @@ +# Build TAP-Windows NDIS 6.0 driver. +# Build Command: build -cef + +MAJORCOMP=ntos +MINORCOMP=ndis + +TARGETNAME=@PRODUCT_TAP_WIN_COMPONENT_ID@ +TARGETTYPE=DRIVER +TARGETPATH=. + +TARGETLIBS=\ + $(DDK_LIB_PATH)\ndis.lib \ + $(DDK_LIB_PATH)\ntstrsafe.lib \ + $(DDK_LIB_PATH)\wdmsec.lib + +INCLUDES=$(DDK_INCLUDE_PATH) .. + +# System and NDIS wrapper definitions. +C_DEFINES=$(C_DEFINES) -DNDIS_MINIPORT_DRIVER=1 +C_DEFINES=$(C_DEFINES) -DNDIS61_MINIPORT=1 +C_DEFINES=$(C_DEFINES) -DNDIS_SUPPORT_NDIS61=1 +C_DEFINES=$(C_DEFINES) -DNDIS_WDM=1 + +# The TAP version numbers here must be >= +# PRODUCT_TAP_WIN32_MIN_x values defined in version.m4 +C_DEFINES=$(C_DEFINES) -DTAP_DRIVER_MAJOR_VERSION=@PRODUCT_TAP_WIN_MAJOR@ +C_DEFINES=$(C_DEFINES) -DTAP_DRIVER_MINOR_VERSION=@PRODUCT_TAP_WIN_MINOR@ + +# Produce the same symbolic information for both free & checked builds. +# This will allow us to perform full source-level debugging on both +# builds without affecting the free build's performance. 
+!IF "$(DDKBUILDENV)" != "chk" +NTDEBUGTYPE=both +USE_PDB=1 +!ELSE +NTDEBUGTYPE=both +USE_PDB=1 +!ENDIF + +# Generate a linker map file just in case we need one for debugging +LINKER_FLAGS=$(LINKER_FLAGS) /INCREMENTAL:NO /MAP /MAPINFO:EXPORTS + + +# MSC_WARNING_LEVEL=/W4 /WX + +# disabled warning 4201 -- nonstandard extension used : nameless struct/union +# disabled warning 4214 -- nonstandard extension used : bit field types other than int +# disabled warning 4127 -- conditional expression is constant +MSC_WARNING_LEVEL=$(MSC_WARNING_LEVEL) /wd4201 /wd4214 /wd4127 + +SOURCES=\ + tapdrvr.c \ + adapter.c \ + device.c \ + rxpath.c \ + txpath.c \ + oidrequest.c \ + mem.c \ + macinfo.c \ + error.c \ + dhcp.c \ + resource.rc diff --git a/installer/tap/src/src/adapter.c b/installer/tap/src/src/adapter.c new file mode 100644 index 0000000..2883b79 --- /dev/null +++ b/installer/tap/src/src/adapter.c @@ -0,0 +1,1717 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
+// + +#include "tap.h" + +NDIS_OID TAPSupportedOids[] = +{ + OID_GEN_HARDWARE_STATUS, + OID_GEN_TRANSMIT_BUFFER_SPACE, + OID_GEN_RECEIVE_BUFFER_SPACE, + OID_GEN_TRANSMIT_BLOCK_SIZE, + OID_GEN_RECEIVE_BLOCK_SIZE, + OID_GEN_VENDOR_ID, + OID_GEN_VENDOR_DESCRIPTION, + OID_GEN_VENDOR_DRIVER_VERSION, + OID_GEN_CURRENT_PACKET_FILTER, + OID_GEN_CURRENT_LOOKAHEAD, + OID_GEN_DRIVER_VERSION, + OID_GEN_MAXIMUM_TOTAL_SIZE, + OID_GEN_XMIT_OK, + OID_GEN_RCV_OK, + OID_GEN_STATISTICS, +#ifdef IMPLEMENT_OPTIONAL_OIDS + OID_GEN_TRANSMIT_QUEUE_LENGTH, // Optional +#endif // IMPLEMENT_OPTIONAL_OIDS + OID_GEN_LINK_PARAMETERS, + OID_GEN_INTERRUPT_MODERATION, + OID_GEN_MEDIA_SUPPORTED, + OID_GEN_MEDIA_IN_USE, + OID_GEN_MAXIMUM_SEND_PACKETS, + OID_GEN_XMIT_ERROR, + OID_GEN_RCV_ERROR, + OID_GEN_RCV_NO_BUFFER, + OID_802_3_PERMANENT_ADDRESS, + OID_802_3_CURRENT_ADDRESS, + OID_802_3_MULTICAST_LIST, + OID_802_3_MAXIMUM_LIST_SIZE, + OID_802_3_RCV_ERROR_ALIGNMENT, + OID_802_3_XMIT_ONE_COLLISION, + OID_802_3_XMIT_MORE_COLLISIONS, +#ifdef IMPLEMENT_OPTIONAL_OIDS + OID_802_3_XMIT_DEFERRED, // Optional + OID_802_3_XMIT_MAX_COLLISIONS, // Optional + OID_802_3_RCV_OVERRUN, // Optional + OID_802_3_XMIT_UNDERRUN, // Optional + OID_802_3_XMIT_HEARTBEAT_FAILURE, // Optional + OID_802_3_XMIT_TIMES_CRS_LOST, // Optional + OID_802_3_XMIT_LATE_COLLISIONS, // Optional + OID_PNP_CAPABILITIES, // Optional +#endif // IMPLEMENT_OPTIONAL_OIDS +}; + +//====================================================================== +// TAP NDIS 6 Miniport Callbacks +//====================================================================== + +// Returns with reference count initialized to one. +PTAP_ADAPTER_CONTEXT +tapAdapterContextAllocate( + __in NDIS_HANDLE MiniportAdapterHandle +) +{ + PTAP_ADAPTER_CONTEXT adapter = NULL; + + adapter = (PTAP_ADAPTER_CONTEXT )NdisAllocateMemoryWithTagPriority( + GlobalData.NdisDriverHandle, + sizeof(TAP_ADAPTER_CONTEXT), + TAP_ADAPTER_TAG, + NormalPoolPriority + ); + + if(adapter) + { + NET_BUFFER_LIST_POOL_PARAMETERS nblPoolParameters = {0}; + + NdisZeroMemory(adapter,sizeof(TAP_ADAPTER_CONTEXT)); + + adapter->MiniportAdapterHandle = MiniportAdapterHandle; + + // Initialize cancel-safe IRP queue + tapIrpCsqInitialize(&adapter->PendingReadIrpQueue); + + // Initialize TAP send packet queue. + tapPacketQueueInitialize(&adapter->SendPacketQueue); + + // Allocate the adapter lock. + NdisAllocateSpinLock(&adapter->AdapterLock); + + // NBL pool for making TAP receive indications. + NdisZeroMemory(&nblPoolParameters, sizeof(NET_BUFFER_LIST_POOL_PARAMETERS)); + + // Initialize event used to determine when all receive NBLs have been returned. + NdisInitializeEvent(&adapter->ReceiveNblInFlightCountZeroEvent); + + nblPoolParameters.Header.Type = NDIS_OBJECT_TYPE_DEFAULT; + nblPoolParameters.Header.Revision = NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1; + nblPoolParameters.Header.Size = NDIS_SIZEOF_NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1; + nblPoolParameters.ProtocolId = NDIS_PROTOCOL_ID_DEFAULT; + nblPoolParameters.ContextSize = 0; + //nblPoolParameters.ContextSize = sizeof(RX_NETBUFLIST_RSVD); + nblPoolParameters.fAllocateNetBuffer = TRUE; + nblPoolParameters.PoolTag = TAP_RX_NBL_TAG; + +#pragma warning( suppress : 28197 ) + adapter->ReceiveNblPool = NdisAllocateNetBufferListPool( + adapter->MiniportAdapterHandle, + &nblPoolParameters); + + if (adapter->ReceiveNblPool == NULL) + { + DEBUGP (("[TAP] Couldn't allocate adapter receive NBL pool\n")); + NdisFreeMemory(adapter,0,0); + } + + // Add initial reference. 
Normally removed in AdapterHalt. + adapter->RefCount = 1; + + // Safe for multiple removes. + NdisInitializeListHead(&adapter->AdapterListLink); + + // + // The miniport adapter is initially powered up + // + adapter->CurrentPowerState = NdisDeviceStateD0; + } + + return adapter; +} + +VOID +tapReadPermanentAddress( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in NDIS_HANDLE ConfigurationHandle, + __out MACADDR PermanentAddress + ) +{ + NDIS_STATUS status; + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING macKey = NDIS_STRING_CONST("MAC"); + ANSI_STRING macString; + BOOLEAN macFromRegistry = FALSE; + + // Read MAC parameter from registry. + NdisReadConfiguration( + &status, + &configParameter, + ConfigurationHandle, + &macKey, + NdisParameterString + ); + + if (status == NDIS_STATUS_SUCCESS) + { + if( (configParameter->ParameterType == NdisParameterString) + && (configParameter->ParameterData.StringData.Length >= 12) + ) + { + if (RtlUnicodeStringToAnsiString( + &macString, + &configParameter->ParameterData.StringData, + TRUE) == STATUS_SUCCESS + ) + { + macFromRegistry = ParseMAC (PermanentAddress, macString.Buffer); + RtlFreeAnsiString (&macString); + } + } + } + + if(!macFromRegistry) + { + // + // There is no (valid) address stashed in the registry parameter. + // + // Make up a dummy mac address based on the ANSI representation of the + // NetCfgInstanceId GUID. + // + GenerateRandomMac(PermanentAddress, MINIPORT_INSTANCE_ID(Adapter)); + } +} + +NDIS_STATUS +tapReadConfiguration( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + NDIS_CONFIGURATION_OBJECT configObject; + NDIS_HANDLE configHandle; + + DEBUGP (("[TAP] --> tapReadConfiguration\n")); + + // + // Setup defaults in case configuration cannot be opened. + // + Adapter->MtuSize = ETHERNET_MTU; + Adapter->MediaStateAlwaysConnected = FALSE; + Adapter->LogicalMediaState = FALSE; + Adapter->AllowNonAdmin = FALSE; + // + // Open the registry for this adapter to read advanced + // configuration parameters stored by the INF file. + // + NdisZeroMemory(&configObject, sizeof(configObject)); + + {C_ASSERT(sizeof(configObject) >= NDIS_SIZEOF_CONFIGURATION_OBJECT_REVISION_1);} + configObject.Header.Type = NDIS_OBJECT_TYPE_CONFIGURATION_OBJECT; + configObject.Header.Size = NDIS_SIZEOF_CONFIGURATION_OBJECT_REVISION_1; + configObject.Header.Revision = NDIS_CONFIGURATION_OBJECT_REVISION_1; + + configObject.NdisHandle = Adapter->MiniportAdapterHandle; + configObject.Flags = 0; + + status = NdisOpenConfigurationEx( + &configObject, + &configHandle + ); + + // Read on the opened configuration handle. + if(status == NDIS_STATUS_SUCCESS) + { + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING mkey = NDIS_STRING_CONST("NetCfgInstanceId"); + + // + // Read NetCfgInstanceId from the registry. + // ------------------------------------ + // NetCfgInstanceId is required to create device and associated + // symbolic link for the adapter device. + // + // NetCfgInstanceId is a GUID string provided by NDIS that identifies + // the adapter instance. An example is: + // + // NetCfgInstanceId={410EB49D-2381-4FE7-9B36-498E22619DF0} + // + // Other names are derived from NetCfgInstanceId. 
For example, MiniportName: + // + // MiniportName=\DEVICE\{410EB49D-2381-4FE7-9B36-498E22619DF0} + // + NdisReadConfiguration ( + &status, + &configParameter, + configHandle, + &mkey, + NdisParameterString + ); + + if (status == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterString + && configParameter->ParameterData.StringData.Length <= sizeof(Adapter->NetCfgInstanceIdBuffer) - sizeof(WCHAR)) + { + DEBUGP (("[TAP] NdisReadConfiguration (NetCfgInstanceId=%wZ)\n", + &configParameter->ParameterData.StringData )); + + // Save NetCfgInstanceId as UNICODE_STRING. + Adapter->NetCfgInstanceId.Length = Adapter->NetCfgInstanceId.MaximumLength + = configParameter->ParameterData.StringData.Length; + + Adapter->NetCfgInstanceId.Buffer = Adapter->NetCfgInstanceIdBuffer; + + NdisMoveMemory( + Adapter->NetCfgInstanceId.Buffer, + configParameter->ParameterData.StringData.Buffer, + Adapter->NetCfgInstanceId.Length + ); + + // Save NetCfgInstanceId as ANSI_STRING as well. + if (RtlUnicodeStringToAnsiString ( + &Adapter->NetCfgInstanceIdAnsi, + &configParameter->ParameterData.StringData, + TRUE) != STATUS_SUCCESS + ) + { + DEBUGP (("[TAP] NetCfgInstanceId ANSI name conversion failed\n")); + status = NDIS_STATUS_RESOURCES; + } + } + else + { + DEBUGP (("[TAP] NetCfgInstanceId has invalid type\n")); + status = NDIS_STATUS_INVALID_DATA; + } + } + else + { + DEBUGP (("[TAP] NetCfgInstanceId failed\n")); + status = NDIS_STATUS_INVALID_DATA; + } + + if (status == NDIS_STATUS_SUCCESS) + { + NDIS_STATUS localStatus; // Use default if these fail. + NDIS_CONFIGURATION_PARAMETER *configParameter; + NDIS_STRING mtuKey = NDIS_STRING_CONST("MTU"); + NDIS_STRING mediaStatusKey = NDIS_STRING_CONST("MediaStatus"); +#if ENABLE_NONADMIN + NDIS_STRING allowNonAdminKey = NDIS_STRING_CONST("AllowNonAdmin"); +#endif + + // Read MTU from the registry. + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &mtuKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + int mtu = configParameter->ParameterData.IntegerData; + + if(mtu == 0) + { + mtu = ETHERNET_MTU; + } + + // Sanity check + if (mtu < MINIMUM_MTU) + { + mtu = MINIMUM_MTU; + } + else if (mtu > MAXIMUM_MTU) + { + mtu = MAXIMUM_MTU; + } + + Adapter->MtuSize = mtu; + } + } + + DEBUGP (("[%s] Using MTU %d\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->MtuSize + )); + + // Read MediaStatus setting from registry. + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &mediaStatusKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + if(configParameter->ParameterData.IntegerData == 0) + { + // Connect state is appplication controlled. + DEBUGP(("[%s] Initial MediaConnectState: Application Controlled\n", + MINIPORT_INSTANCE_ID (Adapter))); + + Adapter->MediaStateAlwaysConnected = FALSE; + Adapter->LogicalMediaState = FALSE; + } + else + { + // Connect state is always connected. + DEBUGP(("[%s] Initial MediaConnectState: Always Connected\n", + MINIPORT_INSTANCE_ID (Adapter))); + + Adapter->MediaStateAlwaysConnected = TRUE; + Adapter->LogicalMediaState = TRUE; + } + } + } + + // Read MAC PermanentAddress setting from registry. 
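+            // tapReadPermanentAddress (defined above) parses a user-supplied
+            // "MAC" registry string via ParseMAC and, when none is present or
+            // valid, derives a stable address from the NetCfgInstanceId GUID
+            // with GenerateRandomMac.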
+ tapReadPermanentAddress( + Adapter, + configHandle, + Adapter->PermanentAddress + ); + + DEBUGP (("[%s] Using MAC PermanentAddress %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->PermanentAddress[0], + Adapter->PermanentAddress[1], + Adapter->PermanentAddress[2], + Adapter->PermanentAddress[3], + Adapter->PermanentAddress[4], + Adapter->PermanentAddress[5]) + ); + + // Now seed the current MAC address with the permanent address. + ETH_COPY_NETWORK_ADDRESS(Adapter->CurrentAddress, Adapter->PermanentAddress); + + DEBUGP (("[%s] Using MAC CurrentAddress %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x\n", + MINIPORT_INSTANCE_ID (Adapter), + Adapter->CurrentAddress[0], + Adapter->CurrentAddress[1], + Adapter->CurrentAddress[2], + Adapter->CurrentAddress[3], + Adapter->CurrentAddress[4], + Adapter->CurrentAddress[5]) + ); + + // Read optional AllowNonAdmin setting from registry. +#if ENABLE_NONADMIN + NdisReadConfiguration ( + &localStatus, + &configParameter, + configHandle, + &allowNonAdminKey, + NdisParameterInteger + ); + + if (localStatus == NDIS_STATUS_SUCCESS) + { + if (configParameter->ParameterType == NdisParameterInteger) + { + Adapter->AllowNonAdmin = TRUE; + } + } +#endif + } + + // Close the configuration handle. + NdisCloseConfiguration(configHandle); + } + else + { + DEBUGP (("[TAP] Couldn't open adapter registry\n")); + } + + DEBUGP (("[TAP] <-- tapReadConfiguration; status = %8.8X\n",status)); + + return status; +} + +VOID +tapAdapterContextAddToGlobalList( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LOCK_STATE lockState; + PLIST_ENTRY listEntry = &Adapter->AdapterListLink; + + // Acquire global adapter list lock. + NdisAcquireReadWriteLock( + &GlobalData.Lock, + TRUE, // Acquire for write + &lockState + ); + + // Adapter context should NOT be in any list. + ASSERT( (listEntry->Flink == listEntry) && (listEntry->Blink == listEntry ) ); + + // Add reference to persist until after removal. + tapAdapterContextReference(Adapter); + + // Add the adapter context to the global list. + InsertTailList(&GlobalData.AdapterList,&Adapter->AdapterListLink); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); +} + +VOID +tapAdapterContextRemoveFromGlobalList( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LOCK_STATE lockState; + + // Acquire global adapter list lock. + NdisAcquireReadWriteLock( + &GlobalData.Lock, + TRUE, // Acquire for write + &lockState + ); + + // Remove the adapter context from the global list. + RemoveEntryList(&Adapter->AdapterListLink); + + // Safe for multiple removes. + NdisInitializeListHead(&Adapter->AdapterListLink); + + // Remove reference added in tapAdapterContextAddToGlobalList. + tapAdapterContextDereference(Adapter); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); +} + +// Returns with added reference on adapter context. +PTAP_ADAPTER_CONTEXT +tapAdapterContextFromDeviceObject( + __in PDEVICE_OBJECT DeviceObject + ) +{ + LOCK_STATE lockState; + + // Acquire global adapter list lock. 
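+    // Unlike the add/remove routines above, which take this lock for write,
+    // the lookup only walks the list and so acquires shared (read) access.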
+ NdisAcquireReadWriteLock( + &GlobalData.Lock, + FALSE, // Acquire for read + &lockState + ); + + if (!IsListEmpty(&GlobalData.AdapterList)) + { + PLIST_ENTRY entry = GlobalData.AdapterList.Flink; + PTAP_ADAPTER_CONTEXT adapter; + + while (entry != &GlobalData.AdapterList) + { + adapter = CONTAINING_RECORD(entry, TAP_ADAPTER_CONTEXT, AdapterListLink); + + // Match on DeviceObject + if(adapter->DeviceObject == DeviceObject ) + { + // Add reference to adapter context. + tapAdapterContextReference(adapter); + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); + + return adapter; + } + + // Move to next entry + entry = entry->Flink; + } + } + + // Release global adapter list lock. + NdisReleaseReadWriteLock(&GlobalData.Lock,&lockState); + + return (PTAP_ADAPTER_CONTEXT )NULL; +} + +NDIS_STATUS +AdapterSetOptions( + __in NDIS_HANDLE NdisDriverHandle, + __in NDIS_HANDLE DriverContext + ) +/*++ +Routine Description: + + The MiniportSetOptions function registers optional handlers. For each + optional handler that should be registered, this function makes a call + to NdisSetOptionalHandlers. + + MiniportSetOptions runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + DriverContext The context handle + +Return Value: + + NDIS_STATUS_xxx code + +--*/ +{ + NDIS_STATUS status; + + DEBUGP (("[TAP] --> AdapterSetOptions\n")); + + // + // Set any optional handlers by filling out the appropriate struct and + // calling NdisSetOptionalHandlers here. + // + + status = NDIS_STATUS_SUCCESS; + + DEBUGP (("[TAP] <-- AdapterSetOptions; status = %8.8X\n",status)); + + return status; +} + +NDIS_STATUS +AdapterCreate( + __in NDIS_HANDLE MiniportAdapterHandle, + __in NDIS_HANDLE MiniportDriverContext, + __in PNDIS_MINIPORT_INIT_PARAMETERS MiniportInitParameters + ) +{ + PTAP_ADAPTER_CONTEXT adapter = NULL; + NDIS_STATUS status; + + UNREFERENCED_PARAMETER(MiniportDriverContext); + UNREFERENCED_PARAMETER(MiniportInitParameters); + + DEBUGP (("[TAP] --> AdapterCreate\n")); + + do + { + NDIS_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES regAttributes = {0}; + NDIS_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES genAttributes = {0}; + NDIS_PNP_CAPABILITIES pnpCapabilities = {0}; + + // + // Allocate adapter context structure and initialize all the + // memory resources for sending and receiving packets. + // + // Returns with reference count initialized to one. + // + adapter = tapAdapterContextAllocate(MiniportAdapterHandle); + + if(adapter == NULL) + { + DEBUGP (("[TAP] Couldn't allocate adapter memory\n")); + status = NDIS_STATUS_RESOURCES; + break; + } + + // Enter the Initializing state. + DEBUGP (("[TAP] Miniport State: Initializing\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportInitializingState; + tapAdapterReleaseLock(adapter,FALSE); + + // + // First read adapter configuration from registry. + // ----------------------------------------------- + // Subsequent device registration will fail if NetCfgInstanceId + // has not been successfully read. + // + status = tapReadConfiguration(adapter); + + // + // Set the registration attributes. 
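+        // Every NDIS object passed to NdisMSetMiniportAttributes starts with
+        // an NDIS_OBJECT_HEADER whose Type/Revision/Size triple must match
+        // the structure revision being claimed; the C_ASSERT below checks
+        // that the local variable is large enough for that revision.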
+        //
+        {C_ASSERT(sizeof(regAttributes) >= NDIS_SIZEOF_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1);}
+        regAttributes.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES;
+        regAttributes.Header.Size = NDIS_SIZEOF_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1;
+        regAttributes.Header.Revision = NDIS_MINIPORT_ADAPTER_REGISTRATION_ATTRIBUTES_REVISION_1;
+
+        regAttributes.MiniportAdapterContext = adapter;
+        regAttributes.AttributeFlags = TAP_ADAPTER_ATTRIBUTES_FLAGS;
+
+        regAttributes.CheckForHangTimeInSeconds = TAP_ADAPTER_CHECK_FOR_HANG_TIME_IN_SECONDS;
+        regAttributes.InterfaceType = TAP_INTERFACE_TYPE;
+
+        //NDIS_DECLARE_MINIPORT_ADAPTER_CONTEXT(TAP_ADAPTER_CONTEXT);
+        status = NdisMSetMiniportAttributes(
+                    MiniportAdapterHandle,
+                    (PNDIS_MINIPORT_ADAPTER_ATTRIBUTES)&regAttributes
+                    );
+
+        if (status != NDIS_STATUS_SUCCESS)
+        {
+            DEBUGP (("[TAP] NdisMSetMiniportAttributes failed; Status 0x%08x\n",status));
+            break;
+        }
+
+        //
+        // Next, set the general attributes.
+        //
+        {C_ASSERT(sizeof(genAttributes) >= NDIS_SIZEOF_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1);}
+        genAttributes.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES;
+        genAttributes.Header.Size = NDIS_SIZEOF_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1;
+        genAttributes.Header.Revision = NDIS_MINIPORT_ADAPTER_GENERAL_ATTRIBUTES_REVISION_1;
+
+        //
+        // Specify the medium type that the NIC can support but not
+        // necessarily the medium type that the NIC currently uses.
+        //
+        genAttributes.MediaType = TAP_MEDIUM_TYPE;
+
+        //
+        // Specify the medium type that the NIC currently uses.
+        //
+        genAttributes.PhysicalMediumType = TAP_PHYSICAL_MEDIUM;
+
+        //
+        // Specify the maximum network frame size, in bytes, that the NIC
+        // supports excluding the header.
+        //
+        genAttributes.MtuSize = TAP_FRAME_MAX_DATA_SIZE;
+        genAttributes.MaxXmitLinkSpeed = TAP_XMIT_SPEED;
+        genAttributes.XmitLinkSpeed = TAP_XMIT_SPEED;
+        genAttributes.MaxRcvLinkSpeed = TAP_RECV_SPEED;
+        genAttributes.RcvLinkSpeed = TAP_RECV_SPEED;
+
+        if(adapter->MediaStateAlwaysConnected)
+        {
+            DEBUGP(("[%s] Initial MediaConnectState: Connected\n",
+                MINIPORT_INSTANCE_ID (adapter)));
+
+            genAttributes.MediaConnectState = MediaConnectStateConnected;
+        }
+        else
+        {
+            DEBUGP(("[%s] Initial MediaConnectState: Disconnected\n",
+                MINIPORT_INSTANCE_ID (adapter)));
+
+            genAttributes.MediaConnectState = MediaConnectStateDisconnected;
+        }
+
+        genAttributes.MediaDuplexState = MediaDuplexStateFull;
+
+        //
+        // The maximum number of bytes the NIC can provide as lookahead data.
+        // If that value is different from the size of the lookahead buffer
+        // supported by bound protocols, NDIS will call MiniportOidRequest to
+        // set the size of the lookahead buffer provided by the miniport driver
+        // to the minimum of the miniport driver and protocol(s) values. If the
+        // driver always indicates up full packets with
+        // NdisMIndicateReceiveNetBufferLists, it should set this value to the
+        // maximum total frame size, which excludes the header.
+        //
+        // Upper-layer drivers examine lookahead data to determine whether a
+        // packet that is associated with the lookahead data is intended for
+        // one or more of their clients. If the underlying driver supports
+        // multipacket receive indications, bound protocols are given full net
+        // packets on every indication. Consequently, this value is identical
+        // to that returned for OID_GEN_RECEIVE_BLOCK_SIZE.
+ // + genAttributes.LookaheadSize = TAP_MAX_LOOKAHEAD; + genAttributes.MacOptions = TAP_MAC_OPTIONS; + genAttributes.SupportedPacketFilters = TAP_SUPPORTED_FILTERS; + + // + // The maximum number of multicast addresses the NIC driver can manage. + // This list is global for all protocols bound to (or above) the NIC. + // Consequently, a protocol can receive NDIS_STATUS_MULTICAST_FULL from + // the NIC driver when attempting to set the multicast address list, + // even if the number of elements in the given list is less than the + // number originally returned for this query. + // + genAttributes.MaxMulticastListSize = TAP_MAX_MCAST_LIST; + genAttributes.MacAddressLength = MACADDR_SIZE; + + // + // Return the MAC address of the NIC burnt in the hardware. + // + ETH_COPY_NETWORK_ADDRESS(genAttributes.PermanentMacAddress, adapter->PermanentAddress); + + // + // Return the MAC address the NIC is currently programmed to use. Note + // that this address could be different from the permananent address as + // the user can override using registry. Read NdisReadNetworkAddress + // doc for more info. + // + ETH_COPY_NETWORK_ADDRESS(genAttributes.CurrentMacAddress, adapter->CurrentAddress); + + genAttributes.RecvScaleCapabilities = NULL; + genAttributes.AccessType = TAP_ACCESS_TYPE; + genAttributes.DirectionType = TAP_DIRECTION_TYPE; + genAttributes.ConnectionType = TAP_CONNECTION_TYPE; + genAttributes.IfType = TAP_IFTYPE; + genAttributes.IfConnectorPresent = TAP_HAS_PHYSICAL_CONNECTOR; + genAttributes.SupportedStatistics = TAP_SUPPORTED_STATISTICS; + genAttributes.SupportedPauseFunctions = NdisPauseFunctionsUnsupported; // IEEE 802.3 pause frames + genAttributes.DataBackFillSize = 0; + genAttributes.ContextBackFillSize = 0; + + // + // The SupportedOidList is an array of OIDs for objects that the + // underlying driver or its NIC supports. Objects include general, + // media-specific, and implementation-specific objects. NDIS forwards a + // subset of the returned list to protocols that make this query. That + // is, NDIS filters any supported statistics OIDs out of the list + // because protocols never make statistics queries. + // + genAttributes.SupportedOidList = TAPSupportedOids; + genAttributes.SupportedOidListLength = sizeof(TAPSupportedOids); + genAttributes.AutoNegotiationFlags = NDIS_LINK_STATE_DUPLEX_AUTO_NEGOTIATED; + + // + // Set power management capabilities + // + NdisZeroMemory(&pnpCapabilities, sizeof(pnpCapabilities)); + pnpCapabilities.WakeUpCapabilities.MinMagicPacketWakeUp = NdisDeviceStateUnspecified; + pnpCapabilities.WakeUpCapabilities.MinPatternWakeUp = NdisDeviceStateUnspecified; + genAttributes.PowerManagementCapabilities = &pnpCapabilities; + + status = NdisMSetMiniportAttributes( + MiniportAdapterHandle, + (PNDIS_MINIPORT_ADAPTER_ATTRIBUTES)&genAttributes + ); + + if (status != NDIS_STATUS_SUCCESS) + { + DEBUGP (("[TAP] NdisMSetMiniportAttributes failed; Status 0x%08x\n",status)); + break; + } + + // + // Create the Win32 device I/O interface. + // + status = CreateTapDevice(adapter); + + if (status == NDIS_STATUS_SUCCESS) + { + // Add this adapter to the global adapter list. + tapAdapterContextAddToGlobalList(adapter); + } + else + { + DEBUGP (("[TAP] CreateTapDevice failed; Status 0x%08x\n",status)); + break; + } + } while(FALSE); + + if(status == NDIS_STATUS_SUCCESS) + { + // Enter the Paused state if initialization is complete. 
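+        // (On the failure path below, the initial reference taken in
+        // tapAdapterContextAllocate is dropped instead, which frees the
+        // adapter context and the resources allocated with it.)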
+ DEBUGP (("[TAP] Miniport State: Paused\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportPausedState; + tapAdapterReleaseLock(adapter,FALSE); + } + else + { + if(adapter != NULL) + { + DEBUGP (("[TAP] Miniport State: Halted\n")); + + // + // Remove reference when adapter context was allocated + // --------------------------------------------------- + // This should result in freeing adapter context memory + // and assiciated resources. + // + tapAdapterContextDereference(adapter); + adapter = NULL; + } + } + + DEBUGP (("[TAP] <-- AdapterCreate; status = %8.8X\n",status)); + + return status; +} + +VOID +AdapterHalt( + __in NDIS_HANDLE MiniportAdapterContext, + __in NDIS_HALT_ACTION HaltAction + ) +/*++ + +Routine Description: + + Halt handler is called when NDIS receives IRP_MN_STOP_DEVICE, + IRP_MN_SUPRISE_REMOVE or IRP_MN_REMOVE_DEVICE requests from the PNP + manager. Here, the driver should free all the resources acquired in + MiniportInitialize and stop access to the hardware. NDIS will not submit + any further request once this handler is invoked. + + 1) Free and unmap all I/O resources. + 2) Disable interrupt and deregister interrupt handler. + 3) Deregister shutdown handler regsitered by + NdisMRegisterAdapterShutdownHandler . + 4) Cancel all queued up timer callbacks. + 5) Finally wait indefinitely for all the outstanding receive + packets indicated to the protocol to return. + + MiniportHalt runs at IRQL = PASSIVE_LEVEL. + + +Arguments: + + MiniportAdapterContext Pointer to the Adapter + HaltAction The reason for halting the adapter + +Return Value: + + None. + +--*/ +{ + PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext; + + UNREFERENCED_PARAMETER(HaltAction); + + DEBUGP (("[TAP] --> AdapterHalt\n")); + + // Enter the Halted state. + DEBUGP (("[TAP] Miniport State: Halted\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportHaltedState; + tapAdapterReleaseLock(adapter,FALSE); + + // Remove this adapter from the global adapter list. + tapAdapterContextRemoveFromGlobalList(adapter); + + // BUGBUG!!! Call AdapterShutdownEx to do some of the work of stopping. + + // TODO!!! More... + + // + // Destroy the TAP Win32 device. + // + DestroyTapDevice(adapter); + + // + // Remove initial reference added in AdapterCreate. + // ------------------------------------------------ + // This should result in freeing adapter context memory + // and resources allocated in AdapterCreate. + // + tapAdapterContextDereference(adapter); + adapter = NULL; + + DEBUGP (("[TAP] <-- AdapterHalt\n")); +} + +VOID +tapWaitForReceiveNblInFlightCountZeroEvent( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG nblCount; + + // + // Wait until higher-level protocol has returned all NBLs + // to the driver. + // + + // Add one NBL "bias" to insure allow event to be reset safely. + nblCount = NdisInterlockedIncrement(&Adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + NdisResetEvent(&Adapter->ReceiveNblInFlightCountZeroEvent); + + // + // Now remove the bias and wait for the ReceiveNblInFlightCountZeroEvent + // if the count returned is not zero. 
+    //
+    nblCount = NdisInterlockedDecrement(&Adapter->ReceiveNblInFlightCount);
+    ASSERT(nblCount >= 0);
+
+    if(nblCount)
+    {
+        LARGE_INTEGER startTime, currentTime;
+
+        NdisGetSystemUpTimeEx(&startTime);
+
+        for (;;)
+        {
+            BOOLEAN waitResult = NdisWaitEvent(
+                &Adapter->ReceiveNblInFlightCountZeroEvent,
+                TAP_WAIT_POLL_LOOP_TIMEOUT
+                );
+
+            NdisGetSystemUpTimeEx(&currentTime);
+
+            if (waitResult)
+            {
+                break;
+            }
+
+            DEBUGP (("[%s] Waiting for %d in-flight receive NBLs to be returned.\n",
+                MINIPORT_INSTANCE_ID (Adapter),
+                Adapter->ReceiveNblInFlightCount
+                ));
+        }
+
+        DEBUGP (("[%s] Waited %d ms for all in-flight NBLs to be returned.\n",
+            MINIPORT_INSTANCE_ID (Adapter),
+            (currentTime.LowPart - startTime.LowPart)
+            ));
+    }
+}
+
+NDIS_STATUS
+AdapterPause(
+    __in  NDIS_HANDLE                       MiniportAdapterContext,
+    __in  PNDIS_MINIPORT_PAUSE_PARAMETERS   PauseParameters
+    )
+/*++
+
+Routine Description:
+
+    When a miniport receives a pause request, it enters into a Pausing state.
+    The miniport should not indicate up any more network data. Any pending
+    send requests must be completed, and new requests must be rejected with
+    NDIS_STATUS_PAUSED.
+
+    Once all sends have been completed and all receive NBLs have returned to
+    the miniport, the miniport enters the Paused state.
+
+    While paused, the miniport can still service interrupts from the hardware
+    (to, for example, continue to indicate NDIS_STATUS_MEDIA_CONNECT
+    notifications).
+
+    The miniport must continue to be able to handle status indications and OID
+    requests. MiniportPause is different from MiniportHalt because, in
+    general, the MiniportPause operation won't release any resources.
+    MiniportPause must not attempt to acquire any resources where allocation
+    can fail, since MiniportPause itself must not fail.
+
+    MiniportPause runs at IRQL = PASSIVE_LEVEL.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to the Adapter
+    PauseParameters         Additional information about the pause operation
+
+Return Value:
+
+    If the miniport is able to immediately enter the Paused state, it should
+    return NDIS_STATUS_SUCCESS.
+
+    If the miniport must wait for send completions or pending receive NBLs, it
+    should return NDIS_STATUS_PENDING now, and call NdisMPauseComplete when the
+    miniport has entered the Paused state.
+
+    No other return value is permitted. The pause operation must not fail.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS    status;
+
+    UNREFERENCED_PARAMETER(PauseParameters);
+
+    DEBUGP (("[TAP] --> AdapterPause\n"));
+
+    // Enter the Pausing state.
+    DEBUGP (("[TAP] Miniport State: Pausing\n"));
+
+    tapAdapterAcquireLock(adapter,FALSE);
+    adapter->Locked.AdapterState = MiniportPausingState;
+    tapAdapterReleaseLock(adapter,FALSE);
+
+    //
+    // Stop the flow of network data through the receive path
+    // ------------------------------------------------------
+    // In the Pausing and Paused state tapAdapterSendAndReceiveReady
+    // will prevent new calls to NdisMIndicateReceiveNetBufferLists
+    // to indicate additional receive NBLs to the host.
+    //
+    // However, there may be some in-flight NBLs owned by the driver
+    // that have been indicated to the host but have not yet been
+    // returned.
+    //
+    // Wait here for all in-flight receive indications to be returned.
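+    // The helper takes a temporary +1 "bias" reference so the event can be
+    // reset safely, drops it, and then polls NdisWaitEvent in
+    // TAP_WAIT_POLL_LOOP_TIMEOUT slices (logging progress) until the
+    // in-flight count reaches zero.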
+ // + tapWaitForReceiveNblInFlightCountZeroEvent(adapter); + + // + // Stop the flow of network data through the send path + // --------------------------------------------------- + // The initial implementation of the NDIS 6 send path follows the + // NDIS 5 pattern. Under this approach every send packet is copied + // into a driver-owned TAP_PACKET structure and the NBL owned by + // higher-level protocol is immediatly completed. + // + // With this deep-copy approach the driver never claims ownership + // of any send NBL. + // + // A future implementation may queue send NBLs and thereby eliminate + // the need for the unnecessary allocation and deep copy of each packet. + // + // So, nothing to do here for the send path for now... + + status = NDIS_STATUS_SUCCESS; + + // Enter the Paused state. + DEBUGP (("[TAP] Miniport State: Paused\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportPausedState; + tapAdapterReleaseLock(adapter,FALSE); + + DEBUGP (("[TAP] <-- AdapterPause; status = %8.8X\n",status)); + + return status; +} + +NDIS_STATUS +AdapterRestart( + __in NDIS_HANDLE MiniportAdapterContext, + __in PNDIS_MINIPORT_RESTART_PARAMETERS RestartParameters + ) +/*++ + +Routine Description: + + When a miniport receives a restart request, it enters into a Restarting + state. The miniport may begin indicating received data (e.g., using + NdisMIndicateReceiveNetBufferLists), handling status indications, and + processing OID requests in the Restarting state. However, no sends will be + requested while the miniport is in the Restarting state. + + Once the miniport is ready to send data, it has entered the Running state. + The miniport informs NDIS that it is in the Running state by returning + NDIS_STATUS_SUCCESS from this MiniportRestart function; or if this function + has already returned NDIS_STATUS_PENDING, by calling NdisMRestartComplete. + + + MiniportRestart runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + MiniportAdapterContext Pointer to the Adapter + RestartParameters Additional information about the restart operation + +Return Value: + + If the miniport is able to immediately enter the Running state, it should + return NDIS_STATUS_SUCCESS. + + If the miniport is still in the Restarting state, it should return + NDIS_STATUS_PENDING now, and call NdisMRestartComplete when the miniport + has entered the Running state. + + Other NDIS_STATUS codes indicate errors. If an error is encountered, the + miniport must return to the Paused state (i.e., stop indicating receives). + +--*/ +{ + PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext; + NDIS_STATUS status; + + UNREFERENCED_PARAMETER(RestartParameters); + + DEBUGP (("[TAP] --> AdapterRestart\n")); + + // Enter the Restarting state. + DEBUGP (("[TAP] Miniport State: Restarting\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportRestartingState; + tapAdapterReleaseLock(adapter,FALSE); + + status = NDIS_STATUS_SUCCESS; + + if(status == NDIS_STATUS_SUCCESS) + { + // Enter the Running state. + DEBUGP (("[TAP] Miniport State: Running\n")); + + tapAdapterAcquireLock(adapter,FALSE); + adapter->Locked.AdapterState = MiniportRunning; + tapAdapterReleaseLock(adapter,FALSE); + } + else + { + // Enter the Paused state if restart failed. 
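+        // (Unreachable in the current implementation: status is set to
+        // NDIS_STATUS_SUCCESS unconditionally above; the branch is kept to
+        // mirror the canonical restart pattern.)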
+        DEBUGP (("[TAP] Miniport State: Paused\n"));
+
+        tapAdapterAcquireLock(adapter,FALSE);
+        adapter->Locked.AdapterState = MiniportPausedState;
+        tapAdapterReleaseLock(adapter,FALSE);
+    }
+
+    DEBUGP (("[TAP] <-- AdapterRestart; status = %8.8X\n",status));
+
+    return status;
+}
+
+BOOLEAN
+tapAdapterReadAndWriteReady(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+/*++
+
+Routine Description:
+
+    This routine determines whether the adapter device interface can
+    accept read and write operations.
+
+Arguments:
+
+    Adapter              Pointer to our adapter context
+
+Return Value:
+
+    Returns TRUE if the adapter state allows it to queue IRPs passed to
+    the device read and write callbacks.
+--*/
+{
+    if(!Adapter->TapDeviceCreated)
+    {
+        // TAP device not created or is being destroyed.
+        return FALSE;
+    }
+
+    if(Adapter->TapFileObject == NULL)
+    {
+        // TAP application file object not open.
+        return FALSE;
+    }
+
+    if(!Adapter->TapFileIsOpen)
+    {
+        // TAP application file object may be closing.
+        return FALSE;
+    }
+
+    if(!Adapter->LogicalMediaState)
+    {
+        // Don't handle read/write if media not connected.
+        return FALSE;
+    }
+
+    if(Adapter->CurrentPowerState != NdisDeviceStateD0)
+    {
+        // Don't handle read/write if device is not fully powered.
+        return FALSE;
+    }
+
+    return TRUE;
+}
+
+NDIS_STATUS
+tapAdapterSendAndReceiveReady(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+/*++
+
+Routine Description:
+
+    This routine determines whether the adapter NDIS send and receive
+    paths are ready.
+
+    This routine examines various adapter state variables and returns
+    a value that indicates whether the adapter NDIS interfaces can
+    accept send packets or indicate receive packets.
+
+    In normal operation the adapter may temporarily enter and then exit
+    a not-ready condition. In particular, the adapter becomes not-ready
+    when in the Pausing/Paused states, but may become ready again when
+    Restarted.
+
+    Runs at IRQL <= DISPATCH_LEVEL
+
+Arguments:
+
+    Adapter              Pointer to our adapter context
+
+Return Value:
+
+    Returns NDIS_STATUS_SUCCESS if the adapter state allows it to
+    accept send packets and indicate receive packets.
+
+    Otherwise it returns an NDIS_STATUS value other than NDIS_STATUS_SUCCESS.
+    These status values can be used directly as the completion status for
+    packets that must be completed immediately in the send path.
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    //
+    // Check various state variables to insure adapter is ready.
+    //
+    tapAdapterAcquireLock(Adapter,FALSE);
+
+    if(!Adapter->LogicalMediaState)
+    {
+        status = NDIS_STATUS_MEDIA_DISCONNECTED;
+    }
+    else if(Adapter->CurrentPowerState != NdisDeviceStateD0)
+    {
+        status = NDIS_STATUS_LOW_POWER_STATE;
+    }
+    else if(Adapter->ResetInProgress)
+    {
+        status = NDIS_STATUS_RESET_IN_PROGRESS;
+    }
+    else
+    {
+        switch(Adapter->Locked.AdapterState)
+        {
+        case MiniportPausingState:
+        case MiniportPausedState:
+            status = NDIS_STATUS_PAUSED;
+            break;
+
+        case MiniportHaltedState:
+            status = NDIS_STATUS_INVALID_STATE;
+            break;
+
+        default:
+            status = NDIS_STATUS_SUCCESS;
+            break;
+        }
+    }
+
+    tapAdapterReleaseLock(Adapter,FALSE);
+
+    return status;
+}
+
+BOOLEAN
+AdapterCheckForHangEx(
+    __in  NDIS_HANDLE MiniportAdapterContext
+    )
+/*++
+
+Routine Description:
+
+    The MiniportCheckForHangEx handler is called to report the state of the
+    NIC, or to monitor the responsiveness of an underlying device driver.
+    This is an optional function.
+    If this handler is not specified, NDIS judges the driver unresponsive
+    when the driver holds MiniportQueryInformation or MiniportSetInformation
+    requests for a time-out interval (default 4 sec), and then calls the
+    driver's MiniportReset function.  A NIC driver's MiniportInitialize
+    function can extend NDIS's time-out interval by calling
+    NdisMSetAttributesEx to avoid unnecessary resets.
+
+    MiniportCheckForHangEx runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to our adapter
+
+Return Value:
+
+    TRUE    NDIS calls the driver's MiniportReset function.
+    FALSE   Everything is fine
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    //DEBUGP (("[TAP] --> AdapterCheckForHangEx\n"));
+
+    //DEBUGP (("[TAP] <-- AdapterCheckForHangEx; status = FALSE\n"));
+
+    return FALSE;   // Everything is fine
+}
+
+NDIS_STATUS
+AdapterReset(
+    __in   NDIS_HANDLE      MiniportAdapterContext,
+    __out PBOOLEAN          AddressingReset
+    )
+/*++
+
+Routine Description:
+
+    MiniportResetEx is required to issue a hardware reset to the NIC
+    and/or to reset the driver's software state.
+
+    1) The miniport driver can optionally complete any pending
+       OID requests. NDIS will submit no further OID requests
+       to the miniport driver for the NIC being reset until
+       the reset operation has finished. After the reset,
+       NDIS will resubmit to the miniport driver any OID requests
+       that were pending but not completed by the miniport driver
+       before the reset.
+
+    2) A deserialized miniport driver must complete any pending send
+       operations. NDIS will not requeue pending send packets for
+       a deserialized driver since NDIS does not maintain the send
+       queue for such a driver.
+
+    3) If MiniportReset returns NDIS_STATUS_PENDING, the driver must
+       complete the original request subsequently with a call to
+       NdisMResetComplete.
+
+    MiniportReset runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+AddressingReset - If multicast or functional addressing information
+                  or the lookahead size, is changed by a reset,
+                  MiniportReset must set the variable at AddressingReset
+                  to TRUE before it returns control. This causes NDIS to
+                  call the MiniportSetInformation function to restore
+                  the information.
+
+MiniportAdapterContext - Pointer to our adapter
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS status;
+
+    DEBUGP (("[TAP] --> AdapterReset\n"));
+
+    // Indicate that adapter reset is in progress.
+    adapter->ResetInProgress = TRUE;
+
+    // See note above...
+    *AddressingReset = FALSE;
+
+    // BUGBUG!!! TODO!!! Lots of work here...
+
+    // Indicate that adapter reset has completed.
+    adapter->ResetInProgress = FALSE;
+
+    status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] <-- AdapterReset; status = %8.8X\n",status));
+
+    return status;
+}
+
+VOID
+AdapterDevicePnpEventNotify(
+    __in NDIS_HANDLE                MiniportAdapterContext,
+    __in PNET_DEVICE_PNP_EVENT      NetDevicePnPEvent
+    )
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    DEBUGP (("[TAP] --> AdapterDevicePnpEventNotify\n"));
+
+/*
+    switch (NetDevicePnPEvent->DevicePnPEvent)
+    {
+        case NdisDevicePnPEventSurpriseRemoved:
+            //
+            // Called when NDIS receives IRP_MN_SURPRISE_REMOVAL.
+            // NDIS calls MiniportHalt function after this call returns.
+            //
+            MP_SET_FLAG(Adapter, fMP_ADAPTER_SURPRISE_REMOVED);
+            DEBUGP(MP_INFO, "[%p] MPDevicePnpEventNotify: NdisDevicePnPEventSurpriseRemoved\n", Adapter);
+            break;
+
+        case NdisDevicePnPEventPowerProfileChanged:
+            //
+            // After initializing a miniport driver and after miniport driver
+            // receives an OID_PNP_SET_POWER notification that specifies
+            // a device power state of NdisDeviceStateD0 (the powered-on state),
+            // NDIS calls the miniport's MiniportPnPEventNotify function with
+            // PnPEvent set to NdisDevicePnPEventPowerProfileChanged.
+            //
+            DEBUGP(MP_INFO, "[%p] MPDevicePnpEventNotify: NdisDevicePnPEventPowerProfileChanged\n", Adapter);
+
+            if (NetDevicePnPEvent->InformationBufferLength == sizeof(ULONG))
+            {
+                ULONG NdisPowerProfile = *((PULONG)NetDevicePnPEvent->InformationBuffer);
+
+                if (NdisPowerProfile == NdisPowerProfileBattery)
+                {
+                    DEBUGP(MP_INFO, "[%p] The host system is running on battery power\n", Adapter);
+                }
+                if (NdisPowerProfile == NdisPowerProfileAcOnLine)
+                {
+                    DEBUGP(MP_INFO, "[%p] The host system is running on AC power\n", Adapter);
+                }
+            }
+            break;
+
+        default:
+            DEBUGP(MP_ERROR, "[%p] MPDevicePnpEventNotify: unknown PnP event 0x%x\n", Adapter, NetDevicePnPEvent->DevicePnPEvent);
+    }
+*/
+    DEBUGP (("[TAP] <-- AdapterDevicePnpEventNotify\n"));
+}
+
+VOID
+AdapterShutdownEx(
+    __in NDIS_HANDLE                MiniportAdapterContext,
+    __in NDIS_SHUTDOWN_ACTION       ShutdownAction
+    )
+/*++
+
+Routine Description:
+
+    The MiniportShutdownEx handler restores hardware to its initial state when
+    the system is shut down, whether by the user or because an unrecoverable
+    system error occurred. This is to ensure that the NIC is in a known
+    state and ready to be reinitialized when the machine is rebooted after
+    a system shutdown occurs for any reason, including a crash dump.
+
+    Here just disable the interrupt and stop the DMA engine.  Do not free
+    memory resources or wait for any packet transfers to complete.  Do not call
+    into NDIS at this time.
+
+    This can be called at arbitrary IRQL, including in the context of a
+    bugcheck.
+
+Arguments:
+
+    MiniportAdapterContext  Pointer to our adapter
+    ShutdownAction  The reason why NDIS called the shutdown function
+
+Return Value:
+
+    None.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT   adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    UNREFERENCED_PARAMETER(ShutdownAction);
+
+    DEBUGP (("[TAP] --> AdapterShutdownEx\n"));
+
+    // Enter the Shutdown state.
+    DEBUGP (("[TAP] Miniport State: Shutdown\n"));
+
+    tapAdapterAcquireLock(adapter,FALSE);
+    adapter->Locked.AdapterState = MiniportShutdownState;
+    tapAdapterReleaseLock(adapter,FALSE);
+
+    //
+    // BUGBUG!!! FlushIrpQueues???
+    //
+
+    DEBUGP (("[TAP] <-- AdapterShutdownEx\n"));
+}
+
+
+// Free adapter context memory and associated resources.
+VOID
+tapAdapterContextFree(
+    __in PTAP_ADAPTER_CONTEXT     Adapter
+    )
+{
+    PLIST_ENTRY listEntry = &Adapter->AdapterListLink;
+
+    DEBUGP (("[TAP] --> tapAdapterContextFree\n"));
+
+    // Adapter context should already be removed.
+    ASSERT( (listEntry->Flink == listEntry) && (listEntry->Blink == listEntry ) );
+
+    // Insure that adapter context has been removed from global adapter list.
+    RemoveEntryList(&Adapter->AdapterListLink);
+
+    // Free the adapter lock.
+    NdisFreeSpinLock(&Adapter->AdapterLock);
+
+    // Free the ANSI NetCfgInstanceId buffer.
+    if(Adapter->NetCfgInstanceIdAnsi.Buffer != NULL)
+    {
+        RtlFreeAnsiString(&Adapter->NetCfgInstanceIdAnsi);
+    }
+
+    Adapter->NetCfgInstanceIdAnsi.Buffer = NULL;
+
+    // Free the receive NBL pool.
+    if(Adapter->ReceiveNblPool != NULL )
+    {
+        NdisFreeNetBufferListPool(Adapter->ReceiveNblPool);
+    }
+
+    Adapter->ReceiveNblPool = NULL;
+
+    NdisFreeMemory(Adapter,0,0);
+
+    DEBUGP (("[TAP] <-- tapAdapterContextFree\n"));
+}
+
+ULONG
+tapGetNetBufferFrameType(
+    __in PNET_BUFFER       NetBuffer
+    )
+/*++
+
+Routine Description:
+
+    Reads the network frame's destination address to determine the type
+    (broadcast, multicast, etc)
+
+    Runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    NetBuffer             The NB to examine
+
+Return Value:
+
+    NDIS_PACKET_TYPE_BROADCAST
+    NDIS_PACKET_TYPE_MULTICAST
+    NDIS_PACKET_TYPE_DIRECTED
+
+--*/
+{
+    PETH_HEADER   ethernetHeader;
+
+    ethernetHeader = (PETH_HEADER )NdisGetDataBuffer(
+                        NetBuffer,
+                        sizeof(ETH_HEADER),
+                        NULL,
+                        1,
+                        0
+                        );
+
+    ASSERT(ethernetHeader);
+
+    if (ETH_IS_BROADCAST(ethernetHeader->dest))
+    {
+        return NDIS_PACKET_TYPE_BROADCAST;
+    }
+    else if(ETH_IS_MULTICAST(ethernetHeader->dest))
+    {
+        return NDIS_PACKET_TYPE_MULTICAST;
+    }
+    else
+    {
+        return NDIS_PACKET_TYPE_DIRECTED;
+    }
+}
+
+ULONG
+tapGetNetBufferCountsFromNetBufferList(
+    __in PNET_BUFFER_LIST   NetBufferList,
+    __inout_opt PULONG      TotalByteCount  // Of all linked NBs
+    )
+/*++
+
+Routine Description:
+
+    Returns the number of net buffers linked to the net buffer list.
+
+    Optionally returns the total byte count of all net buffers linked
+    to the net buffer list.
+
+    Runs at IRQL <= DISPATCH_LEVEL.
+
+Arguments:
+
+    NetBufferList         The NBL to examine
+
+Return Value:
+
+    The number of net buffers linked to the net buffer list.
+
+--*/
+{
+    ULONG       netBufferCount = 0;
+    PNET_BUFFER currentNb;
+
+    if(TotalByteCount)
+    {
+        *TotalByteCount = 0;
+    }
+
+    currentNb = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+
+    while(currentNb)
+    {
+        ++netBufferCount;
+
+        if(TotalByteCount)
+        {
+            *TotalByteCount += NET_BUFFER_DATA_LENGTH(currentNb);
+        }
+
+        // Move to next NB
+        currentNb = NET_BUFFER_NEXT_NB(currentNb);
+    }
+
+    return netBufferCount;
+}
+
+VOID
+tapAdapterAcquireLock(
+    __in    PTAP_ADAPTER_CONTEXT    Adapter,
+    __in    BOOLEAN                 DispatchLevel
+    )
+{
+    ASSERT(!DispatchLevel || (DISPATCH_LEVEL == KeGetCurrentIrql()));
+
+    if (DispatchLevel)
+    {
+        NdisDprAcquireSpinLock(&Adapter->AdapterLock);
+    }
+    else
+    {
+        NdisAcquireSpinLock(&Adapter->AdapterLock);
+    }
+}
+
+VOID
+tapAdapterReleaseLock(
+    __in    PTAP_ADAPTER_CONTEXT    Adapter,
+    __in    BOOLEAN                 DispatchLevel
+    )
+{
+    ASSERT(!DispatchLevel || (DISPATCH_LEVEL == KeGetCurrentIrql()));
+
+    if (DispatchLevel)
+    {
+        NdisDprReleaseSpinLock(&Adapter->AdapterLock);
+    }
+    else
+    {
+        NdisReleaseSpinLock(&Adapter->AdapterLock);
+    }
+}
+
+
diff --git a/installer/tap/src/src/adapter.h b/installer/tap/src/src/adapter.h
new file mode 100644
index 0000000..2f09d12
--- /dev/null
+++ b/installer/tap/src/src/adapter.h
@@ -0,0 +1,346 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef __TAP_ADAPTER_CONTEXT_H_
+#define __TAP_ADAPTER_CONTEXT_H_
+
+// Memory allocation tags.
+#define TAP_ADAPTER_TAG             ((ULONG)'ApaT')     // "TapA"
+#define TAP_RX_NBL_TAG              ((ULONG)'RpaT')     // "TapR"
+#define TAP_RX_INJECT_BUFFER_TAG    ((ULONG)'IpaT')     // "TapI"
+
+#define TAP_MAX_NDIS_NAME_LENGTH    64  // 38 character GUID string plus extra..
+
+// TAP receive indication NBL flag definitions.
+#define TAP_RX_NBL_FLAGS                    NBL_FLAGS_MINIPORT_RESERVED
+#define TAP_RX_NBL_FLAGS_CLEAR_ALL(_NBL)    ((_NBL)->Flags &= ~TAP_RX_NBL_FLAGS)
+#define TAP_RX_NBL_FLAG_SET(_NBL, _F)       ((_NBL)->Flags |= ((_F) & TAP_RX_NBL_FLAGS))
+#define TAP_RX_NBL_FLAG_CLEAR(_NBL, _F)     ((_NBL)->Flags &= ~((_F) & TAP_RX_NBL_FLAGS))
+#define TAP_RX_NBL_FLAG_TEST(_NBL, _F)      (((_NBL)->Flags & ((_F) & TAP_RX_NBL_FLAGS)) != 0)
+
+#define TAP_RX_NBL_FLAGS_IS_P2P             0x00001000
+#define TAP_RX_NBL_FLAGS_IS_INJECTED        0x00002000
+
+// MSDN Ref: http://msdn.microsoft.com/en-us/library/windows/hardware/ff560490(v=vs.85).aspx
+typedef
+enum _TAP_MINIPORT_ADAPTER_STATE
+{
+    // The Halted state is the initial state of all adapters. When an
+    // adapter is in the Halted state, NDIS can call the driver's
+    // MiniportInitializeEx function to initialize the adapter.
+    MiniportHaltedState,
+
+    // In the Shutdown state, a system shutdown and restart must occur
+    // before the system can use the adapter again.
+    MiniportShutdownState,
+
+    // In the Initializing state, a miniport driver completes any
+    // operations that are required to initialize an adapter.
+    MiniportInitializingState,
+
+    // Entering the Paused state...
+    MiniportPausingState,
+
+    // In the Paused state, the adapter does not indicate received
+    // network data or accept send requests.
+    MiniportPausedState,
+
+    // In the Running state, a miniport driver performs send and
+    // receive processing for an adapter.
+    MiniportRunning,
+
+    // In the Restarting state, a miniport driver completes any
+    // operations that are required to restart send and receive
+    // operations for an adapter.
+    MiniportRestartingState
+} TAP_MINIPORT_ADAPTER_STATE, *PTAP_MINIPORT_ADAPTER_STATE;
+
+//
+// Each adapter managed by this driver has a TapAdapter struct.
+// ------------------------------------------------------------
+// Since there is a one-to-one relationship between adapter instances
+// and device instances this structure is the device extension as well.
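+//
+// The context is reference counted: tapAdapterContextReference and
+// tapAdapterContextDereference (declared below) adjust RefCount, and
+// tapAdapterContextFree releases the context when the count reaches zero.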
+//
+typedef struct _TAP_ADAPTER_CONTEXT
+{
+    LIST_ENTRY              AdapterListLink;
+
+    volatile LONG           RefCount;
+
+    NDIS_HANDLE             MiniportAdapterHandle;
+
+    NDIS_SPIN_LOCK          AdapterLock;    // Lock for protection of state and outstanding sends and recvs
+
+    //
+    // All fields that are protected by the AdapterLock are included
+    // in the Locked structure to remind us to take the Lock
+    // before accessing them :)
+    //
+    struct
+    {
+        TAP_MINIPORT_ADAPTER_STATE  AdapterState;
+    } Locked;
+
+    BOOLEAN                 ResetInProgress;
+
+    //
+    // NetCfgInstanceId as UNICODE_STRING
+    // ----------------------------------
+    // This is a GUID string provided by NDIS that identifies the adapter instance.
+    // An example is:
+    //
+    //    NetCfgInstanceId={410EB49D-2381-4FE7-9B36-498E22619DF0}
+    //
+    // Other names are derived from NetCfgInstanceId. For example, MiniportName:
+    //
+    //    MiniportName=\DEVICE\{410EB49D-2381-4FE7-9B36-498E22619DF0}
+    //
+    NDIS_STRING             NetCfgInstanceId;
+    WCHAR                   NetCfgInstanceIdBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+# define MINIPORT_INSTANCE_ID(a) ((a)->NetCfgInstanceIdAnsi.Buffer)
+    ANSI_STRING             NetCfgInstanceIdAnsi;   // Used occasionally
+
+    ULONG                   MtuSize;        // 1500 bytes (typical)
+
+    // TRUE if adapter should always be "connected" even when device node
+    // is not open by a userspace process.
+    //
+    // FALSE if connection state is application controlled.
+    BOOLEAN                 MediaStateAlwaysConnected;
+
+    // TRUE if device is "connected".
+    BOOLEAN                 LogicalMediaState;
+
+    NDIS_DEVICE_POWER_STATE CurrentPowerState;
+
+    BOOLEAN                 AllowNonAdmin;
+
+    MACADDR                 PermanentAddress;   // From registry, if available
+    MACADDR                 CurrentAddress;
+
+    // Device registration parameters from NdisRegisterDeviceEx.
+    NDIS_STRING             DeviceName;
+    WCHAR                   DeviceNameBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+    NDIS_STRING             LinkName;
+    WCHAR                   LinkNameBuffer[TAP_MAX_NDIS_NAME_LENGTH];
+
+    NDIS_HANDLE             DeviceHandle;
+    PDEVICE_OBJECT          DeviceObject;
+    BOOLEAN                 TapDeviceCreated;   // WAS: m_TapIsRunning
+
+    PFILE_OBJECT            TapFileObject;      // Exclusive access
+    BOOLEAN                 TapFileIsOpen;      // WAS: m_TapOpens
+    LONG                    TapFileOpenCount;   // WAS: m_NumTapOpens
+
+    // Cancel-Safe read IRP queue.
+    TAP_IRP_CSQ             PendingReadIrpQueue;
+
+    // Queue containing TAP packets representing host send NBs. These are
+    // waiting to be read by user-mode application.
+    TAP_PACKET_QUEUE        SendPacketQueue;
+
+    // NBL pool for making TAP receive indications.
+    NDIS_HANDLE             ReceiveNblPool;
+
+    volatile LONG           ReceiveNblInFlightCount;
+#define TAP_WAIT_POLL_LOOP_TIMEOUT      3000    // 3 seconds
+    NDIS_EVENT              ReceiveNblInFlightCountZeroEvent;
+
+    // Info for point-to-point mode
+    BOOLEAN                 m_tun;
+    IPADDR                  m_localIP;
+    IPADDR                  m_remoteNetwork;
+    IPADDR                  m_remoteNetmask;
+    ETH_HEADER              m_TapToUser;
+    ETH_HEADER              m_UserToTap;
+    ETH_HEADER              m_UserToTap_IPv6;   // same as UserToTap but proto=ipv6
+
+    // Info for DHCP server masquerade
+    BOOLEAN                 m_dhcp_enabled;
+    IPADDR                  m_dhcp_addr;
+    ULONG                   m_dhcp_netmask;
+    IPADDR                  m_dhcp_server_ip;
+    BOOLEAN                 m_dhcp_server_arp;
+    MACADDR                 m_dhcp_server_mac;
+    ULONG                   m_dhcp_lease_time;
+    UCHAR                   m_dhcp_user_supplied_options_buffer[DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE];
+    ULONG                   m_dhcp_user_supplied_options_buffer_len;
+    BOOLEAN                 m_dhcp_received_discover;
+    ULONG                   m_dhcp_bad_requests;
+
+    // Multicast list. Fixed size.
+ ULONG ulMCListSize; + UCHAR MCList[TAP_MAX_MCAST_LIST][MACADDR_SIZE]; + + ULONG PacketFilter; + ULONG ulLookahead; + + // + // Statistics + // ------------------------------------------------------------------------- + // + + // Packet counts + ULONG64 FramesRxDirected; + ULONG64 FramesRxMulticast; + ULONG64 FramesRxBroadcast; + ULONG64 FramesTxDirected; + ULONG64 FramesTxMulticast; + ULONG64 FramesTxBroadcast; + + // Byte counts + ULONG64 BytesRxDirected; + ULONG64 BytesRxMulticast; + ULONG64 BytesRxBroadcast; + ULONG64 BytesTxDirected; + ULONG64 BytesTxMulticast; + ULONG64 BytesTxBroadcast; + + // Count of transmit errors + ULONG TxAbortExcessCollisions; + ULONG TxLateCollisions; + ULONG TxDmaUnderrun; + ULONG TxLostCRS; + ULONG TxOKButDeferred; + ULONG OneRetry; + ULONG MoreThanOneRetry; + ULONG TotalRetries; + ULONG TransmitFailuresOther; + + // Count of receive errors + ULONG RxCrcErrors; + ULONG RxAlignmentErrors; + ULONG RxResourceErrors; + ULONG RxDmaOverrunErrors; + ULONG RxCdtFrames; + ULONG RxRuntErrors; + +#if PACKET_TRUNCATION_CHECK + LONG m_RxTrunc, m_TxTrunc; +#endif + + BOOLEAN m_InterfaceIsRunning; + LONG m_Rx, m_RxErr; + NDIS_MEDIUM m_Medium; + + // Help to tear down the adapter by keeping + // some state information on allocated + // resources. + BOOLEAN m_CalledAdapterFreeResources; + BOOLEAN m_RegisteredAdapterShutdownHandler; + +} TAP_ADAPTER_CONTEXT, *PTAP_ADAPTER_CONTEXT; + +FORCEINLINE +LONG +tapAdapterContextReference( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG refCount = NdisInterlockedIncrement(&Adapter->RefCount); + + ASSERT(refCount>1); // Cannot dereference a zombie. + + return refCount; +} + +VOID +tapAdapterContextFree( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +FORCEINLINE +LONG +tapAdapterContextDereference( + IN PTAP_ADAPTER_CONTEXT Adapter + ) +{ + LONG refCount = NdisInterlockedDecrement(&Adapter->RefCount); + ASSERT(refCount >= 0); + if (!refCount) + { + tapAdapterContextFree(Adapter); + } + + return refCount; +} + +VOID +tapAdapterAcquireLock( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN DispatchLevel + ); + +VOID +tapAdapterReleaseLock( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN DispatchLevel + ); + +// Returns with added reference on adapter context. 
+PTAP_ADAPTER_CONTEXT +tapAdapterContextFromDeviceObject( + __in PDEVICE_OBJECT DeviceObject + ); + +BOOLEAN +tapAdapterReadAndWriteReady( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +NDIS_STATUS +tapAdapterSendAndReceiveReady( + __in PTAP_ADAPTER_CONTEXT Adapter + ); + +ULONG +tapGetNetBufferFrameType( + __in PNET_BUFFER NetBuffer + ); + +ULONG +tapGetNetBufferCountsFromNetBufferList( + __in PNET_BUFFER_LIST NetBufferList, + __inout_opt PULONG TotalByteCount // Of all linked NBs + ); + +// Prototypes for standard NDIS miniport entry points +MINIPORT_SET_OPTIONS AdapterSetOptions; +MINIPORT_INITIALIZE AdapterCreate; +MINIPORT_HALT AdapterHalt; +MINIPORT_UNLOAD TapDriverUnload; +MINIPORT_PAUSE AdapterPause; +MINIPORT_RESTART AdapterRestart; +MINIPORT_OID_REQUEST AdapterOidRequest; +MINIPORT_SEND_NET_BUFFER_LISTS AdapterSendNetBufferLists; +MINIPORT_RETURN_NET_BUFFER_LISTS AdapterReturnNetBufferLists; +MINIPORT_CANCEL_SEND AdapterCancelSend; +MINIPORT_CHECK_FOR_HANG AdapterCheckForHangEx; +MINIPORT_RESET AdapterReset; +MINIPORT_DEVICE_PNP_EVENT_NOTIFY AdapterDevicePnpEventNotify; +MINIPORT_SHUTDOWN AdapterShutdownEx; +MINIPORT_CANCEL_OID_REQUEST AdapterCancelOidRequest; + +#endif // __TAP_ADAPTER_CONTEXT_H_ \ No newline at end of file diff --git a/installer/tap/src/src/config.h.in b/installer/tap/src/src/config.h.in new file mode 100644 index 0000000..322afa8 --- /dev/null +++ b/installer/tap/src/src/config.h.in @@ -0,0 +1,9 @@ +#define PRODUCT_NAME "@PRODUCT_NAME@" +#define PRODUCT_VERSION "@PRODUCT_VERSION@" +#define PRODUCT_VERSION_RESOURCE @PRODUCT_VERSION_RESOURCE@ +#define PRODUCT_TAP_WIN_COMPONENT_ID "@PRODUCT_TAP_WIN_COMPONENT_ID@" +#define PRODUCT_TAP_WIN_MAJOR @PRODUCT_TAP_WIN_MAJOR@ +#define PRODUCT_TAP_WIN_MINOR @PRODUCT_TAP_WIN_MINOR@ +#define PRODUCT_TAP_WIN_PROVIDER "@PRODUCT_TAP_WIN_PROVIDER@" +#define PRODUCT_TAP_WIN_DEVICE_DESCRIPTION "@PRODUCT_TAP_WIN_DEVICE_DESCRIPTION@" +#define PRODUCT_TAP_WIN_RELDATE "@PRODUCT_TAP_WIN_RELDATE@" diff --git a/installer/tap/src/src/constants.h b/installer/tap/src/src/constants.h new file mode 100644 index 0000000..31b2d54 --- /dev/null +++ b/installer/tap/src/src/constants.h @@ -0,0 +1,195 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//==================================================================== +// Product and Version public settings +//==================================================================== + +#define PRODUCT_STRING PRODUCT_TAP_DEVICE_DESCRIPTION + + +// +// Update the driver version number every time you release a new driver +// The high word is the major version. The low word is the minor version. +// Also make sure that VER_FILEVERSION specified in the .RC file also +// matches with the driver version because NDISTESTER checks for that. +// +#ifndef TAP_DRIVER_MAJOR_VERSION + +#define TAP_DRIVER_MAJOR_VERSION 0x04 +#define TAP_DRIVER_MINOR_VERSION 0x02 + +#endif + +#define TAP_DRIVER_VENDOR_VERSION ((TAP_DRIVER_MAJOR_VERSION << 16) | TAP_DRIVER_MINOR_VERSION) + +// +// Define the NDIS miniport interface version that this driver targets. +// +#if defined(NDIS60_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 0 +#elif defined(NDIS61_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 1 +#elif defined(NDIS620_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 20 +#elif defined(NDIS630_MINIPORT) +# define TAP_NDIS_MAJOR_VERSION 6 +# define TAP_NDIS_MINOR_VERSION 30 +#else +#define TAP_NDIS_MAJOR_VERSION 5 +#define TAP_NDIS_MINOR_VERSION 0 +#endif + +//=========================================================== +// Driver constants +//=========================================================== + +#define ETHERNET_HEADER_SIZE (sizeof (ETH_HEADER)) +#define ETHERNET_MTU 1500 +#define ETHERNET_PACKET_SIZE (ETHERNET_MTU + ETHERNET_HEADER_SIZE) +#define DEFAULT_PACKET_LOOKAHEAD (ETHERNET_PACKET_SIZE) +#define VLAN_TAG_SIZE 4 + +//=========================================================== +// Medium properties +//=========================================================== + +#define TAP_FRAME_HEADER_SIZE ETHERNET_HEADER_SIZE +#define TAP_FRAME_MAX_DATA_SIZE ETHERNET_MTU +#define TAP_MAX_FRAME_SIZE (TAP_FRAME_HEADER_SIZE + TAP_FRAME_MAX_DATA_SIZE) +#define TAP_MIN_FRAME_SIZE 60 + +#define TAP_MEDIUM_TYPE NdisMedium802_3 + +//=========================================================== +// Physical adapter properties +//=========================================================== + +// The bus that connects the adapter to the PC. +// (Example: PCI adapters should use NdisInterfacePci). +#define TAP_INTERFACE_TYPE NdisInterfaceInternal + +#define TAP_VENDOR_DESC PRODUCT_TAP_WIN_DEVICE_DESCRIPTION + +// Highest byte is the NIC byte plus three vendor bytes. This is normally +// obtained from the NIC. +#define TAP_VENDOR_ID 0x00FFFFFF + +// If you have physical hardware on 802.3, use NdisPhysicalMedium802_3. +#define TAP_PHYSICAL_MEDIUM NdisPhysicalMediumUnspecified + +// Claim to be 100mbps duplex +#define MEGABITS_PER_SECOND 1000000ULL +#define TAP_XMIT_SPEED (100ULL*MEGABITS_PER_SECOND) +#define TAP_RECV_SPEED (100ULL*MEGABITS_PER_SECOND) + +// Max number of multicast addresses supported in hardware +#define TAP_MAX_MCAST_LIST 32 + +#define TAP_MAX_LOOKAHEAD TAP_FRAME_MAX_DATA_SIZE +#define TAP_BUFFER_SIZE TAP_MAX_FRAME_SIZE + +// Set this value to TRUE if there is a physical adapter. 
+#define TAP_HAS_PHYSICAL_CONNECTOR FALSE +#define TAP_ACCESS_TYPE NET_IF_ACCESS_BROADCAST +#define TAP_DIRECTION_TYPE NET_IF_DIRECTION_SENDRECEIVE +#define TAP_CONNECTION_TYPE NET_IF_CONNECTION_DEDICATED + +// This value must match the *IfType in the driver .inf file +#define TAP_IFTYPE IF_TYPE_ETHERNET_CSMACD + +// +// This is a virtual device, so it can tolerate surprise removal and +// suspend. Ensure the correct flags are set for your hardware. +// +#define TAP_ADAPTER_ATTRIBUTES_FLAGS (\ + NDIS_MINIPORT_ATTRIBUTES_SURPRISE_REMOVE_OK | NDIS_MINIPORT_ATTRIBUTES_NDIS_WDM) + +#define TAP_SUPPORTED_FILTERS ( \ + NDIS_PACKET_TYPE_DIRECTED | \ + NDIS_PACKET_TYPE_MULTICAST | \ + NDIS_PACKET_TYPE_BROADCAST | \ + NDIS_PACKET_TYPE_ALL_LOCAL | \ + NDIS_PACKET_TYPE_PROMISCUOUS | \ + NDIS_PACKET_TYPE_ALL_MULTICAST) + +#define TAP_MAX_MCAST_LIST 32 // Max length of multicast address list + +// +// Specify a bitmask that defines optional properties of the NIC. +// This miniport indicates receive with NdisMIndicateReceiveNetBufferLists +// function. Such a driver should set this NDIS_MAC_OPTION_TRANSFERS_NOT_PEND +// flag. +// +// NDIS_MAC_OPTION_NO_LOOPBACK tells NDIS that NIC has no internal +// loopback support so NDIS will manage loopbacks on behalf of +// this driver. +// +// NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA tells the protocol that +// our receive buffer is not on a device-specific card. If +// NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA is not set, multi-buffer +// indications are copied to a single flat buffer. +// + +#define TAP_MAC_OPTIONS (\ + NDIS_MAC_OPTION_COPY_LOOKAHEAD_DATA | \ + NDIS_MAC_OPTION_TRANSFERS_NOT_PEND | \ + NDIS_MAC_OPTION_NO_LOOPBACK) + +#define TAP_ADAPTER_CHECK_FOR_HANG_TIME_IN_SECONDS 4 + + +// NDIS 6.x miniports must support all counters in OID_GEN_STATISTICS. +#define TAP_SUPPORTED_STATISTICS (\ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_FRAMES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_RCV_DISCARDS | \ + NDIS_STATISTICS_FLAGS_VALID_RCV_ERROR | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_FRAMES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_XMIT_ERROR | \ + NDIS_STATISTICS_FLAGS_VALID_XMIT_DISCARDS | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_BYTES_RCV | \ + NDIS_STATISTICS_FLAGS_VALID_DIRECTED_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_MULTICAST_BYTES_XMIT | \ + NDIS_STATISTICS_FLAGS_VALID_BROADCAST_BYTES_XMIT) + + +#define MINIMUM_MTU 576 // USE TCP Minimum MTU +#define MAXIMUM_MTU 65536 // IP maximum MTU + +#define PACKET_QUEUE_SIZE 64 // tap -> userspace queue size +#define IRP_QUEUE_SIZE 16 // max number of simultaneous i/o operations from userspace +#define INJECT_QUEUE_SIZE 16 // DHCP/ARP -> tap injection queue + +#define TAP_LITTLE_ENDIAN // affects ntohs, htonl, etc. functions diff --git a/installer/tap/src/src/device.c b/installer/tap/src/src/device.c new file mode 100644 index 0000000..2b7ba9b --- /dev/null +++ b/installer/tap/src/src/device.c @@ -0,0 +1,1169 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. 
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//
+// Include files.
+//
+
+#include "tap.h"
+#include <wdmsec.h> // for SDDLs
+
+//======================================================================
+// TAP Win32 Device I/O Callbacks
+//======================================================================
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( PAGE, TapDeviceCreate)
+#pragma alloc_text( PAGE, TapDeviceControl)
+#pragma alloc_text( PAGE, TapDeviceCleanup)
+#pragma alloc_text( PAGE, TapDeviceClose)
+#endif // ALLOC_PRAGMA
+
+//===================================================================
+// Go back to default TAP mode from Point-To-Point mode.
+// Also reset (i.e. disable) DHCP Masq mode.
+//===================================================================
+VOID tapResetAdapterState(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+{
+    // Point-To-Point
+    Adapter->m_tun = FALSE;
+    Adapter->m_localIP = 0;
+    Adapter->m_remoteNetwork = 0;
+    Adapter->m_remoteNetmask = 0;
+    NdisZeroMemory (&Adapter->m_TapToUser, sizeof (Adapter->m_TapToUser));
+    NdisZeroMemory (&Adapter->m_UserToTap, sizeof (Adapter->m_UserToTap));
+    NdisZeroMemory (&Adapter->m_UserToTap_IPv6, sizeof (Adapter->m_UserToTap_IPv6));
+
+    // DHCP Masq
+    Adapter->m_dhcp_enabled = FALSE;
+    Adapter->m_dhcp_server_arp = FALSE;
+    Adapter->m_dhcp_user_supplied_options_buffer_len = 0;
+    Adapter->m_dhcp_addr = 0;
+    Adapter->m_dhcp_netmask = 0;
+    Adapter->m_dhcp_server_ip = 0;
+    Adapter->m_dhcp_lease_time = 0;
+    Adapter->m_dhcp_received_discover = FALSE;
+    Adapter->m_dhcp_bad_requests = 0;
+    NdisZeroMemory (Adapter->m_dhcp_server_mac, MACADDR_SIZE);
+}
+
+// IRP_MJ_CREATE
+NTSTATUS
+TapDeviceCreate(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+/*++
+
+Routine Description:
+
+    This routine is called by the I/O system when the TAP device is opened.
+    It enforces exclusive access to the device, resets the per-open adapter
+    state, and completes the request.
+
+Arguments:
+
+    DeviceObject - a pointer to the object that represents the device
+    that I/O is to be done on.
+
+    Irp - a pointer to the I/O Request Packet for this request.
+
+Return Value:
+
+    NT status code
+
+--*/
+{
+    NDIS_STATUS             status;
+    PIO_STACK_LOCATION      irpSp;  // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT    adapter = NULL;
+    PFILE_OBJECT            originalFileObject;
+
+    PAGED_CODE();
+
+    DEBUGP (("[TAP] --> TapDeviceCreate\n"));
+
+    irpSp = IoGetCurrentIrpStackLocation(Irp);
+
+    //
+    // Invalidate file context
+    //
+    irpSp->FileObject->FsContext = NULL;
+    irpSp->FileObject->FsContext2 = NULL;
+
+    //
+    // Find adapter context for this device.
+    // -------------------------------------
+    // Returns with added reference on adapter context.
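+    // The reference taken here is paired with a dereference on the
+    // failure path below; on a successful open it is held until the
+    // handle is finally closed and released in TapDeviceClose.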
+ // + adapter = tapAdapterContextFromDeviceObject(DeviceObject); + + // Insure that adapter exists. + ASSERT(adapter); + + if(adapter == NULL ) + { + DEBUGP (("[TAP] release [%d.%d] open request; adapter not found\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION + )); + + Irp->IoStatus.Status = STATUS_DEVICE_DOES_NOT_EXIST; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + return STATUS_DEVICE_DOES_NOT_EXIST; + } + + DEBUGP(("[%s] [TAP] release [%d.%d] open request (TapFileIsOpen=%d)\n", + MINIPORT_INSTANCE_ID(adapter), + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + adapter->TapFileIsOpen + )); + + // Enforce exclusive access + originalFileObject = InterlockedCompareExchangePointer( + &adapter->TapFileObject, + irpSp->FileObject, + NULL + ); + + if(originalFileObject == NULL) + { + irpSp->FileObject->FsContext = adapter; // Quick reference + + status = STATUS_SUCCESS; + } + else + { + status = STATUS_UNSUCCESSFUL; + } + + // Release the lock. + //tapAdapterReleaseLock(adapter,FALSE); + + if(status == STATUS_SUCCESS) + { + // Reset adapter state on successful open. + tapResetAdapterState(adapter); + + adapter->TapFileIsOpen = 1; // Legacy... + + // NOTE!!! Reference added by tapAdapterContextFromDeviceObject + // will be removed when file is closed. + } + else + { + DEBUGP (("[%s] TAP is presently unavailable (TapFileIsOpen=%d)\n", + MINIPORT_INSTANCE_ID(adapter), adapter->TapFileIsOpen + )); + + NOTE_ERROR(); + + // Remove reference added by tapAdapterContextFromDeviceObject. + tapAdapterContextDereference(adapter); + } + + // Complete the IRP. + Irp->IoStatus.Status = status; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + DEBUGP (("[TAP] <-- TapDeviceCreate; status = %8.8X\n",status)); + + return status; +} + +//=================================================== +// Tell Windows whether the TAP device should be +// considered "connected" or "disconnected". +// +// Allows application control of media connect state. +//=================================================== +VOID +tapSetMediaConnectStatus( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in BOOLEAN LogicalMediaState + ) +{ + NDIS_STATUS_INDICATION statusIndication; + NDIS_LINK_STATE linkState; + + NdisZeroMemory(&statusIndication, sizeof(NDIS_STATUS_INDICATION)); + NdisZeroMemory(&linkState, sizeof(NDIS_LINK_STATE)); + + // + // Fill in object headers + // + statusIndication.Header.Type = NDIS_OBJECT_TYPE_STATUS_INDICATION; + statusIndication.Header.Revision = NDIS_STATUS_INDICATION_REVISION_1; + statusIndication.Header.Size = sizeof(NDIS_STATUS_INDICATION); + + linkState.Header.Revision = NDIS_LINK_STATE_REVISION_1; + linkState.Header.Type = NDIS_OBJECT_TYPE_DEFAULT; + linkState.Header.Size = sizeof(NDIS_LINK_STATE); + + // + // Link state buffer + // + if(Adapter->LogicalMediaState == TRUE) + { + linkState.MediaConnectState = MediaConnectStateConnected; + } + + linkState.MediaDuplexState = MediaDuplexStateFull; + linkState.RcvLinkSpeed = TAP_RECV_SPEED; + linkState.XmitLinkSpeed = TAP_XMIT_SPEED; + + // + // Fill in the status buffer + // + statusIndication.StatusCode = NDIS_STATUS_LINK_STATE; + statusIndication.SourceHandle = Adapter->MiniportAdapterHandle; + statusIndication.DestinationHandle = NULL; + statusIndication.RequestId = 0; + + statusIndication.StatusBuffer = &linkState; + statusIndication.StatusBufferSize = sizeof(NDIS_LINK_STATE); + + // Fill in new media connect state. 
+ if ( (Adapter->LogicalMediaState != LogicalMediaState) && !Adapter->MediaStateAlwaysConnected) + { + Adapter->LogicalMediaState = LogicalMediaState; + + if (LogicalMediaState == TRUE) + { + linkState.MediaConnectState = MediaConnectStateConnected; + + DEBUGP (("[TAP] Set MediaConnectState: Connected.\n")); + } + else + { + linkState.MediaConnectState = MediaConnectStateDisconnected; + + DEBUGP (("[TAP] Set MediaConnectState: Disconnected.\n")); + } + } + + // Make the status indication. + if(Adapter->Locked.AdapterState != MiniportHaltedState) + { + NdisMIndicateStatusEx(Adapter->MiniportAdapterHandle, &statusIndication); + } +} + +//====================================================== +// If DHCP mode is used together with tun +// mode, consider the fact that the P2P remote subnet +// might enclose the DHCP masq server address. +//====================================================== +VOID +CheckIfDhcpAndTunMode ( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + if (Adapter->m_tun && Adapter->m_dhcp_enabled) + { + if ((Adapter->m_dhcp_server_ip & Adapter->m_remoteNetmask) == Adapter->m_remoteNetwork) + { + ETH_COPY_NETWORK_ADDRESS (Adapter->m_dhcp_server_mac, Adapter->m_TapToUser.dest); + Adapter->m_dhcp_server_arp = FALSE; + } + } +} + +// IRP_MJ_DEVICE_CONTROL callback. +NTSTATUS +TapDeviceControl( + PDEVICE_OBJECT DeviceObject, + PIRP Irp + ) + +/*++ + +Routine Description: + + This routine is called by the I/O system to perform a device I/O + control function. + +Arguments: + + DeviceObject - a pointer to the object that represents the device + that I/O is to be done on. + + Irp - a pointer to the I/O Request Packet for this request. + +Return Value: + + NT status code + +--*/ + +{ + NTSTATUS ntStatus = STATUS_SUCCESS; // Assume success + PIO_STACK_LOCATION irpSp; // Pointer to current stack location + PTAP_ADAPTER_CONTEXT adapter = NULL; + ULONG inBufLength; // Input buffer length + ULONG outBufLength; // Output buffer length + PCHAR inBuf, outBuf; // pointer to Input and output buffer + PMDL mdl = NULL; + PCHAR buffer = NULL; + + PAGED_CODE(); + + irpSp = IoGetCurrentIrpStackLocation( Irp ); + + // + // Fetch adapter context for this device. + // -------------------------------------- + // Adapter pointer was stashed in FsContext when handle was opened. + // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + inBufLength = irpSp->Parameters.DeviceIoControl.InputBufferLength; + outBufLength = irpSp->Parameters.DeviceIoControl.OutputBufferLength; + + if (!inBufLength || !outBufLength) + { + ntStatus = STATUS_INVALID_PARAMETER; + goto End; + } + + // + // Determine which I/O control code was specified. 
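+    //
+    // (Note: these IOCTLs use buffered I/O, so input and output share the
+    // single Irp->AssociatedIrp.SystemBuffer, and Irp->IoStatus.Information
+    // tells the I/O manager how many bytes to copy back to user mode.
+    // Illustrative sketch of how a user-mode client reaches this switch;
+    // hTapDevice is a hypothetical handle opened on the TAP device:
+    //
+    //      ULONG mtu = 0;
+    //      DWORD bytesReturned = 0;
+    //      DeviceIoControl(hTapDevice, TAP_WIN_IOCTL_GET_MTU,
+    //                      &mtu, sizeof(mtu),      // input buffer
+    //                      &mtu, sizeof(mtu),      // output buffer
+    //                      &bytesReturned, NULL);
+    //
+    // Both buffer lengths must be nonzero, or the length check above fails
+    // the request with STATUS_INVALID_PARAMETER.)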
+ // + switch ( irpSp->Parameters.DeviceIoControl.IoControlCode ) + { + case TAP_WIN_IOCTL_GET_MAC: + { + if (outBufLength >= MACADDR_SIZE ) + { + ETH_COPY_NETWORK_ADDRESS( + Irp->AssociatedIrp.SystemBuffer, + adapter->CurrentAddress + ); + + Irp->IoStatus.Information = MACADDR_SIZE; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_GET_VERSION: + { + const ULONG size = sizeof (ULONG) * 3; + + if (outBufLength >= size) + { + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0] + = TAP_DRIVER_MAJOR_VERSION; + + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[1] + = TAP_DRIVER_MINOR_VERSION; + + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[2] +#if DBG + = 1; +#else + = 0; +#endif + Irp->IoStatus.Information = size; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_GET_MTU: + { + const ULONG size = sizeof (ULONG) * 1; + + if (outBufLength >= size) + { + ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0] + = adapter->MtuSize; + + Irp->IoStatus.Information = size; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_TUN: + { + if(inBufLength >= sizeof(IPADDR)*3) + { + MACADDR dest; + + adapter->m_tun = FALSE; + + GenerateRelatedMAC (dest, adapter->CurrentAddress, 1); + + adapter->m_localIP = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_remoteNetwork = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + adapter->m_remoteNetmask = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[2]; + + // Sanity check on network/netmask + if ((adapter->m_remoteNetwork & adapter->m_remoteNetmask) != adapter->m_remoteNetwork) + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + break; + } + + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.src, adapter->CurrentAddress); + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.dest, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.src, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.dest, adapter->CurrentAddress); + + adapter->m_TapToUser.proto = adapter->m_UserToTap.proto = htons (NDIS_ETH_TYPE_IPV4); + adapter->m_UserToTap_IPv6 = adapter->m_UserToTap; + adapter->m_UserToTap_IPv6.proto = htons(NDIS_ETH_TYPE_IPV6); + + adapter->m_tun = TRUE; + + CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set TUN mode.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT: + { + if(inBufLength >= sizeof(IPADDR)*2) + { + MACADDR dest; + + adapter->m_tun = FALSE; + + GenerateRelatedMAC (dest, adapter->CurrentAddress, 1); + + adapter->m_localIP = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_remoteNetwork = ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + adapter->m_remoteNetmask = ~0; + + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.src, adapter->CurrentAddress); + ETH_COPY_NETWORK_ADDRESS (adapter->m_TapToUser.dest, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.src, dest); + ETH_COPY_NETWORK_ADDRESS (adapter->m_UserToTap.dest, adapter->CurrentAddress); + + adapter->m_TapToUser.proto = adapter->m_UserToTap.proto = htons (NDIS_ETH_TYPE_IPV4); + adapter->m_UserToTap_IPv6 = adapter->m_UserToTap; + adapter->m_UserToTap_IPv6.proto = htons(NDIS_ETH_TYPE_IPV6); + + adapter->m_tun = TRUE; + + 
CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set P2P mode.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_DHCP_MASQ: + { + if(inBufLength >= sizeof(IPADDR)*4) + { + adapter->m_dhcp_enabled = FALSE; + adapter->m_dhcp_server_arp = FALSE; + adapter->m_dhcp_user_supplied_options_buffer_len = 0; + + // Adapter IP addr / netmask + adapter->m_dhcp_addr = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[0]; + adapter->m_dhcp_netmask = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[1]; + + // IP addr of DHCP masq server + adapter->m_dhcp_server_ip = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[2]; + + // Lease time in seconds + adapter->m_dhcp_lease_time = + ((IPADDR*) (Irp->AssociatedIrp.SystemBuffer))[3]; + + GenerateRelatedMAC( + adapter->m_dhcp_server_mac, + adapter->CurrentAddress, + 2 + ); + + adapter->m_dhcp_enabled = TRUE; + adapter->m_dhcp_server_arp = TRUE; + + CheckIfDhcpAndTunMode (adapter); + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Configured DHCP MASQ.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_CONFIG_DHCP_SET_OPT: + { + if (inBufLength <= DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE + && adapter->m_dhcp_enabled) + { + adapter->m_dhcp_user_supplied_options_buffer_len = 0; + + NdisMoveMemory( + adapter->m_dhcp_user_supplied_options_buffer, + Irp->AssociatedIrp.SystemBuffer, + inBufLength + ); + + adapter->m_dhcp_user_supplied_options_buffer_len = + inBufLength; + + Irp->IoStatus.Information = 1; // Simple boolean value + + DEBUGP (("[TAP] Set DHCP OPT.\n")); + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + case TAP_WIN_IOCTL_GET_INFO: + { + char state[16]; + + // Fetch adapter (miniport) state. + if (tapAdapterSendAndReceiveReady(adapter) == NDIS_STATUS_SUCCESS) + state[0] = 'A'; + else + state[0] = 'a'; + + if (tapAdapterReadAndWriteReady(adapter)) + state[1] = 'T'; + else + state[1] = 't'; + + state[2] = '0' + adapter->CurrentPowerState; + + if (adapter->MediaStateAlwaysConnected) + state[3] = 'C'; + else + state[3] = 'c'; + + state[4] = '\0'; + + // BUGBUG!!! What follows, and is not yet implemented, is a real mess. + // BUGBUG!!! Tied closely to the NDIS 5 implementation. Need to map + // as much as possible to the NDIS 6 implementation. + Irp->IoStatus.Status = ntStatus = RtlStringCchPrintfExA ( + ((LPTSTR) (Irp->AssociatedIrp.SystemBuffer)), + outBufLength, + NULL, + NULL, + STRSAFE_FILL_BEHIND_NULL | STRSAFE_IGNORE_NULLS, +#if PACKET_TRUNCATION_CHECK + "State=%s Err=[%s/%d] #O=%d Tx=[%d,%d,%d] Rx=[%d,%d,%d] IrpQ=[%d,%d,%d] PktQ=[%d,%d,%d] InjQ=[%d,%d,%d]", +#else + "State=%s Err=[%s/%d] #O=%d Tx=[%d,%d] Rx=[%d,%d] IrpQ=[%d,%d,%d] PktQ=[%d,%d,%d] InjQ=[%d,%d,%d]", +#endif + state, + g_LastErrorFilename, + g_LastErrorLineNumber, + (int)adapter->TapFileOpenCount, + (int)(adapter->FramesTxDirected + adapter->FramesTxMulticast + adapter->FramesTxBroadcast), + (int)adapter->TransmitFailuresOther, +#if PACKET_TRUNCATION_CHECK + (int)adapter->m_TxTrunc, +#endif + (int)adapter->m_Rx, + (int)adapter->m_RxErr, +#if PACKET_TRUNCATION_CHECK + (int)adapter->m_RxTrunc, +#endif + (int)adapter->PendingReadIrpQueue.Count, + (int)adapter->PendingReadIrpQueue.MaxCount, + (int)IRP_QUEUE_SIZE, // Ignored in NDIS 6 driver... 
+ + (int)adapter->SendPacketQueue.Count, + (int)adapter->SendPacketQueue.MaxCount, + (int)PACKET_QUEUE_SIZE, + + (int)0, // adapter->InjectPacketQueue.Count - Unused + (int)0, // adapter->InjectPacketQueue.MaxCount - Unused + (int)INJECT_QUEUE_SIZE + ); + + Irp->IoStatus.Information = outBufLength; + + // BUGBUG!!! Fail because this is not completely implemented. + ntStatus = STATUS_INVALID_DEVICE_REQUEST; + } + break; + +#if DBG + case TAP_WIN_IOCTL_GET_LOG_LINE: + { + if (GetDebugLine( (LPTSTR)Irp->AssociatedIrp.SystemBuffer,outBufLength)) + { + Irp->IoStatus.Status = ntStatus = STATUS_SUCCESS; + } + else + { + Irp->IoStatus.Status = ntStatus = STATUS_UNSUCCESSFUL; + } + + Irp->IoStatus.Information = outBufLength; + + break; + } +#endif + + case TAP_WIN_IOCTL_SET_MEDIA_STATUS: + { + if(inBufLength >= sizeof(ULONG)) + { + ULONG parm = ((PULONG) (Irp->AssociatedIrp.SystemBuffer))[0]; + tapSetMediaConnectStatus (adapter, (BOOLEAN) parm); + Irp->IoStatus.Information = 1; + } + else + { + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + } + } + break; + + default: + + // + // The specified I/O control code is unrecognized by this driver. + // + ntStatus = STATUS_INVALID_DEVICE_REQUEST; + break; + } + +End: + + // + // Finish the I/O operation by simply completing the packet and returning + // the same status as in the packet itself. + // + Irp->IoStatus.Status = ntStatus; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + return ntStatus; +} + +// Flush the pending read IRP queue. +VOID +tapFlushIrpQueues( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + + DEBUGP (("[TAP] tapFlushIrpQueues: Flushing %d pending read IRPs\n", + Adapter->PendingReadIrpQueue.Count)); + + tapIrpCsqFlush(&Adapter->PendingReadIrpQueue); +} + +// IRP_MJ_CLEANUP +NTSTATUS +TapDeviceCleanup( + PDEVICE_OBJECT DeviceObject, + PIRP Irp + ) +/*++ + +Routine Description: + + Receipt of this request indicates that the last handle for a file + object that is associated with the target device object has been closed + (but, due to outstanding I/O requests, might not have been released). + + A driver that holds pending IRPs internally must implement a routine for + IRP_MJ_CLEANUP. When the routine is called, the driver should cancel all + the pending IRPs that belong to the file object identified by the IRP_MJ_CLEANUP + call. + + In other words, it should cancel all the IRPs that have the same file-object + pointer as the one supplied in the current I/O stack location of the IRP for the + IRP_MJ_CLEANUP call. Of course, IRPs belonging to other file objects should + not be canceled. Also, if an outstanding IRP is completed immediately, the + driver does not have to cancel it. + +Arguments: + + DeviceObject - a pointer to the object that represents the device + to be cleaned up. + + Irp - a pointer to the I/O Request Packet for this request. + +Return Value: + + NT status code + +--*/ + +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; // Always succeed. + PIO_STACK_LOCATION irpSp; // Pointer to current stack location + PTAP_ADAPTER_CONTEXT adapter = NULL; + + PAGED_CODE(); + + DEBUGP (("[TAP] --> TapDeviceCleanup\n")); + + irpSp = IoGetCurrentIrpStackLocation(Irp); + + // + // Fetch adapter context for this device. + // -------------------------------------- + // Adapter pointer was stashed in FsContext when handle was opened. + // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + // Insure that adapter exists. 
+    ASSERT(adapter);
+
+    if(adapter == NULL )
+    {
+        DEBUGP (("[TAP] release [%d.%d] cleanup request; adapter not found\n",
+            TAP_DRIVER_MAJOR_VERSION,
+            TAP_DRIVER_MINOR_VERSION
+            ));
+    }
+
+    if(adapter != NULL )
+    {
+        adapter->TapFileIsOpen = 0;    // Legacy...
+
+        // Disconnect from media.
+        tapSetMediaConnectStatus(adapter,FALSE);
+
+        // Reset adapter state when cleaning up.
+        tapResetAdapterState(adapter);
+
+        // BUGBUG!!! Use RemoveLock???
+
+        //
+        // Flush pending send TAP packet queue.
+        //
+        tapFlushSendPacketQueue(adapter);
+
+        ASSERT(adapter->SendPacketQueue.Count == 0);
+
+        //
+        // Flush the pending IRP queues
+        //
+        tapFlushIrpQueues(adapter);
+
+        ASSERT(adapter->PendingReadIrpQueue.Count == 0);
+    }
+
+    // Complete the IRP.
+    Irp->IoStatus.Status = status;
+    Irp->IoStatus.Information = 0;
+
+    IoCompleteRequest( Irp, IO_NO_INCREMENT );
+
+    DEBUGP (("[TAP] <-- TapDeviceCleanup; status = %8.8X\n",status));
+
+    return status;
+}
+
+// IRP_MJ_CLOSE
+NTSTATUS
+TapDeviceClose(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+/*++
+
+Routine Description:
+
+    Receipt of this request indicates that the last handle of the file
+    object that is associated with the target device object has been closed
+    and released.
+
+    All outstanding I/O requests have been completed or canceled.
+
+Arguments:
+
+    DeviceObject - a pointer to the object that represents the device
+    to be closed.
+
+    Irp - a pointer to the I/O Request Packet for this request.
+
+Return Value:
+
+    NT status code
+
+--*/
+
+{
+    NDIS_STATUS             status = NDIS_STATUS_SUCCESS; // Always succeed.
+    PIO_STACK_LOCATION      irpSp;  // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT    adapter = NULL;
+
+    PAGED_CODE();
+
+    DEBUGP (("[TAP] --> TapDeviceClose\n"));
+
+    irpSp = IoGetCurrentIrpStackLocation(Irp);
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
+    //
+    adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext;
+
+    // Insure that adapter exists.
+    ASSERT(adapter);
+
+    if(adapter == NULL )
+    {
+        DEBUGP (("[TAP] release [%d.%d] close request; adapter not found\n",
+            TAP_DRIVER_MAJOR_VERSION,
+            TAP_DRIVER_MINOR_VERSION
+            ));
+    }
+
+    if(adapter != NULL )
+    {
+        if(adapter->TapFileObject == NULL)
+        {
+            // Should never happen!!!
+            ASSERT(FALSE);
+        }
+        else
+        {
+            ASSERT(irpSp->FileObject->FsContext == adapter);
+
+            ASSERT(adapter->TapFileObject == irpSp->FileObject);
+        }
+
+        adapter->TapFileObject = NULL;
+        irpSp->FileObject = NULL;
+
+        // Remove the reference added when the handle was opened.
+        tapAdapterContextDereference(adapter);
+    }
+
+    // Complete the IRP.
+ Irp->IoStatus.Status = status; + Irp->IoStatus.Information = 0; + + IoCompleteRequest( Irp, IO_NO_INCREMENT ); + + DEBUGP (("[TAP] <-- TapDeviceClose; status = %8.8X\n",status)); + + return status; +} + +NTSTATUS +tapConcatenateNdisStrings( + __inout PNDIS_STRING DestinationString, + __in_opt PNDIS_STRING SourceString1, + __in_opt PNDIS_STRING SourceString2, + __in_opt PNDIS_STRING SourceString3 + ) +{ + NTSTATUS status; + + ASSERT(SourceString1 && SourceString2 && SourceString3); + + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString1 + ); + + if(status == STATUS_SUCCESS) + { + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString2 + ); + + if(status == STATUS_SUCCESS) + { + status = RtlAppendUnicodeStringToString( + DestinationString, + SourceString3 + ); + } + } + + return status; +} + +NTSTATUS +tapMakeDeviceNames( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status; + NDIS_STRING deviceNamePrefix = NDIS_STRING_CONST("\\Device\\"); + NDIS_STRING tapNameSuffix = NDIS_STRING_CONST(".tap"); + + // Generate DeviceName from NetCfgInstanceId. + Adapter->DeviceName.Buffer = Adapter->DeviceNameBuffer; + Adapter->DeviceName.MaximumLength = sizeof(Adapter->DeviceNameBuffer); + + status = tapConcatenateNdisStrings( + &Adapter->DeviceName, + &deviceNamePrefix, + &Adapter->NetCfgInstanceId, + &tapNameSuffix + ); + + if(status == STATUS_SUCCESS) + { + NDIS_STRING linkNamePrefix = NDIS_STRING_CONST("\\DosDevices\\Global\\"); + + Adapter->LinkName.Buffer = Adapter->LinkNameBuffer; + Adapter->LinkName.MaximumLength = sizeof(Adapter->LinkNameBuffer); + + status = tapConcatenateNdisStrings( + &Adapter->LinkName, + &linkNamePrefix, + &Adapter->NetCfgInstanceId, + &tapNameSuffix + ); + } + + return status; +} + +NDIS_STATUS +CreateTapDevice( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + NDIS_STATUS status; + NDIS_DEVICE_OBJECT_ATTRIBUTES deviceAttribute; + PDRIVER_DISPATCH dispatchTable[IRP_MJ_MAXIMUM_FUNCTION+1]; + + DEBUGP (("[TAP] version [%d.%d] creating tap device: %wZ\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + &Adapter->NetCfgInstanceId)); + + // Generate DeviceName and LinkName from NetCfgInstanceId. + status = tapMakeDeviceNames(Adapter); + + if (NT_SUCCESS(status)) + { + DEBUGP (("[TAP] DeviceName: %wZ\n",&Adapter->DeviceName)); + DEBUGP (("[TAP] LinkName: %wZ\n",&Adapter->LinkName)); + + // Initialize dispatch table. 
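+        // Only the six IRP major functions that make up the TAP user-mode
+        // interface (create/cleanup/close, read/write, and device control)
+        // receive handlers; every other entry is left NULL by the
+        // NdisZeroMemory call below.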
+        NdisZeroMemory(dispatchTable, (IRP_MJ_MAXIMUM_FUNCTION+1) * sizeof(PDRIVER_DISPATCH));
+
+        dispatchTable[IRP_MJ_CREATE] = TapDeviceCreate;
+        dispatchTable[IRP_MJ_CLEANUP] = TapDeviceCleanup;
+        dispatchTable[IRP_MJ_CLOSE] = TapDeviceClose;
+        dispatchTable[IRP_MJ_READ] = TapDeviceRead;
+        dispatchTable[IRP_MJ_WRITE] = TapDeviceWrite;
+        dispatchTable[IRP_MJ_DEVICE_CONTROL] = TapDeviceControl;
+
+        //
+        // Create a device object and register dispatch handlers
+        //
+        NdisZeroMemory(&deviceAttribute, sizeof(NDIS_DEVICE_OBJECT_ATTRIBUTES));
+
+        deviceAttribute.Header.Type = NDIS_OBJECT_TYPE_DEVICE_OBJECT_ATTRIBUTES;
+        deviceAttribute.Header.Revision = NDIS_DEVICE_OBJECT_ATTRIBUTES_REVISION_1;
+        deviceAttribute.Header.Size = sizeof(NDIS_DEVICE_OBJECT_ATTRIBUTES);
+
+        deviceAttribute.DeviceName = &Adapter->DeviceName;
+        deviceAttribute.SymbolicName = &Adapter->LinkName;
+        deviceAttribute.MajorFunctions = &dispatchTable[0];
+        //deviceAttribute.ExtensionSize = sizeof(FILTER_DEVICE_EXTENSION);
+
+#if ENABLE_NONADMIN
+        if(Adapter->AllowNonAdmin)
+        {
+            //
+            // SDDL_DEVOBJ_SYS_ALL_ADM_RWX_WORLD_RWX_RES_RWX allows the kernel and
+            // system complete control over the device. By default the admin can
+            // access the entire device, but cannot change the ACL (the admin must
+            // take control of the device first).
+            //
+            // Everyone else, including "restricted" or "untrusted" code, can read
+            // or write to the device. Traversal beneath the device is also granted
+            // (removing it would only affect storage devices, except if the
+            // "bypass-traversal" privilege was revoked).
+            //
+            deviceAttribute.DefaultSDDLString = &SDDL_DEVOBJ_SYS_ALL_ADM_RWX_WORLD_RWX_RES_RWX;
+        }
+#endif
+
+        status = NdisRegisterDeviceEx(
+            Adapter->MiniportAdapterHandle,
+            &deviceAttribute,
+            &Adapter->DeviceObject,
+            &Adapter->DeviceHandle
+            );
+    }
+
+    ASSERT(NT_SUCCESS(status));
+
+    if (NT_SUCCESS(status))
+    {
+        // Set TAP device flags.
+        (Adapter->DeviceObject)->Flags &= ~DO_BUFFERED_IO;
+        (Adapter->DeviceObject)->Flags |= DO_DIRECT_IO;
+
+        //========================
+        // Finalize initialization
+        //========================
+
+        Adapter->TapDeviceCreated = TRUE;
+
+        DEBUGP (("[%wZ] successfully created TAP device [%wZ]\n",
+            &Adapter->NetCfgInstanceId,
+            &Adapter->DeviceName
+            ));
+    }
+
+    DEBUGP (("[TAP] <-- CreateTapDevice; status = %8.8X\n",status));
+
+    return status;
+}
+
+//
+// DestroyTapDevice is called from AdapterHalt while the NDIS miniport
+// is in the Halted state. Before entering the Halted state the
+// miniport passes through the Pausing and Paused states; those states
+// are responsible for waiting until NDIS network operations have
+// completed.
+//
+VOID
+DestroyTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+{
+    DEBUGP (("[TAP] --> DestroyTapDevice; Adapter: %wZ\n",
+        &Adapter->NetCfgInstanceId));
+
+    //
+    // Let clients know we are shutting down
+    //
+    Adapter->TapDeviceCreated = FALSE;
+
+    //
+    // Flush pending send TAP packet queue.
+    //
+    tapFlushSendPacketQueue(Adapter);
+
+    ASSERT(Adapter->SendPacketQueue.Count == 0);
+
+    //
+    // Flush IRP queues. Wait for pending I/O. Etc.
+    // --------------------------------------------
+    // Exhaust IRP and packet queues. Any pending IRPs will
+    // be cancelled, causing user-space to get this error
+    // on overlapped reads:
+    //
+    //     ERROR_OPERATION_ABORTED, code=995
+    //
+    //     "The I/O operation has been aborted because of either a
+    //      thread exit or an application request."
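+    //
+    // Illustrative sketch only (not part of this driver): a user-mode client
+    // doing overlapped reads could detect the abort like this. The handle
+    // name hTap and the buffer size are assumptions.
+    //
+    //     BYTE buf[1600];
+    //     OVERLAPPED ov = {0};
+    //     ov.hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
+    //     if (!ReadFile(hTap, buf, sizeof(buf), NULL, &ov)
+    //         && GetLastError() != ERROR_IO_PENDING)
+    //     { /* hard failure */ }
+    //     DWORD bytes;
+    //     if (!GetOverlappedResult(hTap, &ov, &bytes, TRUE)
+    //         && GetLastError() == ERROR_OPERATION_ABORTED)
+    //     {
+    //         CloseHandle(hTap);   // let the driver finish tearing down
+    //     }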
+    //
+    // It's important that user-space close the device handle
+    // when this error code is returned, so that when we finally
+    // call NdisDeregisterDeviceEx, the device reference count
+    // is 0. Otherwise the driver will not unload even if the
+    // last adapter has been halted.
+    //
+    // The act of flushing the queues at this point should result in the
+    // user-mode application closing the adapter's device handle. Closing
+    // the handle results in the TapDeviceCleanup call being made, followed
+    // by a call to the TapDeviceClose callback.
+    //
+    tapFlushIrpQueues(Adapter);
+
+    ASSERT(Adapter->PendingReadIrpQueue.Count == 0);
+
+    //
+    // Deregister the Win32 device.
+    // ----------------------------
+    // When a driver calls NdisDeregisterDeviceEx, the I/O manager deletes the
+    // target device object if there are no outstanding references to it. However,
+    // if any outstanding references remain, the I/O manager marks the device
+    // object as "delete pending" and deletes the device object when the references
+    // are finally released.
+    //
+    if(Adapter->DeviceHandle)
+    {
+        DEBUGP (("[TAP] Calling NdisDeregisterDeviceEx\n"));
+        NdisDeregisterDeviceEx(Adapter->DeviceHandle);
+    }
+
+    Adapter->DeviceHandle = NULL;
+
+    DEBUGP (("[TAP] <-- DestroyTapDevice\n"));
+}
+
diff --git a/installer/tap/src/src/device.h b/installer/tap/src/src/device.h
new file mode 100644
index 0000000..93dae0d
--- /dev/null
+++ b/installer/tap/src/src/device.h
@@ -0,0 +1,50 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __TAP_DEVICE_H_ +#define __TAP_DEVICE_H_ + +//====================================================================== +// TAP Prototypes for standard Win32 device I/O entry points +//====================================================================== + +__drv_dispatchType(IRP_MJ_CREATE) +DRIVER_DISPATCH TapDeviceCreate; + +__drv_dispatchType(IRP_MJ_READ) +DRIVER_DISPATCH TapDeviceRead; + +__drv_dispatchType(IRP_MJ_WRITE) +DRIVER_DISPATCH TapDeviceWrite; + +__drv_dispatchType(IRP_MJ_DEVICE_CONTROL) +DRIVER_DISPATCH TapDeviceControl; + +__drv_dispatchType(IRP_MJ_CLEANUP) +DRIVER_DISPATCH TapDeviceCleanup; + +__drv_dispatchType(IRP_MJ_CLOSE) +DRIVER_DISPATCH TapDeviceClose; + +#endif // __TAP_DEVICE_H_ \ No newline at end of file diff --git a/installer/tap/src/src/dhcp.c b/installer/tap/src/src/dhcp.c new file mode 100644 index 0000000..30b22f4 --- /dev/null +++ b/installer/tap/src/src/dhcp.c @@ -0,0 +1,710 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include "tap.h" + +//========================= +// Code to set DHCP options +//========================= + +VOID +SetDHCPOpt( + __in DHCPMsg *m, + __in void *data, + __in unsigned int len + ) +{ + if (!m->overflow) + { + if (m->optlen + len <= DHCP_OPTIONS_BUFFER_SIZE) + { + if (len) + { + NdisMoveMemory (m->msg.options + m->optlen, data, len); + m->optlen += len; + } + } + else + { + m->overflow = TRUE; + } + } +} + +VOID +SetDHCPOpt0( + __in DHCPMsg *msg, + __in int type + ) +{ + DHCPOPT0 opt; + opt.type = (UCHAR) type; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +VOID +SetDHCPOpt8( + __in DHCPMsg *msg, + __in int type, + __in ULONG data + ) +{ + DHCPOPT8 opt; + opt.type = (UCHAR) type; + opt.len = sizeof (opt.data); + opt.data = (UCHAR) data; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +VOID +SetDHCPOpt32( + __in DHCPMsg *msg, + __in int type, + __in ULONG data + ) +{ + DHCPOPT32 opt; + opt.type = (UCHAR) type; + opt.len = sizeof (opt.data); + opt.data = data; + SetDHCPOpt (msg, &opt, sizeof (opt)); +} + +//============== +// Checksum code +//============== + +USHORT +ip_checksum( + __in const UCHAR *buf, + __in const int len_ip_header + ) +{ + USHORT word16; + ULONG sum = 0; + int i; + + // make 16 bit words out of every two adjacent 8 bit words in the packet + // and add them up + for (i = 0; i < len_ip_header - 1; i += 2) + { + word16 = ((buf[i] << 8) & 0xFF00) + (buf[i+1] & 0xFF); + sum += (ULONG) word16; + } + + // take only 16 bits out of the 32 bit sum and add up the carries + while (sum >> 16) + { + sum = (sum & 0xFFFF) + (sum >> 16); + } + + // one's complement the result + return ((USHORT) ~sum); +} + +USHORT +udp_checksum ( + __in const UCHAR *buf, + __in const int len_udp, + __in const UCHAR *src_addr, + __in const UCHAR *dest_addr + ) +{ + USHORT word16; + ULONG sum = 0; + int i; + + // make 16 bit words out of every two adjacent 8 bit words and + // calculate the sum of all 16 bit words + for (i = 0; i < len_udp; i += 2) + { + word16 = ((buf[i] << 8) & 0xFF00) + ((i + 1 < len_udp) ? 
(buf[i+1] & 0xFF) : 0); + sum += word16; + } + + // add the UDP pseudo header which contains the IP source and destination addresses + for (i = 0; i < 4; i += 2) + { + word16 =((src_addr[i] << 8) & 0xFF00) + (src_addr[i+1] & 0xFF); + sum += word16; + } + + for (i = 0; i < 4; i += 2) + { + word16 =((dest_addr[i] << 8) & 0xFF00) + (dest_addr[i+1] & 0xFF); + sum += word16; + } + + // the protocol number and the length of the UDP packet + sum += (USHORT) IPPROTO_UDP + (USHORT) len_udp; + + // keep only the last 16 bits of the 32 bit calculated sum and add the carries + while (sum >> 16) + { + sum = (sum & 0xFFFF) + (sum >> 16); + } + + // Take the one's complement of sum + return ((USHORT) ~sum); +} + +//================================ +// Set IP and UDP packet checksums +//================================ + +VOID +SetChecksumDHCPMsg( + __in DHCPMsg *m + ) +{ + // Set IP checksum + m->msg.pre.ip.check = htons (ip_checksum ((UCHAR *) &m->msg.pre.ip, sizeof (IPHDR))); + + // Set UDP Checksum + m->msg.pre.udp.check = htons (udp_checksum ((UCHAR *) &m->msg.pre.udp, + sizeof (UDPHDR) + sizeof (DHCP) + m->optlen, + (UCHAR *)&m->msg.pre.ip.saddr, + (UCHAR *)&m->msg.pre.ip.daddr)); +} + +//=================== +// DHCP message tests +//=================== + +int +GetDHCPMessageType( + __in const DHCP *dhcp, + __in const int optlen + ) +{ + const UCHAR *p = (UCHAR *) (dhcp + 1); + int i; + + for (i = 0; i < optlen; ++i) + { + const UCHAR type = p[i]; + const int room = optlen - i - 1; + + if (type == DHCP_END) // didn't find what we were looking for + return -1; + else if (type == DHCP_PAD) // no-operation + ; + else if (type == DHCP_MSG_TYPE) // what we are looking for + { + if (room >= 2) + { + if (p[i+1] == 1) // message length should be 1 + return p[i+2]; // return message type + } + return -1; + } + else // some other message + { + if (room >= 1) + { + const int len = p[i+1]; // get message length + i += (len + 1); // advance to next message + } + } + } + return -1; +} + +BOOLEAN +DHCPMessageOurs ( + __in const PTAP_ADAPTER_CONTEXT Adapter, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp + ) +{ + // Must be UDPv4 protocol + if (!(eth->proto == htons (NDIS_ETH_TYPE_IPV4) && ip->protocol == IPPROTO_UDP)) + { + return FALSE; + } + + // Source MAC must be our adapter + if (!MAC_EQUAL (eth->src, Adapter->CurrentAddress)) + { + return FALSE; + } + + // Dest MAC must be either broadcast or our virtual DHCP server + if (!(ETH_IS_BROADCAST(eth->dest) + || MAC_EQUAL (eth->dest, Adapter->m_dhcp_server_mac))) + { + return FALSE; + } + + // Port numbers must be correct + if (!(udp->dest == htons (BOOTPS_PORT) + && udp->source == htons (BOOTPC_PORT))) + { + return FALSE; + } + + // Hardware address must be MAC addr sized + if (!(dhcp->hlen == sizeof (MACADDR))) + { + return FALSE; + } + + // Hardware address must match our adapter + if (!MAC_EQUAL (eth->src, dhcp->chaddr)) + { + return FALSE; + } + + return TRUE; +} + + +//===================================================== +// Build all of DHCP packet except for DHCP options. +// Assume that *p has been zeroed before we are called. +//===================================================== + +VOID +BuildDHCPPre ( + __in const PTAP_ADAPTER_CONTEXT Adapter, + __inout DHCPPre *p, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp, + __in const int optlen, + __in const int type) +{ + // Should we broadcast or direct to a specific MAC / IP address? 
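+    // DHCPNAK replies are always broadcast: the client being refused may not
+    // yet hold a usable IP configuration, so a unicast reply could not reach
+    // it. Other replies are unicast to the requesting MAC unless the request
+    // itself arrived as an Ethernet broadcast.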
+ const BOOLEAN broadcast = (type == DHCPNAK + || ETH_IS_BROADCAST(eth->dest)); + + // + // Build ethernet header + // + ETH_COPY_NETWORK_ADDRESS (p->eth.src, Adapter->m_dhcp_server_mac); + + if (broadcast) + { + memset(p->eth.dest,0xFF,ETH_LENGTH_OF_ADDRESS); + } + else + { + ETH_COPY_NETWORK_ADDRESS (p->eth.dest, eth->src); + } + + p->eth.proto = htons (NDIS_ETH_TYPE_IPV4); + + // + // Build IP header + // + p->ip.version_len = (4 << 4) | (sizeof (IPHDR) >> 2); + p->ip.tos = 0; + p->ip.tot_len = htons (sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen); + p->ip.id = 0; + p->ip.frag_off = 0; + p->ip.ttl = 16; + p->ip.protocol = IPPROTO_UDP; + p->ip.check = 0; + p->ip.saddr = Adapter->m_dhcp_server_ip; + + if (broadcast) + { + p->ip.daddr = ~0; + } + else + { + p->ip.daddr = Adapter->m_dhcp_addr; + } + + // + // Build UDP header + // + p->udp.source = htons (BOOTPS_PORT); + p->udp.dest = htons (BOOTPC_PORT); + p->udp.len = htons (sizeof (UDPHDR) + sizeof (DHCP) + optlen); + p->udp.check = 0; + + // Build DHCP response + + p->dhcp.op = BOOTREPLY; + p->dhcp.htype = 1; + p->dhcp.hlen = sizeof (MACADDR); + p->dhcp.hops = 0; + p->dhcp.xid = dhcp->xid; + p->dhcp.secs = 0; + p->dhcp.flags = 0; + p->dhcp.ciaddr = 0; + + if (type == DHCPNAK) + { + p->dhcp.yiaddr = 0; + } + else + { + p->dhcp.yiaddr = Adapter->m_dhcp_addr; + } + + p->dhcp.siaddr = Adapter->m_dhcp_server_ip; + p->dhcp.giaddr = 0; + ETH_COPY_NETWORK_ADDRESS (p->dhcp.chaddr, eth->src); + p->dhcp.magic = htonl (0x63825363); +} + +//============================= +// Build specific DHCP messages +//============================= + +VOID +SendDHCPMsg( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in const int type, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp + ) +{ + DHCPMsg *pkt; + + if (!(type == DHCPOFFER || type == DHCPACK || type == DHCPNAK)) + { + DEBUGP (("[TAP] SendDHCPMsg: Bad DHCP type: %d\n", type)); + return; + } + + pkt = (DHCPMsg *) MemAlloc (sizeof (DHCPMsg), TRUE); + + if(pkt) + { + //----------------------- + // Build DHCP options + //----------------------- + + // Message Type + SetDHCPOpt8 (pkt, DHCP_MSG_TYPE, type); + + // Server ID + SetDHCPOpt32 (pkt, DHCP_SERVER_ID, Adapter->m_dhcp_server_ip); + + if (type == DHCPOFFER || type == DHCPACK) + { + // Lease Time + SetDHCPOpt32 (pkt, DHCP_LEASE_TIME, htonl (Adapter->m_dhcp_lease_time)); + + // Netmask + SetDHCPOpt32 (pkt, DHCP_NETMASK, Adapter->m_dhcp_netmask); + + // Other user-defined options + SetDHCPOpt ( + pkt, + Adapter->m_dhcp_user_supplied_options_buffer, + Adapter->m_dhcp_user_supplied_options_buffer_len); + } + + // End + SetDHCPOpt0 (pkt, DHCP_END); + + if (!DHCPMSG_OVERFLOW (pkt)) + { + // The initial part of the DHCP message (not including options) gets built here + BuildDHCPPre ( + Adapter, + &pkt->msg.pre, + eth, + ip, + udp, + dhcp, + DHCPMSG_LEN_OPT (pkt), + type); + + SetChecksumDHCPMsg (pkt); + + DUMP_PACKET ("DHCPMsg", + DHCPMSG_BUF (pkt), + DHCPMSG_LEN_FULL (pkt)); + + // Return DHCP response to kernel + IndicateReceivePacket( + Adapter, + DHCPMSG_BUF (pkt), + DHCPMSG_LEN_FULL (pkt) + ); + } + else + { + DEBUGP (("[TAP] SendDHCPMsg: DHCP buffer overflow\n")); + } + + MemFree (pkt, sizeof (DHCPMsg)); + } +} + +//=================================================================== +// Handle a BOOTPS packet produced by the local system to +// resolve the address/netmask of this adapter. +// If we are in TAP_WIN_IOCTL_CONFIG_DHCP_MASQ mode, reply +// to the message. 
Return TRUE if we processed the passed +// message, so that downstream stages can ignore it. +//=================================================================== + +BOOLEAN +ProcessDHCP( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in const ETH_HEADER *eth, + __in const IPHDR *ip, + __in const UDPHDR *udp, + __in const DHCP *dhcp, + __in int optlen + ) +{ + int msg_type; + + // Sanity check IP header + if (!(ntohs (ip->tot_len) == sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen + && (ntohs (ip->frag_off) & IP_OFFMASK) == 0)) + { + return TRUE; + } + + // Does this message belong to us? + if (!DHCPMessageOurs (Adapter, eth, ip, udp, dhcp)) + { + return FALSE; + } + + msg_type = GetDHCPMessageType (dhcp, optlen); + + // Drop non-BOOTREQUEST messages + if (dhcp->op != BOOTREQUEST) + { + return TRUE; + } + + // Drop any messages except DHCPDISCOVER or DHCPREQUEST + if (!(msg_type == DHCPDISCOVER || msg_type == DHCPREQUEST)) + { + return TRUE; + } + + // Should we reply with DHCPOFFER, DHCPACK, or DHCPNAK? + if (msg_type == DHCPREQUEST + && ((dhcp->ciaddr && dhcp->ciaddr != Adapter->m_dhcp_addr) + || !Adapter->m_dhcp_received_discover + || Adapter->m_dhcp_bad_requests >= BAD_DHCPREQUEST_NAK_THRESHOLD)) + { + SendDHCPMsg( + Adapter, + DHCPNAK, + eth, ip, udp, dhcp + ); + } + else + { + SendDHCPMsg( + Adapter, + (msg_type == DHCPDISCOVER ? DHCPOFFER : DHCPACK), + eth, ip, udp, dhcp + ); + } + + // Remember if we received a DHCPDISCOVER + if (msg_type == DHCPDISCOVER) + { + Adapter->m_dhcp_received_discover = TRUE; + } + + // Is this a bad DHCPREQUEST? + if (msg_type == DHCPREQUEST && dhcp->ciaddr && dhcp->ciaddr != Adapter->m_dhcp_addr) + { + ++Adapter->m_dhcp_bad_requests; + } + + return TRUE; +} + +#if DBG + +const char * + message_op_text (int op) +{ + switch (op) + { + case BOOTREQUEST: + return "BOOTREQUEST"; + + case BOOTREPLY: + return "BOOTREPLY"; + + default: + return "???"; + } +} + +const char * + message_type_text (int type) +{ + switch (type) + { + case DHCPDISCOVER: + return "DHCPDISCOVER"; + + case DHCPOFFER: + return "DHCPOFFER"; + + case DHCPREQUEST: + return "DHCPREQUEST"; + + case DHCPDECLINE: + return "DHCPDECLINE"; + + case DHCPACK: + return "DHCPACK"; + + case DHCPNAK: + return "DHCPNAK"; + + case DHCPRELEASE: + return "DHCPRELEASE"; + + case DHCPINFORM: + return "DHCPINFORM"; + + default: + return "???"; + } +} + +const char * +port_name (int port) +{ + switch (port) + { + case BOOTPS_PORT: + return "BOOTPS"; + + case BOOTPC_PORT: + return "BOOTPC"; + + default: + return "unknown"; + } +} + +VOID +DumpDHCP ( + const ETH_HEADER *eth, + const IPHDR *ip, + const UDPHDR *udp, + const DHCP *dhcp, + const int optlen + ) +{ + DEBUGP ((" %s", message_op_text (dhcp->op))); + DEBUGP ((" %s ", message_type_text (GetDHCPMessageType (dhcp, optlen)))); + PrIP (ip->saddr); + DEBUGP ((":%s[", port_name (ntohs (udp->source)))); + PrMac (eth->src); + DEBUGP (("] -> ")); + PrIP (ip->daddr); + DEBUGP ((":%s[", port_name (ntohs (udp->dest)))); + PrMac (eth->dest); + DEBUGP (("]")); + if (dhcp->ciaddr) + { + DEBUGP ((" ci=")); + PrIP (dhcp->ciaddr); + } + if (dhcp->yiaddr) + { + DEBUGP ((" yi=")); + PrIP (dhcp->yiaddr); + } + if (dhcp->siaddr) + { + DEBUGP ((" si=")); + PrIP (dhcp->siaddr); + } + if (dhcp->hlen == sizeof (MACADDR)) + { + DEBUGP ((" ch=")); + PrMac (dhcp->chaddr); + } + + DEBUGP ((" xid=0x%08x", ntohl (dhcp->xid))); + + if (ntohl (dhcp->magic) != 0x63825363) + DEBUGP ((" ma=0x%08x", ntohl (dhcp->magic))); + if (dhcp->htype != 1) + DEBUGP ((" htype=%d", dhcp->htype)); + if 
(dhcp->hops) + DEBUGP ((" hops=%d", dhcp->hops)); + if (ntohs (dhcp->secs)) + DEBUGP ((" secs=%d", ntohs (dhcp->secs))); + if (ntohs (dhcp->flags)) + DEBUGP ((" flags=0x%04x", ntohs (dhcp->flags))); + + // extra stuff + + if (ip->version_len != 0x45) + DEBUGP ((" vl=0x%02x", ip->version_len)); + if (ntohs (ip->tot_len) != sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + optlen) + DEBUGP ((" tl=%d", ntohs (ip->tot_len))); + if (ntohs (udp->len) != sizeof (UDPHDR) + sizeof (DHCP) + optlen) + DEBUGP ((" ul=%d", ntohs (udp->len))); + + if (ip->tos) + DEBUGP ((" tos=0x%02x", ip->tos)); + if (ntohs (ip->id)) + DEBUGP ((" id=0x%04x", ntohs (ip->id))); + if (ntohs (ip->frag_off)) + DEBUGP ((" frag_off=0x%04x", ntohs (ip->frag_off))); + + DEBUGP ((" ttl=%d", ip->ttl)); + DEBUGP ((" ic=0x%04x [0x%04x]", ntohs (ip->check), + ip_checksum ((UCHAR*)ip, sizeof (IPHDR)))); + DEBUGP ((" uc=0x%04x [0x%04x/%d]", ntohs (udp->check), + udp_checksum ((UCHAR *) udp, + sizeof (UDPHDR) + sizeof (DHCP) + optlen, + (UCHAR *) &ip->saddr, + (UCHAR *) &ip->daddr), + optlen)); + + // Options + { + const UCHAR *opt = (UCHAR *) (dhcp + 1); + int i; + + DEBUGP ((" OPT")); + for (i = 0; i < optlen; ++i) + { + const UCHAR data = opt[i]; + DEBUGP ((".%d", data)); + } + } +} + +#endif /* DBG */ diff --git a/installer/tap/src/src/dhcp.h b/installer/tap/src/src/dhcp.h new file mode 100644 index 0000000..b594a5e --- /dev/null +++ b/installer/tap/src/src/dhcp.h @@ -0,0 +1,165 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#pragma once + +#pragma pack(1) + +//=================================================== +// How many bad DHCPREQUESTs do we receive before we +// return a NAK? +// +// A bad DHCPREQUEST is defined to be one where the +// requestor doesn't know its IP address. +//=================================================== + +#define BAD_DHCPREQUEST_NAK_THRESHOLD 3 + +//============================================== +// Maximum number of DHCP options bytes supplied +//============================================== + +#define DHCP_USER_SUPPLIED_OPTIONS_BUFFER_SIZE 256 +#define DHCP_OPTIONS_BUFFER_SIZE 256 + +//=================================== +// UDP port numbers of DHCP messages. +//=================================== + +#define BOOTPS_PORT 67 +#define BOOTPC_PORT 68 + +//=========================== +// The DHCP message structure +//=========================== + +typedef struct { +# define BOOTREQUEST 1 +# define BOOTREPLY 2 + UCHAR op; /* message op */ + + UCHAR htype; /* hardware address type (e.g. 
'1' = 10Mb Ethernet) */
+  UCHAR hlen;          /* hardware address length (e.g. '6' for 10Mb Ethernet) */
+  UCHAR hops;          /* client sets to 0, may be used by relay agents */
+  ULONG xid;           /* transaction ID, chosen by client */
+  USHORT secs;         /* seconds since request process began, set by client */
+  USHORT flags;
+  ULONG ciaddr;        /* client IP address, client sets if known */
+  ULONG yiaddr;        /* 'your' IP address -- server's response to client */
+  ULONG siaddr;        /* server IP address */
+  ULONG giaddr;        /* relay agent IP address */
+  UCHAR chaddr[16];    /* client hardware address */
+  UCHAR sname[64];     /* optional server host name */
+  UCHAR file[128];     /* boot file name */
+  ULONG magic;         /* must be 0x63825363 (network order) */
+} DHCP;
+
+typedef struct {
+  ETH_HEADER eth;
+  IPHDR ip;
+  UDPHDR udp;
+  DHCP dhcp;
+} DHCPPre;
+
+typedef struct {
+  DHCPPre pre;
+  UCHAR options[DHCP_OPTIONS_BUFFER_SIZE];
+} DHCPFull;
+
+typedef struct {
+  unsigned int optlen;
+  BOOLEAN overflow;
+  DHCPFull msg;
+} DHCPMsg;
+
+//===================
+// Macros for DHCPMSG
+//===================
+
+#define DHCPMSG_LEN_BASE(p) (sizeof (DHCPPre))
+#define DHCPMSG_LEN_OPT(p)  ((p)->optlen)
+#define DHCPMSG_LEN_FULL(p) (DHCPMSG_LEN_BASE(p) + DHCPMSG_LEN_OPT(p))
+#define DHCPMSG_BUF(p)      ((UCHAR*) &(p)->msg)
+#define DHCPMSG_OVERFLOW(p) ((p)->overflow)
+
+//========================================
+// structs to hold individual DHCP options
+//========================================
+
+typedef struct {
+  UCHAR type;
+} DHCPOPT0;
+
+typedef struct {
+  UCHAR type;
+  UCHAR len;
+  UCHAR data;
+} DHCPOPT8;
+
+typedef struct {
+  UCHAR type;
+  UCHAR len;
+  ULONG data;
+} DHCPOPT32;
+
+#pragma pack()
+
+//==================
+// DHCP Option types
+//==================
+
+#define DHCP_MSG_TYPE    53  /* message type (u8) */
+#define DHCP_PARM_REQ    55  /* parameter request list: c1 (u8), ... */
+#define DHCP_CLIENT_ID   61  /* client ID: type (u8), i1 (u8), ... */
+#define DHCP_IP          50  /* requested IP addr (u32) */
+#define DHCP_NETMASK      1  /* subnet mask (u32) */
+#define DHCP_LEASE_TIME  51  /* lease time sec (u32) */
+#define DHCP_RENEW_TIME  58  /* renewal time sec (u32) */
+#define DHCP_REBIND_TIME 59  /* rebind time sec (u32) */
+#define DHCP_SERVER_ID   54  /* server ID: IP addr (u32) */
+#define DHCP_PAD          0
+#define DHCP_END        255
+
+//===================
+// DHCP Message types
+//===================
+
+#define DHCPDISCOVER 1
+#define DHCPOFFER    2
+#define DHCPREQUEST  3
+#define DHCPDECLINE  4
+#define DHCPACK      5
+#define DHCPNAK      6
+#define DHCPRELEASE  7
+#define DHCPINFORM   8
+
+#if DBG
+
+VOID
+DumpDHCP (const ETH_HEADER *eth,
+          const IPHDR *ip,
+          const UDPHDR *udp,
+          const DHCP *dhcp,
+          const int optlen);
+
+#endif
diff --git a/installer/tap/src/src/endian.h b/installer/tap/src/src/endian.h
new file mode 100644
index 0000000..b7d3449
--- /dev/null
+++ b/installer/tap/src/src/endian.h
@@ -0,0 +1,35 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifdef TAP_LITTLE_ENDIAN +#define ntohs(x) RtlUshortByteSwap(x) +#define htons(x) RtlUshortByteSwap(x) +#define ntohl(x) RtlUlongByteSwap(x) +#define htonl(x) RtlUlongByteSwap(x) +#else +#define ntohs(x) ((USHORT)(x)) +#define htons(x) ((USHORT)(x)) +#define ntohl(x) ((ULONG)(x)) +#define htonl(x) ((ULONG)(x)) +#endif diff --git a/installer/tap/src/src/error.c b/installer/tap/src/src/error.c new file mode 100644 index 0000000..1fad1d3 --- /dev/null +++ b/installer/tap/src/src/error.c @@ -0,0 +1,398 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include "tap.h" + +//----------------- +// DEBUGGING OUTPUT +//----------------- + +const char *g_LastErrorFilename; +int g_LastErrorLineNumber; + +#if DBG + +DebugOutput g_Debug; + +BOOLEAN +NewlineExists (const char *str, int len) +{ + while (len-- > 0) + { + const char c = *str++; + if (c == '\n') + return TRUE; + else if (c == '\0') + break; + } + return FALSE; +} + +VOID +MyDebugInit (unsigned int bufsiz) +{ + NdisZeroMemory (&g_Debug, sizeof (g_Debug)); + g_Debug.text = (char *) MemAlloc (bufsiz, FALSE); + + if (g_Debug.text) + { + g_Debug.capacity = bufsiz; + } +} + +VOID +MyDebugFree () +{ + if (g_Debug.text) + { + MemFree (g_Debug.text, g_Debug.capacity); + } + + NdisZeroMemory (&g_Debug, sizeof (g_Debug)); +} + +VOID +MyDebugPrint (const unsigned char* format, ...) 
+{ + if (g_Debug.text && g_Debug.capacity > 0 && CAN_WE_PRINT) + { + BOOLEAN owned; + ACQUIRE_MUTEX_ADAPTIVE (&g_Debug.lock, owned); + if (owned) + { + const int remaining = (int)g_Debug.capacity - (int)g_Debug.out; + + if (remaining > 0) + { + va_list args; + NTSTATUS status; + char *end; + +#ifdef DBG_PRINT + va_start (args, format); + vDbgPrintEx (DPFLTR_IHVNETWORK_ID, DPFLTR_INFO_LEVEL, format, args); + va_end (args); +#endif + va_start (args, format); + status = RtlStringCchVPrintfExA (g_Debug.text + g_Debug.out, + remaining, + &end, + NULL, + STRSAFE_NO_TRUNCATION | STRSAFE_IGNORE_NULLS, + format, + args); + va_end (args); + va_start (args, format); + vDbgPrintEx(DPFLTR_IHVDRIVER_ID , 1, format, args); + va_end (args); + if (status == STATUS_SUCCESS) + g_Debug.out = (unsigned int) (end - g_Debug.text); + else + g_Debug.error = TRUE; + } + else + g_Debug.error = TRUE; + + RELEASE_MUTEX (&g_Debug.lock); + } + else + g_Debug.error = TRUE; + } +} + +BOOLEAN +GetDebugLine ( + __in char *buf, + __in const int len + ) +{ + static const char *truncated = "[OUTPUT TRUNCATED]\n"; + BOOLEAN ret = FALSE; + + NdisZeroMemory (buf, len); + + if (g_Debug.text && g_Debug.capacity > 0) + { + BOOLEAN owned; + ACQUIRE_MUTEX_ADAPTIVE (&g_Debug.lock, owned); + if (owned) + { + int i = 0; + + if (g_Debug.error || NewlineExists (g_Debug.text + g_Debug.in, (int)g_Debug.out - (int)g_Debug.in)) + { + while (i < (len - 1) && g_Debug.in < g_Debug.out) + { + const char c = g_Debug.text[g_Debug.in++]; + if (c == '\n') + break; + buf[i++] = c; + } + if (i < len) + buf[i] = '\0'; + } + + if (!i) + { + if (g_Debug.in == g_Debug.out) + { + g_Debug.in = g_Debug.out = 0; + if (g_Debug.error) + { + const unsigned int tlen = strlen (truncated); + if (tlen < g_Debug.capacity) + { + NdisMoveMemory (g_Debug.text, truncated, tlen+1); + g_Debug.out = tlen; + } + g_Debug.error = FALSE; + } + } + } + else + ret = TRUE; + + RELEASE_MUTEX (&g_Debug.lock); + } + } + return ret; +} + +VOID +PrMac (const MACADDR mac) +{ + DEBUGP (("%x:%x:%x:%x:%x:%x", + mac[0], mac[1], mac[2], + mac[3], mac[4], mac[5])); +} + +VOID +PrIP (IPADDR ip_addr) +{ + const unsigned char *ip = (const unsigned char *) &ip_addr; + + DEBUGP (("%d.%d.%d.%d", + ip[0], ip[1], ip[2], ip[3])); +} + +const char * +PrIPProto (int proto) +{ + switch (proto) + { + case IPPROTO_UDP: + return "UDP"; + + case IPPROTO_TCP: + return "TCP"; + + case IPPROTO_ICMP: + return "ICMP"; + + case IPPROTO_IGMP: + return "IGMP"; + + default: + return "???"; + } +} + +VOID +DumpARP (const char *prefix, const ARP_PACKET *arp) +{ + DEBUGP (("%s ARP src=", prefix)); + PrMac (arp->m_MAC_Source); + DEBUGP ((" dest=")); + PrMac (arp->m_MAC_Destination); + DEBUGP ((" OP=0x%04x", + (int)ntohs(arp->m_ARP_Operation))); + DEBUGP ((" M=0x%04x(%d)", + (int)ntohs(arp->m_MAC_AddressType), + (int)arp->m_MAC_AddressSize)); + DEBUGP ((" P=0x%04x(%d)", + (int)ntohs(arp->m_PROTO_AddressType), + (int)arp->m_PROTO_AddressSize)); + + DEBUGP ((" MacSrc=")); + PrMac (arp->m_ARP_MAC_Source); + DEBUGP ((" MacDest=")); + PrMac (arp->m_ARP_MAC_Destination); + + DEBUGP ((" IPSrc=")); + PrIP (arp->m_ARP_IP_Source); + DEBUGP ((" IPDest=")); + PrIP (arp->m_ARP_IP_Destination); + + DEBUGP (("\n")); +} + +struct ethpayload +{ + ETH_HEADER eth; + UCHAR payload[DEFAULT_PACKET_LOOKAHEAD]; +}; + +#ifdef ALLOW_PACKET_DUMP + +VOID +DumpPacket2( + __in const char *prefix, + __in const ETH_HEADER *eth, + __in const unsigned char *data, + __in unsigned int len + ) +{ + struct ethpayload *ep = (struct ethpayload *) MemAlloc (sizeof 
(struct ethpayload), TRUE); + if (ep) + { + if (len > DEFAULT_PACKET_LOOKAHEAD) + len = DEFAULT_PACKET_LOOKAHEAD; + ep->eth = *eth; + NdisMoveMemory (ep->payload, data, len); + DumpPacket (prefix, (unsigned char *) ep, sizeof (ETH_HEADER) + len); + MemFree (ep, sizeof (struct ethpayload)); + } +} + +VOID +DumpPacket( + __in const char *prefix, + __in const unsigned char *data, + __in unsigned int len + ) +{ + const ETH_HEADER *eth = (const ETH_HEADER *) data; + const IPHDR *ip = (const IPHDR *) (data + sizeof (ETH_HEADER)); + + if (len < sizeof (ETH_HEADER)) + { + DEBUGP (("%s TRUNCATED PACKET LEN=%d\n", prefix, len)); + return; + } + + // ARP Packet? + if (len >= sizeof (ARP_PACKET) && eth->proto == htons (ETH_P_ARP)) + { + DumpARP (prefix, (const ARP_PACKET *) data); + return; + } + + // IPv4 packet? + if (len >= (sizeof (IPHDR) + sizeof (ETH_HEADER)) + && eth->proto == htons (ETH_P_IP) + && IPH_GET_VER (ip->version_len) == 4) + { + const int hlen = IPH_GET_LEN (ip->version_len); + const int blen = len - sizeof (ETH_HEADER); + BOOLEAN did = FALSE; + + DEBUGP (("%s IPv4 %s[%d]", prefix, PrIPProto (ip->protocol), len)); + + if (!(ntohs (ip->tot_len) == blen && hlen <= blen)) + { + DEBUGP ((" XXX")); + return; + } + + // TCP packet? + if (ip->protocol == IPPROTO_TCP + && blen - hlen >= (sizeof (TCPHDR))) + { + const TCPHDR *tcp = (TCPHDR *) (data + sizeof (ETH_HEADER) + hlen); + DEBUGP ((" ")); + PrIP (ip->saddr); + DEBUGP ((":%d", ntohs (tcp->source))); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + DEBUGP ((":%d", ntohs (tcp->dest))); + did = TRUE; + } + + // UDP packet? + else if ((ntohs (ip->frag_off) & IP_OFFMASK) == 0 + && ip->protocol == IPPROTO_UDP + && blen - hlen >= (sizeof (UDPHDR))) + { + const UDPHDR *udp = (UDPHDR *) (data + sizeof (ETH_HEADER) + hlen); + + // DHCP packet? + if ((udp->dest == htons (BOOTPC_PORT) || udp->dest == htons (BOOTPS_PORT)) + && blen - hlen >= (sizeof (UDPHDR) + sizeof (DHCP))) + { + const DHCP *dhcp = (DHCP *) (data + + hlen + + sizeof (ETH_HEADER) + + sizeof (UDPHDR)); + + int optlen = len + - sizeof (ETH_HEADER) + - hlen + - sizeof (UDPHDR) + - sizeof (DHCP); + + if (optlen < 0) + optlen = 0; + + DumpDHCP (eth, ip, udp, dhcp, optlen); + did = TRUE; + } + + if (!did) + { + DEBUGP ((" ")); + PrIP (ip->saddr); + DEBUGP ((":%d", ntohs (udp->source))); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + DEBUGP ((":%d", ntohs (udp->dest))); + did = TRUE; + } + } + + if (!did) + { + DEBUGP ((" ipproto=%d ", ip->protocol)); + PrIP (ip->saddr); + DEBUGP ((" -> ")); + PrIP (ip->daddr); + } + + DEBUGP (("\n")); + return; + } + + { + DEBUGP (("%s ??? src=", prefix)); + PrMac (eth->src); + DEBUGP ((" dest=")); + PrMac (eth->dest); + DEBUGP ((" proto=0x%04x len=%d\n", + (int) ntohs(eth->proto), + len)); + } +} + +#endif // ALLOW_PACKET_DUMP + +#endif diff --git a/installer/tap/src/src/error.h b/installer/tap/src/src/error.h new file mode 100644 index 0000000..2ba39cc --- /dev/null +++ b/installer/tap/src/src/error.h @@ -0,0 +1,114 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. 
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//-----------------
+// DEBUGGING OUTPUT
+//-----------------
+
+extern const char *g_LastErrorFilename;
+extern int g_LastErrorLineNumber;
+
+// Debug info output
+#define ALSO_DBGPRINT        1
+#define DEBUGP_AT_DISPATCH   1
+
+// Uncomment line below to allow packet dumps
+//#define ALLOW_PACKET_DUMP  1
+
+#define NOTE_ERROR() \
+{ \
+  g_LastErrorFilename = __FILE__; \
+  g_LastErrorLineNumber = __LINE__; \
+}
+
+#if DBG
+
+typedef struct
+{
+  unsigned int in;
+  unsigned int out;
+  unsigned int capacity;
+  char *text;
+  BOOLEAN error;
+  MUTEX lock;
+} DebugOutput;
+
+VOID MyDebugPrint (const unsigned char* format, ...);
+
+VOID PrMac (const MACADDR mac);
+
+VOID PrIP (IPADDR ip_addr);
+
+#ifdef ALLOW_PACKET_DUMP
+
+VOID
+DumpPacket(
+    __in const char *prefix,
+    __in const unsigned char *data,
+    __in unsigned int len
+    );
+
+VOID
+DumpPacket2(
+    __in const char *prefix,
+    __in const ETH_HEADER *eth,
+    __in const unsigned char *data,
+    __in unsigned int len
+    );
+
+#else
+#define DUMP_PACKET(prefix, data, len)
+#define DUMP_PACKET2(prefix, eth, data, len)
+#endif
+
+#define CAN_WE_PRINT (DEBUGP_AT_DISPATCH || KeGetCurrentIrql () < DISPATCH_LEVEL)
+
+#if ALSO_DBGPRINT
+#define DEBUGP(fmt) { MyDebugPrint fmt; if (CAN_WE_PRINT) DbgPrint fmt; }
+#else
+#define DEBUGP(fmt) { MyDebugPrint fmt; }
+#endif
+
+#ifdef ALLOW_PACKET_DUMP
+
+#define DUMP_PACKET(prefix, data, len) \
+    DumpPacket (prefix, data, len)
+
+#define DUMP_PACKET2(prefix, eth, data, len) \
+    DumpPacket2 (prefix, eth, data, len)
+
+#endif
+
+BOOLEAN
+GetDebugLine (
+    __in char *buf,
+    __in const int len
+    );
+
+#else
+
+#define DEBUGP(fmt)
+#define DUMP_PACKET(prefix, data, len)
+#define DUMP_PACKET2(prefix, eth, data, len)
+
+#endif
diff --git a/installer/tap/src/src/hexdump.h b/installer/tap/src/src/hexdump.h
new file mode 100644
index 0000000..d6275c1
--- /dev/null
+++ b/installer/tap/src/src/hexdump.h
@@ -0,0 +1,63 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef HEXDUMP_DEFINED +#define HEXDUMP_DEFINED + +#ifdef __cplusplus +extern "C" { +#endif + +//===================================================================================== +// Debug Routines +//===================================================================================== + +#ifndef NDIS_MINIPORT_DRIVER +# include +# include +# include +# include +# include + +# ifndef DEBUGP +# define DEBUGP(fmt) { DbgMessage fmt; } +# endif + + extern VOID (*DbgMessage)(char *p_Format, ...); + + VOID DisplayDebugString (char *p_Format, ...); +#endif + +//=================================================================================== +// Reporting / Debugging +//=================================================================================== +#define IfPrint(c) (c >= 32 && c < 127 ? c : '.') + +VOID HexDump (unsigned char *p_Buffer, unsigned long p_Size); + +#ifdef __cplusplus +} +#endif + +#endif diff --git a/installer/tap/src/src/lock.h b/installer/tap/src/src/lock.h new file mode 100644 index 0000000..c80b164 --- /dev/null +++ b/installer/tap/src/src/lock.h @@ -0,0 +1,75 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +typedef struct +{ + volatile long count; +} MUTEX; + +#define MUTEX_SLEEP_TIME 10000 // microseconds + +#define INIT_MUTEX(m) { (m)->count = 0; } + +#define ACQUIRE_MUTEX_BLOCKING(m) \ +{ \ + while (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + NdisMSleep(MUTEX_SLEEP_TIME); \ + } \ +} + +#define RELEASE_MUTEX(m) \ +{ \ + NdisInterlockedDecrement(&((m)->count)); \ +} + +#define ACQUIRE_MUTEX_NONBLOCKING(m, result) \ +{ \ + if (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + result = FALSE; \ + } \ + else \ + { \ + result = TRUE; \ + } \ +} + +#define ACQUIRE_MUTEX_ADAPTIVE(m, result) \ +{ \ + result = TRUE; \ + while (NdisInterlockedIncrement (&((m)->count)) != 1) \ + { \ + NdisInterlockedDecrement(&((m)->count)); \ + if (KeGetCurrentIrql () < DISPATCH_LEVEL) \ + NdisMSleep(MUTEX_SLEEP_TIME); \ + else \ + { \ + result = FALSE; \ + break; \ + } \ + } \ +} diff --git a/installer/tap/src/src/macinfo.c b/installer/tap/src/src/macinfo.c new file mode 100644 index 0000000..dfd0a07 --- /dev/null +++ b/installer/tap/src/src/macinfo.c @@ -0,0 +1,164 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+
+#include "tap.h"
+
+int
+HexStringToDecimalInt (const int p_Character)
+{
+    int l_Value = 0;
+
+    if (p_Character >= 'A' && p_Character <= 'F')
+        l_Value = (p_Character - 'A') + 10;
+    else if (p_Character >= 'a' && p_Character <= 'f')
+        l_Value = (p_Character - 'a') + 10;
+    else if (p_Character >= '0' && p_Character <= '9')
+        l_Value = p_Character - '0';
+
+    return l_Value;
+}
+
+BOOLEAN
+ParseMAC (MACADDR dest, const char *src)
+{
+    int c;
+    int mac_index = 0;
+    BOOLEAN high_digit = FALSE;
+    int delim_action = 1;
+
+    ASSERT (src);
+    ASSERT (dest);
+
+    CLEAR_MAC (dest);
+
+    while (c = *src++)
+    {
+        if (IsMacDelimiter (c))
+        {
+            mac_index += delim_action;
+            high_digit = FALSE;
+            delim_action = 1;
+        }
+        else if (IsHexDigit (c))
+        {
+            const int digit = HexStringToDecimalInt (c);
+            if (mac_index < sizeof (MACADDR))
+            {
+                if (!high_digit)
+                {
+                    dest[mac_index] = (char)(digit);
+                    high_digit = TRUE;
+                    delim_action = 1;
+                }
+                else
+                {
+                    dest[mac_index] = (char)(dest[mac_index] * 16 + digit);
+                    ++mac_index;
+                    high_digit = FALSE;
+                    delim_action = 0;
+                }
+            }
+            else
+                return FALSE;
+        }
+        else
+            return FALSE;
+    }
+
+    return (mac_index + delim_action) >= sizeof (MACADDR);
+}
+
+/*
+ * Generate a MAC using the GUID in the adapter name.
+ *
+ * The MAC is constructed as 00:FF:xx:xx:xx:xx where
+ * the Xs are taken from the first 32 bits of the GUID in the
+ * adapter name. This is similar to the Linux 2.4 tap MAC
+ * generator, except Linux uses 32 random bits for the Xs.
+ *
+ * In general, this solution is reasonable for most
+ * applications except for very large bridged TAP networks,
+ * where the probability of address collisions becomes more
+ * than infinitesimal.
+ *
+ * Using the well-known "birthday paradox", on a 1000 node
+ * network the probability of collision would be
+ * 0.000116292153. On a 10,000 node network, the probability
+ * of collision would be 0.01157288998621678766.
+ */
+
+VOID
+GenerateRandomMac(
+    __in MACADDR mac,
+    __in const unsigned char *adapter_name
+    )
+{
+    unsigned const char *cp = adapter_name;
+    unsigned char c;
+    unsigned int i = 2;
+    unsigned int byte = 0;
+    int brace = 0;
+    int state = 0;
+
+    CLEAR_MAC (mac);
+
+    mac[0] = 0x00;
+    mac[1] = 0xFF;
+
+    while (c = *cp++)
+    {
+        if (i >= sizeof (MACADDR))
+            break;
+        if (c == '{')
+            brace = 1;
+        if (IsHexDigit (c) && brace)
+        {
+            const unsigned int digit = HexStringToDecimalInt (c);
+            if (state)
+            {
+                byte <<= 4;
+                byte |= digit;
+                mac[i++] = (unsigned char) byte;
+                state = 0;
+            }
+            else
+            {
+                byte = digit;
+                state = 1;
+            }
+        }
+    }
+}
+
+VOID
+GenerateRelatedMAC(
+    __in MACADDR dest,
+    __in const MACADDR src,
+    __in const int delta
+    )
+{
+    ETH_COPY_NETWORK_ADDRESS (dest, src);
+    dest[2] += (UCHAR) delta;
+}
diff --git a/installer/tap/src/src/macinfo.h b/installer/tap/src/src/macinfo.h
new file mode 100644
index 0000000..dd88b6f
--- /dev/null
+++ b/installer/tap/src/src/macinfo.h
@@ -0,0 +1,53 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef MacInfoDefined +#define MacInfoDefined + +//=================================================================================== +// Macros +//=================================================================================== +#define IsMacDelimiter(a) (a == ':' || a == '-' || a == '.') +#define IsHexDigit(c) ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')) + +#define CLEAR_MAC(dest) NdisZeroMemory ((dest), sizeof (MACADDR)) +#define MAC_EQUAL(a,b) (memcmp ((a), (b), sizeof (MACADDR)) == 0) + +BOOLEAN +ParseMAC (MACADDR dest, const char *src); + +VOID +GenerateRandomMac( + __in MACADDR mac, + __in const unsigned char *adapter_name + ); + +VOID +GenerateRelatedMAC( + __in MACADDR dest, + __in const MACADDR src, + __in const int delta + ); + +#endif diff --git a/installer/tap/src/src/mem.c b/installer/tap/src/src/mem.c new file mode 100644 index 0000000..78bfa22 --- /dev/null +++ b/installer/tap/src/src/mem.c @@ -0,0 +1,384 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//------------------ +// Memory Management +//------------------ + +#include "tap.h" + +PVOID +MemAlloc( + __in ULONG p_Size, + __in BOOLEAN zero + ) +{ + PVOID l_Return = NULL; + + if (p_Size) + { + __try + { + if (NdisAllocateMemoryWithTag (&l_Return, p_Size, 'APAT') + == NDIS_STATUS_SUCCESS) + { + if (zero) + { + NdisZeroMemory (l_Return, p_Size); + } + } + else + { + l_Return = NULL; + } + } + __except (EXCEPTION_EXECUTE_HANDLER) + { + l_Return = NULL; + } + } + + return l_Return; +} + +VOID +MemFree( + __in PVOID p_Addr, + __in ULONG p_Size + ) +{ + if (p_Addr && p_Size) + { + __try + { +#if DBG + NdisZeroMemory (p_Addr, p_Size); +#endif + NdisFreeMemory (p_Addr, p_Size, 0); + } + __except (EXCEPTION_EXECUTE_HANDLER) + { + } + } +} + +//====================================================================== +// TAP Packet Queue Support +//====================================================================== + +VOID +tapPacketQueueInsertTail( + __in PTAP_PACKET_QUEUE TapPacketQueue, + __in PTAP_PACKET TapPacket + ) +{ + KIRQL irql; + + KeAcquireSpinLock(&TapPacketQueue->QueueLock,&irql); + + InsertTailList(&TapPacketQueue->Queue,&TapPacket->QueueLink); + + // BUGBUG!!! Enforce PACKET_QUEUE_SIZE queue count limit??? + // For NDIS 6 there is no per-packet status, so this will need to + // be handled on per-NBL basis in AdapterSendNetBufferLists... + + // Update counts + ++TapPacketQueue->Count; + + if(TapPacketQueue->Count > TapPacketQueue->MaxCount) + { + TapPacketQueue->MaxCount = TapPacketQueue->Count; + + DEBUGP (("[TAP] tapPacketQueueInsertTail: New MAX queued packet count = %d\n", + TapPacketQueue->MaxCount)); + } + + KeReleaseSpinLock(&TapPacketQueue->QueueLock,irql); +} + +// Call with QueueLock held +PTAP_PACKET +tapPacketRemoveHeadLocked( + __in PTAP_PACKET_QUEUE TapPacketQueue + ) +{ + PTAP_PACKET tapPacket = NULL; + PLIST_ENTRY listEntry; + + listEntry = RemoveHeadList(&TapPacketQueue->Queue); + + if(listEntry != &TapPacketQueue->Queue) + { + tapPacket = CONTAINING_RECORD(listEntry, TAP_PACKET, QueueLink); + + // Update counts + --TapPacketQueue->Count; + } + + return tapPacket; +} + +VOID +tapPacketQueueInitialize( + __in PTAP_PACKET_QUEUE TapPacketQueue + ) +{ + KeInitializeSpinLock(&TapPacketQueue->QueueLock); + + NdisInitializeListHead(&TapPacketQueue->Queue); +} + +//====================================================================== +// TAP Cancel-Safe Queue Support +//====================================================================== + +VOID +tapIrpCsqInsert ( + __in struct _IO_CSQ *Csq, + __in PIRP Irp + ) +{ + PTAP_IRP_CSQ tapIrpCsq; + + tapIrpCsq = (PTAP_IRP_CSQ )Csq; + + InsertTailList( + &tapIrpCsq->Queue, + &Irp->Tail.Overlay.ListEntry + ); + + // Update counts + ++tapIrpCsq->Count; + + if(tapIrpCsq->Count > tapIrpCsq->MaxCount) + { + tapIrpCsq->MaxCount = tapIrpCsq->Count; + + DEBUGP (("[TAP] tapIrpCsqInsert: New MAX queued IRP count = %d\n", + tapIrpCsq->MaxCount)); + } +} + +VOID +tapIrpCsqRemoveIrp( + __in PIO_CSQ Csq, + __in PIRP Irp + ) +{ + PTAP_IRP_CSQ tapIrpCsq; + + tapIrpCsq = (PTAP_IRP_CSQ )Csq; + + // Update counts + --tapIrpCsq->Count; + + RemoveEntryList(&Irp->Tail.Overlay.ListEntry); +} + + +PIRP +tapIrpCsqPeekNextIrp( + __in PIO_CSQ Csq, + __in PIRP Irp, + __in 
PVOID PeekContext
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+    PIRP nextIrp = NULL;
+    PLIST_ENTRY nextEntry;
+    PLIST_ENTRY listHead;
+    PIO_STACK_LOCATION irpStack;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    listHead = &tapIrpCsq->Queue;
+
+    //
+    // If the IRP is NULL, we will start peeking from the listhead, else
+    // we will start from that IRP onwards. This is done under the
+    // assumption that new IRPs are always inserted at the tail.
+    //
+
+    if (Irp == NULL)
+    {
+        nextEntry = listHead->Flink;
+    }
+    else
+    {
+        nextEntry = Irp->Tail.Overlay.ListEntry.Flink;
+    }
+
+    while(nextEntry != listHead)
+    {
+        nextIrp = CONTAINING_RECORD(nextEntry, IRP, Tail.Overlay.ListEntry);
+
+        irpStack = IoGetCurrentIrpStackLocation(nextIrp);
+
+        //
+        // If context is present, continue until you find a matching one.
+        // Else you break out as you got next one.
+        //
+        if (PeekContext)
+        {
+            if (irpStack->FileObject == (PFILE_OBJECT) PeekContext)
+            {
+                break;
+            }
+        }
+        else
+        {
+            break;
+        }
+
+        nextIrp = NULL;
+        nextEntry = nextEntry->Flink;
+    }
+
+    return nextIrp;
+}
+
+//
+// tapIrpCsqAcquireQueueLock modifies the execution level of the current processor.
+//
+// KeAcquireSpinLock raises the execution level to Dispatch Level and stores
+// the current execution level in the Irql parameter to be restored at a later
+// time. KeAcquireSpinLock also requires us to be running at no higher than
+// Dispatch level when it is called.
+//
+// The annotations reflect these changes and requirements.
+//
+
+__drv_raisesIRQL(DISPATCH_LEVEL)
+__drv_maxIRQL(DISPATCH_LEVEL)
+VOID
+tapIrpCsqAcquireQueueLock(
+    __in PIO_CSQ Csq,
+    __out PKIRQL Irql
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    //
+    // Suppressing because the address below csq is valid since it's
+    // part of TAP_ADAPTER_CONTEXT structure.
+    //
+#pragma prefast(suppress: __WARNING_BUFFER_UNDERFLOW, "Underflow using expression 'adapter->PendingReadCsqQueueLock'")
+    KeAcquireSpinLock(&tapIrpCsq->QueueLock, Irql);
+}
+
+//
+// tapIrpCsqReleaseQueueLock modifies the execution level of the current processor.
+//
+// KeReleaseSpinLock assumes we already hold the spin lock and are therefore
+// running at Dispatch level. It will use the Irql parameter saved in a
+// previous call to KeAcquireSpinLock to return the thread back to its original
+// execution level.
+//
+// The annotations reflect these changes and requirements.
+//
+
+__drv_requiresIRQL(DISPATCH_LEVEL)
+VOID
+tapIrpCsqReleaseQueueLock(
+    __in PIO_CSQ Csq,
+    __in KIRQL Irql
+    )
+{
+    PTAP_IRP_CSQ tapIrpCsq;
+
+    tapIrpCsq = (PTAP_IRP_CSQ )Csq;
+
+    //
+    // Suppressing because the address below csq is valid since it's
+    // part of TAP_ADAPTER_CONTEXT structure.
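+    // (The IO_CSQ is embedded in the TAP_IRP_CSQ, which itself lives inside
+    // the adapter context; pointer arithmetic that reaches past the IO_CSQ
+    // therefore stays within the containing allocation, which is what the
+    // suppression below asserts to PREfast.)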
+ // +#pragma prefast(suppress: __WARNING_BUFFER_UNDERFLOW, "Underflow using expression 'adapter->PendingReadCsqQueueLock'") + KeReleaseSpinLock(&tapIrpCsq->QueueLock, Irql); +} + +VOID +tapIrpCsqCompleteCanceledIrp( + __in PIO_CSQ pCsq, + __in PIRP Irp + ) +{ + UNREFERENCED_PARAMETER(pCsq); + + Irp->IoStatus.Status = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest(Irp, IO_NO_INCREMENT); +} + +VOID +tapIrpCsqInitialize( + __in PTAP_IRP_CSQ TapIrpCsq + ) +{ + KeInitializeSpinLock(&TapIrpCsq->QueueLock); + + NdisInitializeListHead(&TapIrpCsq->Queue); + + IoCsqInitialize( + &TapIrpCsq->CsqQueue, + tapIrpCsqInsert, + tapIrpCsqRemoveIrp, + tapIrpCsqPeekNextIrp, + tapIrpCsqAcquireQueueLock, + tapIrpCsqReleaseQueueLock, + tapIrpCsqCompleteCanceledIrp + ); +} + +VOID +tapIrpCsqFlush( + __in PTAP_IRP_CSQ TapIrpCsq + ) +{ + PIRP pendingIrp; + + // + // Flush the pending read IRP queue. + // + pendingIrp = IoCsqRemoveNextIrp( + &TapIrpCsq->CsqQueue, + NULL + ); + + while(pendingIrp) + { + // Cancel the IRP + pendingIrp->IoStatus.Information = 0; + pendingIrp->IoStatus.Status = STATUS_CANCELLED; + IoCompleteRequest(pendingIrp, IO_NO_INCREMENT); + + pendingIrp = IoCsqRemoveNextIrp( + &TapIrpCsq->CsqQueue, + NULL + ); + } + + ASSERT(IsListEmpty(&TapIrpCsq->Queue)); +} diff --git a/installer/tap/src/src/mem.h b/installer/tap/src/src/mem.h new file mode 100644 index 0000000..d10d536 --- /dev/null +++ b/installer/tap/src/src/mem.h @@ -0,0 +1,108 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//------------------ +// Memory Management +//------------------ + +PVOID +MemAlloc( + __in ULONG p_Size, + __in BOOLEAN zero + ); + +VOID +MemFree( + __in PVOID p_Addr, + __in ULONG p_Size + ); + +//====================================================================== +// TAP Packet Queue +//====================================================================== + +typedef +struct _TAP_PACKET +{ + LIST_ENTRY QueueLink; + +# define TAP_PACKET_SIZE(data_size) (sizeof (TAP_PACKET) + (data_size)) +# define TP_TUN 0x80000000 +# define TP_SIZE_MASK (~TP_TUN) + ULONG m_SizeFlags; + + // m_Data must be the last struct member + UCHAR m_Data []; +} TAP_PACKET, *PTAP_PACKET; + +#define TAP_PACKET_TAG '6PAT' // "TAP6" + +typedef struct _TAP_PACKET_QUEUE +{ + KSPIN_LOCK QueueLock; + LIST_ENTRY Queue; + ULONG Count; // Count of currently queued items + ULONG MaxCount; +} TAP_PACKET_QUEUE, *PTAP_PACKET_QUEUE; + +VOID +tapPacketQueueInsertTail( + __in PTAP_PACKET_QUEUE TapPacketQueue, + __in PTAP_PACKET TapPacket + ); + + +// Call with QueueLock held +PTAP_PACKET +tapPacketRemoveHeadLocked( + __in PTAP_PACKET_QUEUE TapPacketQueue + ); + +VOID +tapPacketQueueInitialize( + __in PTAP_PACKET_QUEUE TapPacketQueue + ); + +//---------------------- +// Cancel-Safe IRP Queue +//---------------------- + +typedef struct _TAP_IRP_CSQ +{ + IO_CSQ CsqQueue; + KSPIN_LOCK QueueLock; + LIST_ENTRY Queue; + ULONG Count; // Count of currently queued items + ULONG MaxCount; +} TAP_IRP_CSQ, *PTAP_IRP_CSQ; + +VOID +tapIrpCsqInitialize( + __in PTAP_IRP_CSQ TapIrpCsq + ); + +VOID +tapIrpCsqFlush( + __in PTAP_IRP_CSQ TapIrpCsq + ); diff --git a/installer/tap/src/src/oidrequest.c b/installer/tap/src/src/oidrequest.c new file mode 100644 index 0000000..a6882f8 --- /dev/null +++ b/installer/tap/src/src/oidrequest.c @@ -0,0 +1,1028 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
+// + +#include "tap.h" + +#ifndef DBG + +#define DBG_PRINT_OID_NAME + +#else + +VOID +DBG_PRINT_OID_NAME( + __in NDIS_OID Oid + ) +{ + PCHAR oidName = NULL; + + switch (Oid){ + + #undef MAKECASE + #define MAKECASE(oidx) case oidx: oidName = #oidx "\n"; break; + + /* Operational OIDs */ + MAKECASE(OID_GEN_SUPPORTED_LIST) + MAKECASE(OID_GEN_HARDWARE_STATUS) + MAKECASE(OID_GEN_MEDIA_SUPPORTED) + MAKECASE(OID_GEN_MEDIA_IN_USE) + MAKECASE(OID_GEN_MAXIMUM_LOOKAHEAD) + MAKECASE(OID_GEN_MAXIMUM_FRAME_SIZE) + MAKECASE(OID_GEN_LINK_SPEED) + MAKECASE(OID_GEN_TRANSMIT_BUFFER_SPACE) + MAKECASE(OID_GEN_RECEIVE_BUFFER_SPACE) + MAKECASE(OID_GEN_TRANSMIT_BLOCK_SIZE) + MAKECASE(OID_GEN_RECEIVE_BLOCK_SIZE) + MAKECASE(OID_GEN_VENDOR_ID) + MAKECASE(OID_GEN_VENDOR_DESCRIPTION) + MAKECASE(OID_GEN_VENDOR_DRIVER_VERSION) + MAKECASE(OID_GEN_CURRENT_PACKET_FILTER) + MAKECASE(OID_GEN_CURRENT_LOOKAHEAD) + MAKECASE(OID_GEN_DRIVER_VERSION) + MAKECASE(OID_GEN_MAXIMUM_TOTAL_SIZE) + MAKECASE(OID_GEN_PROTOCOL_OPTIONS) + MAKECASE(OID_GEN_MAC_OPTIONS) + MAKECASE(OID_GEN_MEDIA_CONNECT_STATUS) + MAKECASE(OID_GEN_MAXIMUM_SEND_PACKETS) + MAKECASE(OID_GEN_SUPPORTED_GUIDS) + MAKECASE(OID_GEN_NETWORK_LAYER_ADDRESSES) + MAKECASE(OID_GEN_TRANSPORT_HEADER_OFFSET) + MAKECASE(OID_GEN_MEDIA_CAPABILITIES) + MAKECASE(OID_GEN_PHYSICAL_MEDIUM) + MAKECASE(OID_GEN_MACHINE_NAME) + MAKECASE(OID_GEN_VLAN_ID) + MAKECASE(OID_GEN_RNDIS_CONFIG_PARAMETER) + + /* Operational OIDs for NDIS 6.0 */ + MAKECASE(OID_GEN_MAX_LINK_SPEED) + MAKECASE(OID_GEN_LINK_STATE) + MAKECASE(OID_GEN_LINK_PARAMETERS) + MAKECASE(OID_GEN_MINIPORT_RESTART_ATTRIBUTES) + MAKECASE(OID_GEN_ENUMERATE_PORTS) + MAKECASE(OID_GEN_PORT_STATE) + MAKECASE(OID_GEN_PORT_AUTHENTICATION_PARAMETERS) + MAKECASE(OID_GEN_INTERRUPT_MODERATION) + MAKECASE(OID_GEN_PHYSICAL_MEDIUM_EX) + + /* Statistical OIDs */ + MAKECASE(OID_GEN_XMIT_OK) + MAKECASE(OID_GEN_RCV_OK) + MAKECASE(OID_GEN_XMIT_ERROR) + MAKECASE(OID_GEN_RCV_ERROR) + MAKECASE(OID_GEN_RCV_NO_BUFFER) + MAKECASE(OID_GEN_DIRECTED_BYTES_XMIT) + MAKECASE(OID_GEN_DIRECTED_FRAMES_XMIT) + MAKECASE(OID_GEN_MULTICAST_BYTES_XMIT) + MAKECASE(OID_GEN_MULTICAST_FRAMES_XMIT) + MAKECASE(OID_GEN_BROADCAST_BYTES_XMIT) + MAKECASE(OID_GEN_BROADCAST_FRAMES_XMIT) + MAKECASE(OID_GEN_DIRECTED_BYTES_RCV) + MAKECASE(OID_GEN_DIRECTED_FRAMES_RCV) + MAKECASE(OID_GEN_MULTICAST_BYTES_RCV) + MAKECASE(OID_GEN_MULTICAST_FRAMES_RCV) + MAKECASE(OID_GEN_BROADCAST_BYTES_RCV) + MAKECASE(OID_GEN_BROADCAST_FRAMES_RCV) + MAKECASE(OID_GEN_RCV_CRC_ERROR) + MAKECASE(OID_GEN_TRANSMIT_QUEUE_LENGTH) + + /* Statistical OIDs for NDIS 6.0 */ + MAKECASE(OID_GEN_STATISTICS) + MAKECASE(OID_GEN_BYTES_RCV) + MAKECASE(OID_GEN_BYTES_XMIT) + MAKECASE(OID_GEN_RCV_DISCARDS) + MAKECASE(OID_GEN_XMIT_DISCARDS) + + /* Misc OIDs */ + MAKECASE(OID_GEN_GET_TIME_CAPS) + MAKECASE(OID_GEN_GET_NETCARD_TIME) + MAKECASE(OID_GEN_NETCARD_LOAD) + MAKECASE(OID_GEN_DEVICE_PROFILE) + MAKECASE(OID_GEN_INIT_TIME_MS) + MAKECASE(OID_GEN_RESET_COUNTS) + MAKECASE(OID_GEN_MEDIA_SENSE_COUNTS) + + /* PnP power management operational OIDs */ + MAKECASE(OID_PNP_CAPABILITIES) + MAKECASE(OID_PNP_SET_POWER) + MAKECASE(OID_PNP_QUERY_POWER) + MAKECASE(OID_PNP_ADD_WAKE_UP_PATTERN) + MAKECASE(OID_PNP_REMOVE_WAKE_UP_PATTERN) + MAKECASE(OID_PNP_ENABLE_WAKE_UP) + MAKECASE(OID_PNP_WAKE_UP_PATTERN_LIST) + + /* PnP power management statistical OIDs */ + MAKECASE(OID_PNP_WAKE_UP_ERROR) + MAKECASE(OID_PNP_WAKE_UP_OK) + + /* Ethernet operational OIDs */ + MAKECASE(OID_802_3_PERMANENT_ADDRESS) + MAKECASE(OID_802_3_CURRENT_ADDRESS) + 
MAKECASE(OID_802_3_MULTICAST_LIST) + MAKECASE(OID_802_3_MAXIMUM_LIST_SIZE) + MAKECASE(OID_802_3_MAC_OPTIONS) + + /* Ethernet operational OIDs for NDIS 6.0 */ + MAKECASE(OID_802_3_ADD_MULTICAST_ADDRESS) + MAKECASE(OID_802_3_DELETE_MULTICAST_ADDRESS) + + /* Ethernet statistical OIDs */ + MAKECASE(OID_802_3_RCV_ERROR_ALIGNMENT) + MAKECASE(OID_802_3_XMIT_ONE_COLLISION) + MAKECASE(OID_802_3_XMIT_MORE_COLLISIONS) + MAKECASE(OID_802_3_XMIT_DEFERRED) + MAKECASE(OID_802_3_XMIT_MAX_COLLISIONS) + MAKECASE(OID_802_3_RCV_OVERRUN) + MAKECASE(OID_802_3_XMIT_UNDERRUN) + MAKECASE(OID_802_3_XMIT_HEARTBEAT_FAILURE) + MAKECASE(OID_802_3_XMIT_TIMES_CRS_LOST) + MAKECASE(OID_802_3_XMIT_LATE_COLLISIONS) + + /* TCP/IP OIDs */ + MAKECASE(OID_TCP_TASK_OFFLOAD) + MAKECASE(OID_TCP_TASK_IPSEC_ADD_SA) + MAKECASE(OID_TCP_TASK_IPSEC_DELETE_SA) + MAKECASE(OID_TCP_SAN_SUPPORT) + MAKECASE(OID_TCP_TASK_IPSEC_ADD_UDPESP_SA) + MAKECASE(OID_TCP_TASK_IPSEC_DELETE_UDPESP_SA) + MAKECASE(OID_TCP4_OFFLOAD_STATS) + MAKECASE(OID_TCP6_OFFLOAD_STATS) + MAKECASE(OID_IP4_OFFLOAD_STATS) + MAKECASE(OID_IP6_OFFLOAD_STATS) + + /* TCP offload OIDs for NDIS 6 */ + MAKECASE(OID_TCP_OFFLOAD_CURRENT_CONFIG) + MAKECASE(OID_TCP_OFFLOAD_PARAMETERS) + MAKECASE(OID_TCP_OFFLOAD_HARDWARE_CAPABILITIES) + MAKECASE(OID_TCP_CONNECTION_OFFLOAD_CURRENT_CONFIG) + MAKECASE(OID_TCP_CONNECTION_OFFLOAD_HARDWARE_CAPABILITIES) + MAKECASE(OID_OFFLOAD_ENCAPSULATION) + +#if (NDIS_SUPPORT_NDIS620) + /* VMQ OIDs for NDIS 6.20 */ + MAKECASE(OID_RECEIVE_FILTER_FREE_QUEUE) + MAKECASE(OID_RECEIVE_FILTER_CLEAR_FILTER) + MAKECASE(OID_RECEIVE_FILTER_ALLOCATE_QUEUE) + MAKECASE(OID_RECEIVE_FILTER_QUEUE_ALLOCATION_COMPLETE) + MAKECASE(OID_RECEIVE_FILTER_SET_FILTER) +#endif + +#if (NDIS_SUPPORT_NDIS630) + /* NDIS QoS OIDs for NDIS 6.30 */ + MAKECASE(OID_QOS_PARAMETERS) +#endif + } + + if (oidName) + { + DEBUGP(("OID: %s", oidName)); + } + else + { + DEBUGP(("<** Unknown OID 0x%08x **>\n", Oid)); + } +} + +#endif // DBG + +//====================================================================== +// TAP NDIS 6 OID Request Callbacks +//====================================================================== + +NDIS_STATUS +tapSetMulticastList( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNDIS_OID_REQUEST OidRequest + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + + // + // Initialize. + // + OidRequest->DATA.SET_INFORMATION.BytesNeeded = MACADDR_SIZE; + OidRequest->DATA.SET_INFORMATION.BytesRead + = OidRequest->DATA.SET_INFORMATION.InformationBufferLength; + + + do + { + if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength % MACADDR_SIZE) + { + status = NDIS_STATUS_INVALID_LENGTH; + break; + } + + if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength > (TAP_MAX_MCAST_LIST * MACADDR_SIZE)) + { + status = NDIS_STATUS_MULTICAST_FULL; + OidRequest->DATA.SET_INFORMATION.BytesNeeded = TAP_MAX_MCAST_LIST * MACADDR_SIZE; + break; + } + + // BUGBUG!!! Is lock needed??? If so, use NDIS_RW_LOCK. Also apply to packet filter. + + NdisZeroMemory(Adapter->MCList, + TAP_MAX_MCAST_LIST * MACADDR_SIZE); + + NdisMoveMemory(Adapter->MCList, + OidRequest->DATA.SET_INFORMATION.InformationBuffer, + OidRequest->DATA.SET_INFORMATION.InformationBufferLength); + + Adapter->ulMCListSize = OidRequest->DATA.SET_INFORMATION.InformationBufferLength / MACADDR_SIZE; + + } while(FALSE); + return status; +} + +NDIS_STATUS +tapSetPacketFilter( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in ULONG PacketFilter + ) +{ + NDIS_STATUS status = NDIS_STATUS_SUCCESS; + + // any bits not supported? 
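+    // TAP_SUPPORTED_FILTERS is the mask of NDIS_PACKET_TYPE_* bits this
+    // miniport honors; a set request containing any bit outside that mask
+    // is rejected with NDIS_STATUS_NOT_SUPPORTED below.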
+    if (PacketFilter & ~(TAP_SUPPORTED_FILTERS))
+    {
+        DEBUGP (("[TAP] Unsupported packet filter: 0x%08x\n", PacketFilter));
+        status = NDIS_STATUS_NOT_SUPPORTED;
+    }
+    else
+    {
+        // Any actual filtering changes?
+        if (PacketFilter != Adapter->PacketFilter)
+        {
+            //
+            // Change the filtering modes on hardware
+            //
+
+            // Save the new packet filter value
+            Adapter->PacketFilter = PacketFilter;
+        }
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterSetPowerD0(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    )
+/*++
+Routine Description:
+
+    NIC power has been restored to the working power state (D0).
+    Prepare the NIC for normal operation:
+        - Restore hardware context (packet filters, multicast addresses, MAC address, etc.)
+        - Enable interrupts and the NIC's DMA engine.
+
+Arguments:
+
+    Adapter     - Pointer to adapter block
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] PowerState: Fully powered\n"));
+
+    // Start data path...
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterSetPowerLow(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in NDIS_DEVICE_POWER_STATE PowerState
+    )
+/*++
+Routine Description:
+
+    The NIC is about to be transitioned to a low power state.
+    Prepare the NIC for the sleeping state:
+        - Disable interrupts and the NIC's DMA engine, cancel timers.
+        - Save any hardware context that the NIC cannot preserve in
+          a sleeping state (packet filters, multicast addresses,
+          the current MAC address, etc.)
+    A miniport driver cannot access the NIC hardware after
+    the NIC has been set to the D3 state by the bus driver.
+
+    Miniport drivers NDIS v6.30 and above:
+        - Do NOT wait for NDIS to return the ownership of all
+          NBLs from outstanding receive indications.
+        - Retain ownership of all the receive descriptors and
+          packet buffers previously owned by the hardware.
+
+Arguments:
+
+    Adapter     - Pointer to adapter block
+    PowerState  - New power state
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DEBUGP (("[TAP] PowerState: Low-power\n"));
+
+    //
+    // Miniport drivers NDIS v6.20 and below are
+    // paused prior to the low power transition
+    //
+
+    // Check for paused state...
+    // Verify data path stopped...
+
+    return status;
+}
+
+NDIS_STATUS
+tapSetInformation(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Helper function to perform a set OID request
+
+Arguments:
+
+    Adapter     -
+    OidRequest  - The OID request to set
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+
+    DBG_PRINT_OID_NAME(OidRequest->DATA.SET_INFORMATION.Oid);
+
+    switch(OidRequest->DATA.SET_INFORMATION.Oid)
+    {
+    case OID_802_3_MULTICAST_LIST:
+        //
+        // Set the multicast address list on the NIC for packet reception.
+        // The NIC driver can set a limit on the number of multicast
+        // addresses bound protocol drivers can enable simultaneously.
+        // NDIS returns NDIS_STATUS_MULTICAST_FULL if a protocol driver
+        // exceeds this limit or if it specifies an invalid multicast
+        // address.
+        //
+        status = tapSetMulticastList(Adapter,OidRequest);
+        break;
+
+    case OID_GEN_CURRENT_LOOKAHEAD:
+        //
+        // A protocol driver can set a suggested value for the number
+        // of bytes to be used in its binding; however, the underlying
+        // NIC driver is never required to limit its indications to
+        // the value set.
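+        // (The handler below simply records the requested value in
+        // Adapter->ulLookahead; the TAP driver does not otherwise act on it.)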
+        //
+        if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength != sizeof(ULONG))
+        {
+            OidRequest->DATA.SET_INFORMATION.BytesNeeded = sizeof(ULONG);
+            status = NDIS_STATUS_INVALID_LENGTH;
+            break;
+        }
+
+        Adapter->ulLookahead = *(PULONG)OidRequest->DATA.SET_INFORMATION.InformationBuffer;
+
+        OidRequest->DATA.SET_INFORMATION.BytesRead = sizeof(ULONG);
+        status = NDIS_STATUS_SUCCESS;
+        break;
+
+    case OID_GEN_CURRENT_PACKET_FILTER:
+        //
+        // Program the hardware to indicate the packets
+        // of certain filter types.
+        //
+        if(OidRequest->DATA.SET_INFORMATION.InformationBufferLength != sizeof(ULONG))
+        {
+            OidRequest->DATA.SET_INFORMATION.BytesNeeded = sizeof(ULONG);
+            status = NDIS_STATUS_INVALID_LENGTH;
+            break;
+        }
+
+        OidRequest->DATA.SET_INFORMATION.BytesRead
+            = OidRequest->DATA.SET_INFORMATION.InformationBufferLength;
+
+        status = tapSetPacketFilter(
+                    Adapter,
+                    *((PULONG)OidRequest->DATA.SET_INFORMATION.InformationBuffer)
+                    );
+
+        break;
+
+    case OID_PNP_SET_POWER:
+        {
+            // Sanity check.
+            if (OidRequest->DATA.SET_INFORMATION.InformationBufferLength
+                    < sizeof(NDIS_DEVICE_POWER_STATE)
+                )
+            {
+                status = NDIS_STATUS_INVALID_LENGTH;
+            }
+            else
+            {
+                NDIS_DEVICE_POWER_STATE PowerState;
+
+                PowerState = *(PNDIS_DEVICE_POWER_STATE UNALIGNED)OidRequest->DATA.SET_INFORMATION.InformationBuffer;
+                OidRequest->DATA.SET_INFORMATION.BytesRead = sizeof(NDIS_DEVICE_POWER_STATE);
+
+                if(PowerState < NdisDeviceStateD0 ||
+                    PowerState > NdisDeviceStateD3)
+                {
+                    status = NDIS_STATUS_INVALID_DATA;
+                }
+                else
+                {
+                    Adapter->CurrentPowerState = PowerState;
+
+                    if (PowerState == NdisDeviceStateD0)
+                    {
+                        status = AdapterSetPowerD0(Adapter);
+                    }
+                    else
+                    {
+                        status = AdapterSetPowerLow(Adapter, PowerState);
+                    }
+                }
+            }
+        }
+        break;
+
+#if (NDIS_SUPPORT_NDIS61)
+    case OID_PNP_ADD_WAKE_UP_PATTERN:
+    case OID_PNP_REMOVE_WAKE_UP_PATTERN:
+    case OID_PNP_ENABLE_WAKE_UP:
+#endif
+        ASSERT(!"NIC does not support wake on LAN OIDs");
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+tapQueryInformation(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Helper function to perform a query OID request
+
+Arguments:
+
+    Adapter     -
+    OidRequest  - The OID request that is being queried
+
+Return Value:
+
+    NDIS_STATUS
+
+--*/
+{
+    NDIS_STATUS status = NDIS_STATUS_SUCCESS;
+    NDIS_MEDIUM Medium = TAP_MEDIUM_TYPE;
+    NDIS_HARDWARE_STATUS HardwareStatus = NdisHardwareStatusReady;
+    UCHAR VendorDesc[] = TAP_VENDOR_DESC;
+    ULONG ulInfo;
+    USHORT usInfo;
+    ULONG64 ulInfo64;
+
+    // Default to returning the ULONG value
+    PVOID pInfo=NULL;
+    ULONG ulInfoLen = sizeof(ulInfo);
+
+    // ATTENTION!!! Ignore OIDs too noisy to print...
+    if((OidRequest->DATA.QUERY_INFORMATION.Oid != OID_GEN_STATISTICS)
+        && (OidRequest->DATA.QUERY_INFORMATION.Oid != OID_IP4_OFFLOAD_STATS)
+        && (OidRequest->DATA.QUERY_INFORMATION.Oid != OID_IP6_OFFLOAD_STATS)
+        )
+    {
+        DBG_PRINT_OID_NAME(OidRequest->DATA.QUERY_INFORMATION.Oid);
+    }
+
+    // Dispatch based on object identifier (OID).
+    switch(OidRequest->DATA.QUERY_INFORMATION.Oid)
+    {
+    case OID_GEN_HARDWARE_STATUS:
+        //
+        // Specify the current hardware status of the underlying NIC as
+        // one of the following NDIS_HARDWARE_STATUS-type values.
+        //
+        pInfo = (PVOID) &HardwareStatus;
+        ulInfoLen = sizeof(NDIS_HARDWARE_STATUS);
+        break;
+
+    case OID_802_3_PERMANENT_ADDRESS:
+        //
+        // Return the MAC address of the NIC burnt in the hardware.
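+        // (For a virtual adapter there is no address burnt into hardware;
+        // Adapter->PermanentAddress is synthesized when the adapter is
+        // initialized.)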
+        //
+        pInfo = Adapter->PermanentAddress;
+        ulInfoLen = MACADDR_SIZE;
+        break;
+
+    case OID_802_3_CURRENT_ADDRESS:
+        //
+        // Return the MAC address the NIC is currently programmed to
+        // use. Note that this address could be different from the
+        // permanent address, as the user can override it via the
+        // registry. See the NdisReadNetworkAddress documentation for
+        // more info.
+        //
+        pInfo = Adapter->CurrentAddress;
+        ulInfoLen = MACADDR_SIZE;
+        break;
+
+    case OID_GEN_MEDIA_SUPPORTED:
+        //
+        // Return an array of media that are supported by the miniport.
+        // This miniport only supports one medium (Ethernet), so the OID
+        // returns identical results to OID_GEN_MEDIA_IN_USE.
+        //
+
+        __fallthrough;
+
+    case OID_GEN_MEDIA_IN_USE:
+        //
+        // Return an array of media that are currently in use by the
+        // miniport. This array should be a subset of the array returned
+        // by OID_GEN_MEDIA_SUPPORTED.
+        //
+        pInfo = &Medium;
+        ulInfoLen = sizeof(Medium);
+        break;
+
+    case OID_GEN_MAXIMUM_TOTAL_SIZE:
+        //
+        // Specify the maximum total packet length, in bytes, the NIC
+        // supports including the header. A protocol driver might use
+        // this returned length as a gauge to determine the maximum
+        // size packet that a NIC driver could forward to the
+        // protocol driver. The miniport driver must never indicate
+        // up to the bound protocol driver packets received over the
+        // network that are longer than the packet size specified by
+        // OID_GEN_MAXIMUM_TOTAL_SIZE.
+        //
+
+        __fallthrough;
+
+    case OID_GEN_TRANSMIT_BLOCK_SIZE:
+        //
+        // The OID_GEN_TRANSMIT_BLOCK_SIZE OID specifies the minimum
+        // number of bytes that a single net packet occupies in the
+        // transmit buffer space of the NIC. In our case, the transmit
+        // block size is identical to its maximum packet size.
+        __fallthrough;
+
+    case OID_GEN_RECEIVE_BLOCK_SIZE:
+        //
+        // The OID_GEN_RECEIVE_BLOCK_SIZE OID specifies the amount of
+        // storage, in bytes, that a single packet occupies in the receive
+        // buffer space of the NIC.
+        //
+        ulInfo = (ULONG) TAP_MAX_FRAME_SIZE;
+        pInfo = &ulInfo;
+        break;
+
+    case OID_GEN_INTERRUPT_MODERATION:
+        {
+            PNDIS_INTERRUPT_MODERATION_PARAMETERS moderationParams
+                = (PNDIS_INTERRUPT_MODERATION_PARAMETERS)OidRequest->DATA.QUERY_INFORMATION.InformationBuffer;
+
+            moderationParams->Header.Type = NDIS_OBJECT_TYPE_DEFAULT;
+            moderationParams->Header.Revision = NDIS_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+            moderationParams->Header.Size = NDIS_SIZEOF_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+            moderationParams->Flags = 0;
+            moderationParams->InterruptModeration = NdisInterruptModerationNotSupported;
+            ulInfoLen = NDIS_SIZEOF_INTERRUPT_MODERATION_PARAMETERS_REVISION_1;
+        }
+        break;
+
+    case OID_PNP_QUERY_POWER:
+        // Simply succeed this.
+        break;
+
+    case OID_GEN_VENDOR_ID:
+        //
+        // Specify a three-byte IEEE-registered vendor code, followed
+        // by a single byte that the vendor assigns to identify a
+        // particular NIC. The IEEE code uniquely identifies the vendor
+        // and is the same as the three bytes appearing at the beginning
+        // of the NIC hardware address. Vendors without an IEEE-registered
+        // code should use the value 0xFFFFFF.
+        //
+
+        ulInfo = TAP_VENDOR_ID;
+        pInfo = &ulInfo;
+        break;
+
+    case OID_GEN_VENDOR_DESCRIPTION:
+        //
+        // Specify a zero-terminated string describing the NIC vendor.
+        //
+        pInfo = VendorDesc;
+        ulInfoLen = sizeof(VendorDesc);
+        break;
+
+    case OID_GEN_VENDOR_DRIVER_VERSION:
+        //
+        // Specify the vendor-assigned version number of the NIC driver.
+ // The low-order half of the return value specifies the minor + // version; the high-order half specifies the major version. + // + + ulInfo = TAP_DRIVER_VENDOR_VERSION; + pInfo = &ulInfo; + break; + + case OID_GEN_DRIVER_VERSION: + // + // Specify the NDIS version in use by the NIC driver. The high + // byte is the major version number; the low byte is the minor + // version number. + // + usInfo = (USHORT) (TAP_NDIS_MAJOR_VERSION<<8) + TAP_NDIS_MINOR_VERSION; + pInfo = (PVOID) &usInfo; + ulInfoLen = sizeof(USHORT); + break; + + case OID_802_3_MAXIMUM_LIST_SIZE: + // + // The maximum number of multicast addresses the NIC driver + // can manage. This list is global for all protocols bound + // to (or above) the NIC. Consequently, a protocol can receive + // NDIS_STATUS_MULTICAST_FULL from the NIC driver when + // attempting to set the multicast address list, even if + // the number of elements in the given list is less than + // the number originally returned for this query. + // + + ulInfo = TAP_MAX_MCAST_LIST; + pInfo = &ulInfo; + break; + + case OID_GEN_XMIT_ERROR: + ulInfo = (ULONG) + (Adapter->TxAbortExcessCollisions + + Adapter->TxDmaUnderrun + + Adapter->TxLostCRS + + Adapter->TxLateCollisions+ + Adapter->TransmitFailuresOther); + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_ERROR: + ulInfo = (ULONG) + (Adapter->RxCrcErrors + + Adapter->RxAlignmentErrors + + Adapter->RxDmaOverrunErrors + + Adapter->RxRuntErrors); + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_DISCARDS: + ulInfo = (ULONG)Adapter->RxResourceErrors; + pInfo = &ulInfo; + break; + + case OID_GEN_RCV_NO_BUFFER: + ulInfo = (ULONG)Adapter->RxResourceErrors; + pInfo = &ulInfo; + break; + + case OID_GEN_XMIT_OK: + ulInfo64 = Adapter->FramesTxBroadcast + + Adapter->FramesTxMulticast + + Adapter->FramesTxDirected; + pInfo = &ulInfo64; + if (OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength >= sizeof(ULONG64) || + OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength == 0) + { + ulInfoLen = sizeof(ULONG64); + } + else + { + ulInfoLen = sizeof(ULONG); + } + + // We should always report that only 8 bytes are required to keep ndistest happy + OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(ULONG64); + break; + + case OID_GEN_RCV_OK: + ulInfo64 = Adapter->FramesRxBroadcast + + Adapter->FramesRxMulticast + + Adapter->FramesRxDirected; + + pInfo = &ulInfo64; + + if (OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength >= sizeof(ULONG64) || + OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength == 0) + { + ulInfoLen = sizeof(ULONG64); + } + else + { + ulInfoLen = sizeof(ULONG); + } + + // We should always report that only 8 bytes are required to keep ndistest happy + OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(ULONG64); + break; + + case OID_802_3_RCV_ERROR_ALIGNMENT: + + ulInfo = Adapter->RxAlignmentErrors; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_ONE_COLLISION: + + ulInfo = Adapter->OneRetry; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_MORE_COLLISIONS: + + ulInfo = Adapter->MoreThanOneRetry; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_DEFERRED: + + ulInfo = Adapter->TxOKButDeferred; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_MAX_COLLISIONS: + + ulInfo = Adapter->TxAbortExcessCollisions; + pInfo = &ulInfo; + break; + + case OID_802_3_RCV_OVERRUN: + + ulInfo = Adapter->RxDmaOverrunErrors; + pInfo = &ulInfo; + break; + + case OID_802_3_XMIT_UNDERRUN: + + ulInfo = Adapter->TxDmaUnderrun; + pInfo = &ulInfo; + break; + + case OID_GEN_STATISTICS: + + if 
(OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength < sizeof(NDIS_STATISTICS_INFO))
+        {
+            status = NDIS_STATUS_INVALID_LENGTH;
+            OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = sizeof(NDIS_STATISTICS_INFO);
+            break;
+        }
+        else
+        {
+            PNDIS_STATISTICS_INFO Statistics
+                = (PNDIS_STATISTICS_INFO)OidRequest->DATA.QUERY_INFORMATION.InformationBuffer;
+
+            {C_ASSERT(sizeof(NDIS_STATISTICS_INFO) >= NDIS_SIZEOF_STATISTICS_INFO_REVISION_1);}
+            Statistics->Header.Type = NDIS_OBJECT_TYPE_DEFAULT;
+            Statistics->Header.Size = NDIS_SIZEOF_STATISTICS_INFO_REVISION_1;
+            Statistics->Header.Revision = NDIS_STATISTICS_INFO_REVISION_1;
+
+            Statistics->SupportedStatistics = TAP_SUPPORTED_STATISTICS;
+
+            /* Bytes in */
+            Statistics->ifHCInOctets =
+                Adapter->BytesRxDirected +
+                Adapter->BytesRxMulticast +
+                Adapter->BytesRxBroadcast;
+
+            Statistics->ifHCInUcastOctets =
+                Adapter->BytesRxDirected;
+
+            Statistics->ifHCInMulticastOctets =
+                Adapter->BytesRxMulticast;
+
+            Statistics->ifHCInBroadcastOctets =
+                Adapter->BytesRxBroadcast;
+
+            /* Packets in */
+            Statistics->ifHCInUcastPkts =
+                Adapter->FramesRxDirected;
+
+            Statistics->ifHCInMulticastPkts =
+                Adapter->FramesRxMulticast;
+
+            Statistics->ifHCInBroadcastPkts =
+                Adapter->FramesRxBroadcast;
+
+            /* Errors in */
+            Statistics->ifInErrors =
+                Adapter->RxCrcErrors +
+                Adapter->RxAlignmentErrors +
+                Adapter->RxDmaOverrunErrors +
+                Adapter->RxRuntErrors;
+
+            Statistics->ifInDiscards =
+                Adapter->RxResourceErrors;
+
+
+            /* Bytes out */
+            Statistics->ifHCOutOctets =
+                Adapter->BytesTxDirected +
+                Adapter->BytesTxMulticast +
+                Adapter->BytesTxBroadcast;
+
+            Statistics->ifHCOutUcastOctets =
+                Adapter->BytesTxDirected;
+
+            Statistics->ifHCOutMulticastOctets =
+                Adapter->BytesTxMulticast;
+
+            Statistics->ifHCOutBroadcastOctets =
+                Adapter->BytesTxBroadcast;
+
+            /* Packets out */
+            Statistics->ifHCOutUcastPkts =
+                Adapter->FramesTxDirected;
+
+            Statistics->ifHCOutMulticastPkts =
+                Adapter->FramesTxMulticast;
+
+            Statistics->ifHCOutBroadcastPkts =
+                Adapter->FramesTxBroadcast;
+
+            /* Errors out */
+            Statistics->ifOutErrors =
+                Adapter->TxAbortExcessCollisions +
+                Adapter->TxDmaUnderrun +
+                Adapter->TxLostCRS +
+                Adapter->TxLateCollisions +
+                Adapter->TransmitFailuresOther;
+
+            Statistics->ifOutDiscards = 0ULL;
+
+            ulInfoLen = NDIS_SIZEOF_STATISTICS_INFO_REVISION_1;
+        }
+
+        break;
+
+        // TODO: Implement these query information requests.
+    case OID_GEN_RECEIVE_BUFFER_SPACE:
+    case OID_GEN_MAXIMUM_SEND_PACKETS:
+    case OID_GEN_TRANSMIT_QUEUE_LENGTH:
+    case OID_802_3_XMIT_HEARTBEAT_FAILURE:
+    case OID_802_3_XMIT_TIMES_CRS_LOST:
+    case OID_802_3_XMIT_LATE_COLLISIONS:
+
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    if (status == NDIS_STATUS_SUCCESS)
+    {
+        ASSERT(ulInfoLen > 0);
+
+        if (ulInfoLen <= OidRequest->DATA.QUERY_INFORMATION.InformationBufferLength)
+        {
+            if(pInfo)
+            {
+                // Copy result into InformationBuffer
+                NdisMoveMemory(
+                    OidRequest->DATA.QUERY_INFORMATION.InformationBuffer,
+                    pInfo,
+                    ulInfoLen
+                    );
+            }
+
+            OidRequest->DATA.QUERY_INFORMATION.BytesWritten = ulInfoLen;
+        }
+        else
+        {
+            // too short
+            OidRequest->DATA.QUERY_INFORMATION.BytesNeeded = ulInfoLen;
+            status = NDIS_STATUS_BUFFER_TOO_SHORT;
+        }
+    }
+
+    return status;
+}
+
+NDIS_STATUS
+AdapterOidRequest(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNDIS_OID_REQUEST OidRequest
+    )
+/*++
+
+Routine Description:
+
+    Entry point called by NDIS to get or set the value of a specified OID.
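+    Set requests are dispatched to tapSetInformation; query and
+    statistics requests are dispatched to tapQueryInformation.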
+
+Arguments:
+
+    MiniportAdapterContext  - Our adapter handle
+    OidRequest              - The OID request to handle
+
+Return Value:
+
+    Return code from the request handler below.
+
+--*/
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    NDIS_STATUS status;
+
+    // Dispatch based on request type.
+    switch (OidRequest->RequestType)
+    {
+    case NdisRequestSetInformation:
+        status = tapSetInformation(adapter,OidRequest);
+        break;
+
+    case NdisRequestQueryInformation:
+    case NdisRequestQueryStatistics:
+        status = tapQueryInformation(adapter,OidRequest);
+        break;
+
+    case NdisRequestMethod: // TAP doesn't need to respond to this request type.
+    default:
+        //
+        // The entry point may be used by other requests
+        //
+        status = NDIS_STATUS_NOT_SUPPORTED;
+        break;
+    }
+
+    return status;
+}
+
+VOID
+AdapterCancelOidRequest(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PVOID RequestId
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    UNREFERENCED_PARAMETER(RequestId);
+
+    //
+    // This miniport sample does not pend any OID requests, so we don't have
+    // to worry about cancelling them.
+    //
+}
+
diff --git a/installer/tap/src/src/proto.h b/installer/tap/src/src/proto.h new file mode 100644 index 0000000..cc23de6 --- /dev/null +++ b/installer/tap/src/src/proto.h @@ -0,0 +1,224 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+//============================================================
+// MAC address, Ethernet header, and ARP
+//============================================================
+
+#pragma pack(1)
+
+#define IP_HEADER_SIZE 20
+#define IPV6_HEADER_SIZE 40
+
+#define MACADDR_SIZE 6
+typedef unsigned char MACADDR[MACADDR_SIZE];
+
+typedef unsigned long IPADDR;
+typedef unsigned char IPV6ADDR[16];
+
+//-----------------
+// Ethernet address
+//-----------------
+
+typedef struct {
+  MACADDR addr;
+} ETH_ADDR;
+
+typedef struct {
+  ETH_ADDR list[TAP_MAX_MCAST_LIST];
+} MC_LIST;
+
+
+// BUGBUG!!! Consider using system defines in netiodef.h!!!
+
+//----------------
+// Ethernet header
+//----------------
+typedef struct
+{
+  MACADDR dest;   /* destination eth addr */
+  MACADDR src;    /* source ether addr    */
+  USHORT proto;   /* packet type ID field */
+} ETH_HEADER, *PETH_HEADER;
+
+//----------------
+// ARP packet
+//----------------
+
+typedef struct
+  {
+    MACADDR m_MAC_Destination;    // Reverse these two
+    MACADDR m_MAC_Source;         // to answer ARP requests
+    USHORT m_Proto;               // 0x0806
+
+#   define MAC_ADDR_TYPE 0x0001
+    USHORT m_MAC_AddressType;     // 0x0001
+
+    USHORT m_PROTO_AddressType;   // 0x0800
+    UCHAR m_MAC_AddressSize;      // 0x06
+    UCHAR m_PROTO_AddressSize;    // 0x04
+
+#   define ARP_REQUEST 0x0001
+#   define ARP_REPLY   0x0002
+    USHORT m_ARP_Operation;       // 0x0001 for ARP request, 0x0002 for ARP reply
+
+    MACADDR m_ARP_MAC_Source;
+    IPADDR m_ARP_IP_Source;
+    MACADDR m_ARP_MAC_Destination;
+    IPADDR m_ARP_IP_Destination;
+  }
+ARP_PACKET, *PARP_PACKET;
+
+//----------
+// IP Header
+//----------
+
+typedef struct {
+# define IPH_GET_VER(v) (((v) >> 4) & 0x0F)
+# define IPH_GET_LEN(v) (((v) & 0x0F) << 2)
+  UCHAR version_len;
+
+  UCHAR tos;
+  USHORT tot_len;
+  USHORT id;
+
+# define IP_OFFMASK 0x1fff
+  USHORT frag_off;
+
+  UCHAR ttl;
+
+# define IPPROTO_UDP  17   /* UDP protocol */
+# define IPPROTO_TCP   6   /* TCP protocol */
+# define IPPROTO_ICMP  1   /* ICMP protocol */
+# define IPPROTO_IGMP  2   /* IGMP protocol */
+  UCHAR protocol;
+
+  USHORT check;
+  ULONG saddr;
+  ULONG daddr;
+  /* The options start here. */
+} IPHDR;
+
+//-----------
+// UDP header
+//-----------
+
+typedef struct {
+  USHORT source;
+  USHORT dest;
+  USHORT len;
+  USHORT check;
+} UDPHDR;
+
+//--------------------------
+// TCP header, per RFC 793.
+//--------------------------
+
+typedef struct {
+  USHORT source;    /* source port */
+  USHORT dest;      /* destination port */
+  ULONG seq;        /* sequence number */
+  ULONG ack_seq;    /* acknowledgement number */
+
+# define TCPH_GET_DOFF(d) (((d) & 0xF0) >> 2)
+  UCHAR doff_res;
+
+# define TCPH_FIN_MASK (1<<0)
+# define TCPH_SYN_MASK (1<<1)
+# define TCPH_RST_MASK (1<<2)
+# define TCPH_PSH_MASK (1<<3)
+# define TCPH_ACK_MASK (1<<4)
+# define TCPH_URG_MASK (1<<5)
+# define TCPH_ECE_MASK (1<<6)
+# define TCPH_CWR_MASK (1<<7)
+  UCHAR flags;
+
+  USHORT window;
+  USHORT check;
+  USHORT urg_ptr;
+} TCPHDR;
+
+#define TCPOPT_EOL     0
+#define TCPOPT_NOP     1
+#define TCPOPT_MAXSEG  2
+#define TCPOLEN_MAXSEG 4
+
+//------------
+// IPv6 Header
+//------------
+
+typedef struct {
+  UCHAR version_prio;
+  UCHAR flow_lbl[3];
+  USHORT payload_len;
+# define IPPROTO_ICMPV6 0x3a   /* ICMP protocol v6 */
+  UCHAR nexthdr;
+  UCHAR hop_limit;
+  IPV6ADDR saddr;
+  IPV6ADDR daddr;
+} IPV6HDR;
+
+//--------------------------------------------
+// ICMPv6 NS/NA Packets (RFC4443 and RFC4861)
+//--------------------------------------------
+
+// Neighbor Solicitation - RFC 4861, 4.3
+// (this is just the ICMPv6 part of the packet)
+typedef struct {
+  UCHAR type;
+# define ICMPV6_TYPE_NS 135   // neighbour solicitation
+  UCHAR code;
+# define ICMPV6_CODE_0  0     // no specific sub-code for NS/NA
+  USHORT checksum;
+  ULONG reserved;
+  IPV6ADDR target_addr;
+} ICMPV6_NS;
+
+// Neighbor Advertisement - RFC 4861, 4.4 + 4.6/4.6.1
+// (this is just the ICMPv6 payload)
+typedef struct {
+  UCHAR type;
+# define ICMPV6_TYPE_NA 136   // neighbour advertisement
+  UCHAR code;
+# define ICMPV6_CODE_0  0     // no specific sub-code for NS/NA
+  USHORT checksum;
+  UCHAR rso_bits;             // Router(0), Solicited(2), Ovrrd(4)
+  UCHAR reserved[3];
+  IPV6ADDR target_addr;
+// always include "Target Link-layer Address" option (RFC 4861 4.6.1)
+  UCHAR opt_type;
+#define ICMPV6_OPTION_TLLA 2
+  UCHAR opt_length;
+#define ICMPV6_LENGTH_TLLA 1   // multiplied by 8 -> 1 = 8 bytes
+  MACADDR target_macaddr;
+} ICMPV6_NA;
+
+// this is the complete packet with Ethernet and IPv6 headers
+typedef struct {
+  ETH_HEADER eth;
+  IPV6HDR ipv6;
+  ICMPV6_NA icmpv6;
+} ICMPV6_NA_PKT;
+
+#pragma pack()
diff --git a/installer/tap/src/src/prototypes.h b/installer/tap/src/src/prototypes.h new file mode 100644 index 0000000..ad70261 --- /dev/null +++ b/installer/tap/src/src/prototypes.h @@ -0,0 +1,87 @@
+/*
+ * TAP-Windows -- A kernel driver to provide virtual tap
+ * device functionality on Windows.
+ *
+ * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson.
+ *
+ * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc.,
+ * and is released under the GPL version 2 (see below).
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (see the file COPYING included with this
+ * distribution); if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef TAP_PROTOTYPES_DEFINED
+#define TAP_PROTOTYPES_DEFINED
+
+DRIVER_INITIALIZE DriverEntry;
+
+//VOID AdapterFreeResources
+//   (
+//    TapAdapterPointer p_Adapter
+//   );
+//
+
+//
+//NTSTATUS TapDeviceHook
+//   (
+//    IN PDEVICE_OBJECT p_DeviceObject,
+//    IN PIRP p_IRP
+//   );
+//
+
+NDIS_STATUS
+CreateTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+VOID
+DestroyTapDevice(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+// Flush the pending send TAP packet queue.
+VOID
+tapFlushSendPacketQueue(
+    __in PTAP_ADAPTER_CONTEXT Adapter
+    );
+
+VOID
+IndicateReceivePacket(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PUCHAR packetData,
+    __in const unsigned int packetLength
+    );
+
+BOOLEAN
+ProcessDHCP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const ETH_HEADER *eth,
+    __in const IPHDR *ip,
+    __in const UDPHDR *udp,
+    __in const DHCP *dhcp,
+    __in int optlen
+    );
+
+BOOLEAN
+ProcessARP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const PARP_PACKET src,
+    __in const IPADDR adapter_ip,
+    __in const IPADDR ip_network,
+    __in const IPADDR ip_netmask,
+    __in const MACADDR mac
+    );
+
+#endif
diff --git a/installer/tap/src/src/resource.rc b/installer/tap/src/src/resource.rc new file mode 100644 index 0000000..fbe2775 --- /dev/null +++ b/installer/tap/src/src/resource.rc @@ -0,0 +1,62 @@
+#include <windows.h>
+#include <ntverp.h>
+
+#include "config.h"
+
+#undef VER_PRODUCTVERSION
+#undef VER_PRODUCTVERSION_STR
+#undef VER_COMPANYNAME_STR
+#undef VER_PRODUCTNAME_STR
+
+/* VER_FILETYPE, VER_FILESUBTYPE, VER_FILEDESCRIPTION_STR
+ * and VER_INTERNALNAME_STR must be defined before including COMMON.VER
+ * The strings don't need a '\0', since common.ver has them.
+ */ + +#define VER_FILETYPE VFT_DRV +/* possible values: VFT_UNKNOWN + VFT_APP + VFT_DLL + VFT_DRV + VFT_FONT + VFT_VXD + VFT_STATIC_LIB +*/ +#define VER_FILESUBTYPE VFT2_DRV_NETWORK +/* possible values VFT2_UNKNOWN + VFT2_DRV_PRINTER + VFT2_DRV_KEYBOARD + VFT2_DRV_LANGUAGE + VFT2_DRV_DISPLAY + VFT2_DRV_MOUSE + VFT2_DRV_NETWORK + VFT2_DRV_SYSTEM + VFT2_DRV_INSTALLABLE + VFT2_DRV_SOUND + VFT2_DRV_COMM +*/ + +#define VER_COMPANYNAME_STR "The OpenVPN Project" +#define VER_FILEDESCRIPTION_STR "TAP-Windows Virtual Network Driver (NDIS 6.0)" +#define VER_ORIGINALFILENAME_STR PRODUCT_TAP_WIN_COMPONENT_ID ".sys" +#define VER_LEGALCOPYRIGHT_YEARS "2003-2014" +#define VER_LEGALCOPYRIGHT_STR "OpenVPN Technologies, Inc." + + +#define VER_PRODUCTNAME_STR VER_FILEDESCRIPTION_STR +#define VER_PRODUCTVERSION PRODUCT_TAP_WIN_MAJOR,00,00,PRODUCT_TAP_WIN_MINOR + +#define XSTR(s) STR(s) +#define STR(s) #s + +#define VSTRING PRODUCT_VERSION " " XSTR(PRODUCT_TAP_WIN_MAJOR) "/" XSTR(PRODUCT_TAP_WIN_MINOR) + +#ifdef DBG +#define VER_PRODUCTVERSION_STR VSTRING " (DEBUG)" +#else +#define VER_PRODUCTVERSION_STR VSTRING +#endif + +#define VER_INTERNALNAME_STR VER_ORIGINALFILENAME_STR + +#include "common.ver" diff --git a/installer/tap/src/src/rxpath.c b/installer/tap/src/src/rxpath.c new file mode 100644 index 0000000..7415b5e --- /dev/null +++ b/installer/tap/src/src/rxpath.c @@ -0,0 +1,667 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. +// + +#include "tap.h" + +//====================================================================== +// TAP Receive Path Support +//====================================================================== + +#ifdef ALLOC_PRAGMA +#pragma alloc_text( PAGE, TapDeviceWrite) +#endif // ALLOC_PRAGMA + +//=============================================================== +// Used in cases where internally generated packets such as +// ARP or DHCP replies must be returned to the kernel, to be +// seen as an incoming packet "arriving" on the interface. +//=============================================================== + +VOID +IndicateReceivePacket( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PUCHAR packetData, + __in const unsigned int packetLength + ) +{ + PUCHAR injectBuffer; + + // + // Handle miniport Pause + // --------------------- + // NDIS 6 miniports implement a temporary "Pause" state normally followed + // by the Restart. While in the Pause state it is forbidden for the miniport + // to indicate receive NBLs. 
+ // + // That is: The device interface may be "up", but the NDIS miniport send/receive + // interface may be temporarily "down". + // + // BUGBUG!!! In the initial implementation of the NDIS 6 TapOas inject path + // the code below will simply ignore inject packets passed to the driver while + // the miniport is in the Paused state. + // + // The correct implementation is to go ahead and build the NBLs corresponding + // to the inject packet - but queue them. When Restart is entered the + // queued NBLs would be dequeued and indicated to the host. + // + if(tapAdapterSendAndReceiveReady(Adapter) != NDIS_STATUS_SUCCESS) + { + DEBUGP (("[%s] Lying send in IndicateReceivePacket while adapter paused\n", + MINIPORT_INSTANCE_ID (Adapter))); + + return; + } + + // Allocate flat buffer for packet data. + injectBuffer = (PUCHAR )NdisAllocateMemoryWithTagPriority( + Adapter->MiniportAdapterHandle, + packetLength, + TAP_RX_INJECT_BUFFER_TAG, + NormalPoolPriority + ); + + if( injectBuffer) + { + PMDL mdl; + + // Copy packet data to flat buffer. + NdisMoveMemory (injectBuffer, packetData, packetLength); + + // Allocate MDL for flat buffer. + mdl = NdisAllocateMdl( + Adapter->MiniportAdapterHandle, + injectBuffer, + packetLength + ); + + if( mdl ) + { + PNET_BUFFER_LIST netBufferList; + + mdl->Next = NULL; // No next MDL + + // Allocate the NBL and NB. Link MDL chain to NB. + netBufferList = NdisAllocateNetBufferAndNetBufferList( + Adapter->ReceiveNblPool, + 0, // ContextSize + 0, // ContextBackFill + mdl, // MDL chain + 0, + packetLength + ); + + if(netBufferList != NULL) + { + ULONG receiveFlags = 0; + LONG nblCount; + + NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL + + if(KeGetCurrentIrql() == DISPATCH_LEVEL) + { + receiveFlags |= NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL; + } + + // Set flag indicating that this is an injected packet + TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList); + TAP_RX_NBL_FLAG_SET(netBufferList,TAP_RX_NBL_FLAGS_IS_INJECTED); + + netBufferList->MiniportReserved[0] = NULL; + netBufferList->MiniportReserved[1] = NULL; + + // Increment in-flight receive NBL count. + nblCount = NdisInterlockedIncrement(&Adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + + netBufferList->SourceHandle = Adapter->MiniportAdapterHandle; + + // + // Indicate the packet + // ------------------- + // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length + // contains the complete packet including Ethernet header and payload. + // + NdisMIndicateReceiveNetBufferLists( + Adapter->MiniportAdapterHandle, + netBufferList, + NDIS_DEFAULT_PORT_NUMBER, + 1, // NumberOfNetBufferLists + receiveFlags + ); + + return; + } + else + { + DEBUGP (("[%s] NdisAllocateNetBufferAndNetBufferList failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + + NdisFreeMdl(mdl); + NdisFreeMemory(injectBuffer,0,0); + } + } + else + { + DEBUGP (("[%s] NdisAllocateMdl failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + + NdisFreeMemory(injectBuffer,0,0); + } + } + else + { + DEBUGP (("[%s] NdisAllocateMemoryWithTagPriority failed in IndicateReceivePacket\n", + MINIPORT_INSTANCE_ID (Adapter))); + NOTE_ERROR (); + } +} + +VOID +tapCompleteIrpAndFreeReceiveNetBufferList( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNET_BUFFER_LIST NetBufferList, // Only one NB here... + __in NTSTATUS IoCompletionStatus + ) +{ + PIRP irp; + ULONG frameType, netBufferCount, byteCount; + LONG nblCount; + + // Fetch NB frame type. 
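+    // (Directed, multicast, or broadcast -- classified from the
+    // destination MAC address of the frame.)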
frameType = tapGetNetBufferFrameType(NET_BUFFER_LIST_FIRST_NB(NetBufferList));
+
+    // Fetch statistics for all NBs linked to the NBL.
+    netBufferCount = tapGetNetBufferCountsFromNetBufferList(
+                        NetBufferList,
+                        &byteCount
+                        );
+
+    // Update statistics by frame type
+    if(IoCompletionStatus == STATUS_SUCCESS)
+    {
+        switch(frameType)
+        {
+        case NDIS_PACKET_TYPE_DIRECTED:
+            Adapter->FramesRxDirected += netBufferCount;
+            Adapter->BytesRxDirected += byteCount;
+            break;
+
+        case NDIS_PACKET_TYPE_BROADCAST:
+            Adapter->FramesRxBroadcast += netBufferCount;
+            Adapter->BytesRxBroadcast += byteCount;
+            break;
+
+        case NDIS_PACKET_TYPE_MULTICAST:
+            Adapter->FramesRxMulticast += netBufferCount;
+            Adapter->BytesRxMulticast += byteCount;
+            break;
+
+        default:
+            ASSERT(FALSE);
+            break;
+        }
+    }
+
+    //
+    // Handle P2P Packet
+    // -----------------
+    // Free MDL allocated for P2P Ethernet header.
+    //
+    if(TAP_RX_NBL_FLAG_TEST(NetBufferList,TAP_RX_NBL_FLAGS_IS_P2P))
+    {
+        PNET_BUFFER netBuffer;
+        PMDL mdl;
+
+        netBuffer = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+        mdl = NET_BUFFER_FIRST_MDL(netBuffer);
+        mdl->Next = NULL;
+
+        NdisFreeMdl(mdl);
+    }
+
+    //
+    // Handle Injected Packet
+    // -----------------------
+    // Free MDL and data buffer allocated for injected packet.
+    //
+    if(TAP_RX_NBL_FLAG_TEST(NetBufferList,TAP_RX_NBL_FLAGS_IS_INJECTED))
+    {
+        PNET_BUFFER netBuffer;
+        PMDL mdl;
+        PUCHAR injectBuffer;
+
+        netBuffer = NET_BUFFER_LIST_FIRST_NB(NetBufferList);
+        mdl = NET_BUFFER_FIRST_MDL(netBuffer);
+
+        injectBuffer = (PUCHAR )MmGetSystemAddressForMdlSafe(mdl,NormalPagePriority);
+
+        if(injectBuffer)
+        {
+            NdisFreeMemory(injectBuffer,0,0);
+        }
+
+        NdisFreeMdl(mdl);
+    }
+
+    //
+    // Complete the IRP
+    //
+    irp = (PIRP )NetBufferList->MiniportReserved[0];
+
+    if(irp)
+    {
+        irp->IoStatus.Status = IoCompletionStatus;
+        IoCompleteRequest(irp, IO_NO_INCREMENT);
+    }
+
+    // Decrement in-flight receive NBL count.
+    nblCount = NdisInterlockedDecrement(&Adapter->ReceiveNblInFlightCount);
+    ASSERT(nblCount >= 0 );
+    if (0 == nblCount)
+    {
+        NdisSetEvent(&Adapter->ReceiveNblInFlightCountZeroEvent);
+    }
+
+    // Free the NBL
+    NdisFreeNetBufferList(NetBufferList);
+}
+
+VOID
+AdapterReturnNetBufferLists(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in ULONG ReturnFlags
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    PNET_BUFFER_LIST currentNbl, nextNbl;
+
+    UNREFERENCED_PARAMETER(ReturnFlags);
+
+    //
+    // Process each NBL individually
+    //
+    currentNbl = NetBufferLists;
+    while (currentNbl)
+    {
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+        NET_BUFFER_LIST_NEXT_NBL(currentNbl) = NULL;
+
+        // Complete write IRP and free NBL and associated resources.
+        tapCompleteIrpAndFreeReceiveNetBufferList(
+            adapter,
+            currentNbl,
+            STATUS_SUCCESS
+            );
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+}
+
+// IRP_MJ_WRITE callback.
+NTSTATUS
+TapDeviceWrite(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+{
+    NTSTATUS ntStatus = STATUS_SUCCESS;     // Assume success
+    PIO_STACK_LOCATION irpSp;               // Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT adapter = NULL;
+    ULONG dataLength;
+
+    PAGED_CODE();
+
+    irpSp = IoGetCurrentIrpStackLocation( Irp );
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
+ // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + // + // Sanity checks on state variables + // + if (!tapAdapterReadAndWriteReady(adapter)) + { + //DEBUGP (("[%s] Interface is down in IRP_MJ_WRITE\n", + // MINIPORT_INSTANCE_ID (adapter))); + //NOTE_ERROR(); + + Irp->IoStatus.Status = ntStatus = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // Save IRP-accessible copy of buffer length + Irp->IoStatus.Information = irpSp->Parameters.Write.Length; + + if (Irp->MdlAddress == NULL) + { + DEBUGP (("[%s] MdlAddress is NULL for IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // + // Try to get a virtual address for the MDL. + // + NdisQueryMdl( + Irp->MdlAddress, + &Irp->AssociatedIrp.SystemBuffer, + &dataLength, + NormalPagePriority + ); + + if (Irp->AssociatedIrp.SystemBuffer == NULL) + { + DEBUGP (("[%s] Could not map address in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INSUFFICIENT_RESOURCES; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + ASSERT(dataLength == irpSp->Parameters.Write.Length); + + Irp->IoStatus.Information = irpSp->Parameters.Write.Length; + + // + // Handle miniport Pause + // --------------------- + // NDIS 6 miniports implement a temporary "Pause" state normally followed + // by the Restart. While in the Pause state it is forbidden for the miniport + // to indicate receive NBLs. + // + // That is: The device interface may be "up", but the NDIS miniport send/receive + // interface may be temporarily "down". + // + // BUGBUG!!! In the initial implementation of the NDIS 6 TapOas receive path + // the code below will perform a "lying send" for write IRPs passed to the + // driver while the miniport is in the Paused state. + // + // The correct implementation is to go ahead and build the NBLs corresponding + // to the user-mode write - but queue them. When Restart is entered the + // queued NBLs would be dequeued and indicated to the host. + // + if(tapAdapterSendAndReceiveReady(adapter) == NDIS_STATUS_SUCCESS) + { + if (!adapter->m_tun && ((irpSp->Parameters.Write.Length) >= ETHERNET_HEADER_SIZE)) + { + PNET_BUFFER_LIST netBufferList; + + DUMP_PACKET ("IRP_MJ_WRITE ETH", + (unsigned char *) Irp->AssociatedIrp.SystemBuffer, + irpSp->Parameters.Write.Length); + + //===================================================== + // If IPv4 packet, check whether or not packet + // was truncated. + //===================================================== +#if PACKET_TRUNCATION_CHECK + IPv4PacketSizeVerify ( + (unsigned char *) Irp->AssociatedIrp.SystemBuffer, + irpSp->Parameters.Write.Length, + FALSE, + "RX", + &adapter->m_RxTrunc + ); +#endif + (Irp->MdlAddress)->Next = NULL; // No next MDL + + // Allocate the NBL and NB. Link MDL chain to NB. + netBufferList = NdisAllocateNetBufferAndNetBufferList( + adapter->ReceiveNblPool, + 0, // ContextSize + 0, // ContextBackFill + Irp->MdlAddress, // MDL chain + 0, + dataLength + ); + + if(netBufferList != NULL) + { + LONG nblCount; + + NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL + + // Stash IRP pointer in NBL MiniportReserved[0] field. 
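+                // The IRP is completed later, in
+                // tapCompleteIrpAndFreeReceiveNetBufferList, once NDIS
+                // returns the NBL through AdapterReturnNetBufferLists.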
netBufferList->MiniportReserved[0] = Irp;
+                netBufferList->MiniportReserved[1] = NULL;
+
+                // This IRP is pended.
+                IoMarkIrpPending(Irp);
+
+                // This IRP cannot be cancelled while in-flight.
+                IoSetCancelRoutine(Irp,NULL);
+
+                TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList);
+
+                // Increment in-flight receive NBL count.
+                nblCount = NdisInterlockedIncrement(&adapter->ReceiveNblInFlightCount);
+                ASSERT(nblCount > 0 );
+
+                //
+                // Indicate the packet
+                // -------------------
+                // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length
+                // contains the complete packet including Ethernet header and payload.
+                //
+                NdisMIndicateReceiveNetBufferLists(
+                    adapter->MiniportAdapterHandle,
+                    netBufferList,
+                    NDIS_DEFAULT_PORT_NUMBER,
+                    1,      // NumberOfNetBufferLists
+                    0       // ReceiveFlags
+                    );
+
+                ntStatus = STATUS_PENDING;
+            }
+            else
+            {
+                DEBUGP (("[%s] NdisAllocateNetBufferAndNetBufferList failed in IRP_MJ_WRITE\n",
+                    MINIPORT_INSTANCE_ID (adapter)));
+                NOTE_ERROR ();
+
+                // Fail the IRP
+                Irp->IoStatus.Information = 0;
+                ntStatus = STATUS_INSUFFICIENT_RESOURCES;
+            }
+        }
+        else if (adapter->m_tun && ((irpSp->Parameters.Write.Length) >= IP_HEADER_SIZE))
+        {
+            PETH_HEADER p_UserToTap = &adapter->m_UserToTap;
+            PMDL mdl;   // Head of MDL chain.
+
+            // For IPv6, need to use Ethernet header with IPv6 proto
+            if ( IPH_GET_VER( ((IPHDR*) Irp->AssociatedIrp.SystemBuffer)->version_len) == 6 )
+            {
+                p_UserToTap = &adapter->m_UserToTap_IPv6;
+            }
+
+            DUMP_PACKET2 ("IRP_MJ_WRITE P2P",
+                p_UserToTap,
+                (unsigned char *) Irp->AssociatedIrp.SystemBuffer,
+                irpSp->Parameters.Write.Length);
+
+            //=====================================================
+            // If IPv4 packet, check whether or not packet
+            // was truncated.
+            //=====================================================
+#if PACKET_TRUNCATION_CHECK
+            IPv4PacketSizeVerify (
+                (unsigned char *) Irp->AssociatedIrp.SystemBuffer,
+                irpSp->Parameters.Write.Length,
+                TRUE,
+                "RX",
+                &adapter->m_RxTrunc
+                );
+#endif
+
+            //
+            // Allocate MDL for Ethernet header
+            // --------------------------------
+            // Irp->AssociatedIrp.SystemBuffer with length irpSp->Parameters.Write.Length
+            // contains only the Ethernet payload. Prepend the user-mode provided
+            // payload with the Ethernet header pointed to by p_UserToTap.
+            //
+            mdl = NdisAllocateMdl(
+                    adapter->MiniportAdapterHandle,
+                    p_UserToTap,
+                    sizeof(ETH_HEADER)
+                    );
+
+            if(mdl != NULL)
+            {
+                PNET_BUFFER_LIST netBufferList;
+
+                // Chain user's Ethernet payload behind Ethernet header.
+                mdl->Next = Irp->MdlAddress;
+                (Irp->MdlAddress)->Next = NULL;     // No next MDL
+
+                // Allocate the NBL and NB. Link MDL chain to NB.
+                netBufferList = NdisAllocateNetBufferAndNetBufferList(
+                    adapter->ReceiveNblPool,
+                    0,                  // ContextSize
+                    0,                  // ContextBackFill
+                    mdl,                // MDL chain
+                    0,
+                    sizeof(ETH_HEADER) + dataLength
+                    );
+
+                if(netBufferList != NULL)
+                {
+                    LONG nblCount;
+
+                    NET_BUFFER_LIST_NEXT_NBL(netBufferList) = NULL; // Only one NBL
+
+                    // This IRP is pended.
+                    IoMarkIrpPending(Irp);
+
+                    // This IRP cannot be cancelled while in-flight.
+                    IoSetCancelRoutine(Irp,NULL);
+
+                    // Stash IRP pointer in NBL MiniportReserved[0] field.
+                    netBufferList->MiniportReserved[0] = Irp;
+                    netBufferList->MiniportReserved[1] = NULL;
+
+                    // Set flag indicating that this is P2P packet
+                    TAP_RX_NBL_FLAGS_CLEAR_ALL(netBufferList);
+                    TAP_RX_NBL_FLAG_SET(netBufferList,TAP_RX_NBL_FLAGS_IS_P2P);
+
+                    // Increment in-flight receive NBL count.
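+                    // Balanced by NdisInterlockedDecrement in
+                    // tapCompleteIrpAndFreeReceiveNetBufferList; when the
+                    // count drops to zero, ReceiveNblInFlightCountZeroEvent
+                    // is signaled.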
+ nblCount = NdisInterlockedIncrement(&adapter->ReceiveNblInFlightCount); + ASSERT(nblCount > 0 ); + + // + // Indicate the packet + // + NdisMIndicateReceiveNetBufferLists( + adapter->MiniportAdapterHandle, + netBufferList, + NDIS_DEFAULT_PORT_NUMBER, + 1, // NumberOfNetBufferLists + 0 // ReceiveFlags + ); + + ntStatus = STATUS_PENDING; + } + else + { + mdl->Next = NULL; + NdisFreeMdl(mdl); + + DEBUGP (("[%s] NdisMIndicateReceiveNetBufferLists failed in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + NOTE_ERROR (); + + // Fail the IRP + Irp->IoStatus.Information = 0; + ntStatus = STATUS_INSUFFICIENT_RESOURCES; + } + } + else + { + DEBUGP (("[%s] NdisAllocateMdl failed in IRP_MJ_WRITE\n", + MINIPORT_INSTANCE_ID (adapter))); + NOTE_ERROR (); + + // Fail the IRP + Irp->IoStatus.Information = 0; + ntStatus = STATUS_INSUFFICIENT_RESOURCES; + } + } + else + { + DEBUGP (("[%s] Bad buffer size in IRP_MJ_WRITE, len=%d\n", + MINIPORT_INSTANCE_ID (adapter), + irpSp->Parameters.Write.Length)); + NOTE_ERROR (); + + Irp->IoStatus.Information = 0; // ETHERNET_HEADER_SIZE; + Irp->IoStatus.Status = ntStatus = STATUS_BUFFER_TOO_SMALL; + } + } + else + { + DEBUGP (("[%s] Lying send in IRP_MJ_WRITE while adapter paused\n", + MINIPORT_INSTANCE_ID (adapter))); + + ntStatus = STATUS_SUCCESS; + } + + if (ntStatus != STATUS_PENDING) + { + Irp->IoStatus.Status = ntStatus; + IoCompleteRequest(Irp, IO_NO_INCREMENT); + } + + return ntStatus; +} + diff --git a/installer/tap/src/src/tap-windows.h b/installer/tap/src/src/tap-windows.h new file mode 100644 index 0000000..9971534 --- /dev/null +++ b/installer/tap/src/src/tap-windows.h @@ -0,0 +1,75 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). This particular file + * (tap-windows.h) is also licensed using the MIT license (see COPYRIGHT.MIT). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#ifndef __TAP_WIN_H +#define __TAP_WIN_H + +/* + * ============= + * TAP IOCTLs + * ============= + */ + +#define TAP_WIN_CONTROL_CODE(request,method) \ + CTL_CODE (FILE_DEVICE_UNKNOWN, request, method, FILE_ANY_ACCESS) + +/* Present in 8.1 */ + +#define TAP_WIN_IOCTL_GET_MAC TAP_WIN_CONTROL_CODE (1, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_VERSION TAP_WIN_CONTROL_CODE (2, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_MTU TAP_WIN_CONTROL_CODE (3, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_INFO TAP_WIN_CONTROL_CODE (4, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT TAP_WIN_CONTROL_CODE (5, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_SET_MEDIA_STATUS TAP_WIN_CONTROL_CODE (6, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_DHCP_MASQ TAP_WIN_CONTROL_CODE (7, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_GET_LOG_LINE TAP_WIN_CONTROL_CODE (8, METHOD_BUFFERED) +#define TAP_WIN_IOCTL_CONFIG_DHCP_SET_OPT TAP_WIN_CONTROL_CODE (9, METHOD_BUFFERED) + +/* Added in 8.2 */ + +/* obsoletes TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT */ +#define TAP_WIN_IOCTL_CONFIG_TUN TAP_WIN_CONTROL_CODE (10, METHOD_BUFFERED) + +/* + * ================= + * Registry keys + * ================= + */ + +#define ADAPTER_KEY "SYSTEM\\CurrentControlSet\\Control\\Class\\{4D36E972-E325-11CE-BFC1-08002BE10318}" + +#define NETWORK_CONNECTIONS_KEY "SYSTEM\\CurrentControlSet\\Control\\Network\\{4D36E972-E325-11CE-BFC1-08002BE10318}" + +/* + * ====================== + * Filesystem prefixes + * ====================== + */ + +#define USERMODEDEVICEDIR "\\\\.\\Global\\" +#define SYSDEVICEDIR "\\Device\\" +#define USERDEVICEDIR "\\DosDevices\\Global\\" +#define TAP_WIN_SUFFIX ".tap" + +#endif // __TAP_WIN_H diff --git a/installer/tap/src/src/tap.h b/installer/tap/src/src/tap.h new file mode 100644 index 0000000..ded959b --- /dev/null +++ b/installer/tap/src/src/tap.h @@ -0,0 +1,83 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#ifndef __TAP_H +#define __TAP_H + +#include +#include +#include +#include + +#include "config.h" +#include "lock.h" +#include "constants.h" +#include "proto.h" +#include "mem.h" +#include "macinfo.h" +#include "dhcp.h" +#include "error.h" +#include "endian.h" +#include "dhcp.h" +#include "types.h" +#include "adapter.h" +#include "device.h" +#include "prototypes.h" +#include "tap-windows.h" + +//======================================================== +// Check for truncated IPv4 packets, log errors if found. +//======================================================== +#define PACKET_TRUNCATION_CHECK 0 + +//======================================================== +// EXPERIMENTAL -- Configure TAP device object to be +// accessible from non-administrative accounts, based +// on an advanced properties setting. +// +// Duplicates the functionality of OpenVPN's +// --allow-nonadmin directive. +//======================================================== +#define ENABLE_NONADMIN 1 + +// +// The driver has exactly one instance of the TAP_GLOBAL structure. NDIS keeps +// an opaque handle to this data, (it doesn't attempt to read or interpret this +// data), and it passes the handle back to the miniport in MiniportSetOptions +// and MiniportInitializeEx. +// +typedef struct _TAP_GLOBAL +{ + LIST_ENTRY AdapterList; + + NDIS_RW_LOCK Lock; + + NDIS_HANDLE NdisDriverHandle; // From NdisMRegisterMiniportDriver + +} TAP_GLOBAL, *PTAP_GLOBAL; + + +// Global data +extern TAP_GLOBAL GlobalData; + +#endif // __TAP_H diff --git a/installer/tap/src/src/tapdrvr.c b/installer/tap/src/src/tapdrvr.c new file mode 100644 index 0000000..6c537f1 --- /dev/null +++ b/installer/tap/src/src/tapdrvr.c @@ -0,0 +1,232 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +//====================================================== +// This driver is designed to work on Windows Vista or higher +// versions of Windows. +// +// It is SMP-safe and handles power management. +// +// By default we operate as a "tap" virtual ethernet +// 802.3 interface, but we can emulate a "tun" +// interface (point-to-point IPv4) through the +// TAP_WIN_IOCTL_CONFIG_POINT_TO_POINT or +// TAP_WIN_IOCTL_CONFIG_TUN ioctl. +//====================================================== + +// +// Include files. 
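+//
+// The ioctls mentioned above are issued from user mode with DeviceIoControl
+// against a handle opened on the device (see the filesystem prefixes in
+// tap-windows.h). A hedged sketch, with buffer layouts assumed from OpenVPN's
+// user-mode usage rather than taken from this file:
+//
+//     ULONG on = TRUE, len;
+//     DeviceIoControl(h, TAP_WIN_IOCTL_SET_MEDIA_STATUS,
+//                     &on, sizeof(on), &on, sizeof(on), &len, NULL);
+//
+//     ULONG tun[3] = { localIp, localIp & netmask, netmask };  // net order
+//     DeviceIoControl(h, TAP_WIN_IOCTL_CONFIG_TUN,
+//                     tun, sizeof(tun), tun, sizeof(tun), &len, NULL);
+//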
+//
+
+#include 
+
+#include "tap.h"
+
+
+// Global data
+TAP_GLOBAL GlobalData;
+
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( INIT, DriverEntry )
+#pragma alloc_text( PAGE, TapDriverUnload)
+#endif // ALLOC_PRAGMA
+
+NTSTATUS
+DriverEntry(
+    __in PDRIVER_OBJECT DriverObject,
+    __in PUNICODE_STRING RegistryPath
+    )
+/*++
+Routine Description:
+
+    In the context of its DriverEntry function, a miniport driver associates
+    itself with NDIS, specifies the NDIS version that it is using, and
+    registers its entry points.
+
+
+Arguments:
+    PVOID DriverObject - pointer to the driver object.
+    PVOID RegistryPath - pointer to the driver registry path.
+
+Return Value:
+
+    NTSTATUS code
+
+--*/
+{
+    NTSTATUS status;
+
+    UNREFERENCED_PARAMETER(RegistryPath);
+
+    DEBUGP (("[TAP] --> DriverEntry; version [%d.%d] %s %s\n",
+        TAP_DRIVER_MAJOR_VERSION,
+        TAP_DRIVER_MINOR_VERSION,
+        __DATE__,
+        __TIME__));
+
+    DEBUGP (("[TAP] Registry Path: '%wZ'\n", RegistryPath));
+
+    //
+    // Initialize any driver-global variables here.
+    //
+    NdisZeroMemory(&GlobalData, sizeof(GlobalData));
+
+    //
+    // The AdapterList in the GlobalData structure is used to track multiple
+    // adapters controlled by this miniport.
+    //
+    NdisInitializeListHead(&GlobalData.AdapterList);
+
+    //
+    // This lock protects the AdapterList.
+    //
+    NdisInitializeReadWriteLock(&GlobalData.Lock);
+
+    do
+    {
+        NDIS_MINIPORT_DRIVER_CHARACTERISTICS miniportCharacteristics;
+
+        NdisZeroMemory(&miniportCharacteristics, sizeof(miniportCharacteristics));
+
+        {C_ASSERT(sizeof(miniportCharacteristics) >= NDIS_SIZEOF_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2);}
+        miniportCharacteristics.Header.Type = NDIS_OBJECT_TYPE_MINIPORT_DRIVER_CHARACTERISTICS;
+        miniportCharacteristics.Header.Size = NDIS_SIZEOF_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2;
+        miniportCharacteristics.Header.Revision = NDIS_MINIPORT_DRIVER_CHARACTERISTICS_REVISION_2;
+
+        miniportCharacteristics.MajorNdisVersion = TAP_NDIS_MAJOR_VERSION;
+        miniportCharacteristics.MinorNdisVersion = TAP_NDIS_MINOR_VERSION;
+
+        miniportCharacteristics.MajorDriverVersion = TAP_DRIVER_MAJOR_VERSION;
+        miniportCharacteristics.MinorDriverVersion = TAP_DRIVER_MINOR_VERSION;
+
+        miniportCharacteristics.Flags = 0;
+
+        //miniportCharacteristics.SetOptionsHandler = MPSetOptions; // Optional
+        miniportCharacteristics.InitializeHandlerEx = AdapterCreate;
+        miniportCharacteristics.HaltHandlerEx = AdapterHalt;
+        miniportCharacteristics.UnloadHandler = TapDriverUnload;
+        miniportCharacteristics.PauseHandler = AdapterPause;
+        miniportCharacteristics.RestartHandler = AdapterRestart;
+        miniportCharacteristics.OidRequestHandler = AdapterOidRequest;
+        miniportCharacteristics.SendNetBufferListsHandler = AdapterSendNetBufferLists;
+        miniportCharacteristics.ReturnNetBufferListsHandler = AdapterReturnNetBufferLists;
+        miniportCharacteristics.CancelSendHandler = AdapterCancelSend;
+        miniportCharacteristics.CheckForHangHandlerEx = AdapterCheckForHangEx;
+        miniportCharacteristics.ResetHandlerEx = AdapterReset;
+        miniportCharacteristics.DevicePnPEventNotifyHandler = AdapterDevicePnpEventNotify;
+        miniportCharacteristics.ShutdownHandlerEx = AdapterShutdownEx;
+        miniportCharacteristics.CancelOidRequestHandler = AdapterCancelOidRequest;
+
+        //
+        // Associate the miniport driver with NDIS by calling
+        // NdisMRegisterMiniportDriver. This function returns an NdisDriverHandle.
+        // The miniport driver must retain this handle but it should never attempt
+        // to access or interpret this handle.
+ // + // By calling NdisMRegisterMiniportDriver, the driver indicates that it + // is ready for NDIS to call the driver's MiniportSetOptions and + // MiniportInitializeEx handlers. + // + DEBUGP (("[TAP] Calling NdisMRegisterMiniportDriver...\n")); + //NDIS_DECLARE_MINIPORT_DRIVER_CONTEXT(TAP_GLOBAL); + status = NdisMRegisterMiniportDriver( + DriverObject, + RegistryPath, + &GlobalData, + &miniportCharacteristics, + &GlobalData.NdisDriverHandle + ); + + if (NDIS_STATUS_SUCCESS == status) + { + DEBUGP (("[TAP] Registered miniport successfully\n")); + } + else + { + DEBUGP(("[TAP] NdisMRegisterMiniportDriver failed: %8.8X\n", status)); + TapDriverUnload(DriverObject); + status = NDIS_STATUS_FAILURE; + break; + } + } while(FALSE); + + DEBUGP (("[TAP] <-- DriverEntry; status = %8.8X\n",status)); + + return status; +} + +VOID +TapDriverUnload( + __in PDRIVER_OBJECT DriverObject + ) +/*++ + +Routine Description: + + The unload handler is called during driver unload to free up resources + acquired in DriverEntry. This handler is registered in DriverEntry through + NdisMRegisterMiniportDriver. Note that an unload handler differs from + a MiniportHalt function in that this unload handler releases resources that + are global to the driver, while the halt handler releases resource for a + particular adapter. + + Runs at IRQL = PASSIVE_LEVEL. + +Arguments: + + DriverObject Not used + +Return Value: + + None. + +--*/ +{ + PDEVICE_OBJECT deviceObject = DriverObject->DeviceObject; + UNICODE_STRING uniWin32NameString; + + DEBUGP (("[TAP] --> TapDriverUnload; version [%d.%d] %s %s unloaded\n", + TAP_DRIVER_MAJOR_VERSION, + TAP_DRIVER_MINOR_VERSION, + __DATE__, + __TIME__ + )); + + PAGED_CODE(); + + // + // Clean up all globals that were allocated in DriverEntry + // + + ASSERT(IsListEmpty(&GlobalData.AdapterList)); + + if(GlobalData.NdisDriverHandle != NULL ) + { + NdisMDeregisterMiniportDriver(GlobalData.NdisDriverHandle); + } + + DEBUGP (("[TAP] <-- TapDriverUnload\n")); +} + diff --git a/installer/tap/src/src/txpath.c b/installer/tap/src/src/txpath.c new file mode 100644 index 0000000..f627934 --- /dev/null +++ b/installer/tap/src/src/txpath.c @@ -0,0 +1,1166 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +// +// Include files. 
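+//
+// One property of the RFC 1071 one's-complement sum implemented by
+// icmpv6_checksum() below: recomputing it over a packet whose checksum field
+// has already been filled in yields 0, which allows a cheap self-check
+// (a sketch, usable right after the checksum is stored):
+//
+//     ASSERT(icmpv6_checksum((UCHAR *) &na->icmpv6, icmpv6_len,
+//                            na->ipv6.saddr, na->ipv6.daddr) == 0);
+//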
+//
+
+#include "tap.h"
+
+//======================================================================
+// TAP Send Path Support
+//======================================================================
+
+#ifdef ALLOC_PRAGMA
+#pragma alloc_text( PAGE, TapDeviceRead)
+#endif // ALLOC_PRAGMA
+
+// checksum code for ICMPv6 packet, taken from dhcp.c / udp_checksum
+// see RFC 4443, 2.3, and RFC 2460, 8.1
+USHORT
+icmpv6_checksum(
+    __in const UCHAR *buf,
+    __in const int len_icmpv6,
+    __in const UCHAR *saddr6,
+    __in const UCHAR *daddr6
+    )
+{
+    USHORT word16;
+    ULONG sum = 0;
+    int i;
+
+    // make 16 bit words out of every two adjacent 8 bit bytes and
+    // calculate the sum of all 16 bit words
+    for (i = 0; i < len_icmpv6; i += 2)
+    {
+        word16 = ((buf[i] << 8) & 0xFF00) + ((i + 1 < len_icmpv6) ? (buf[i+1] & 0xFF) : 0);
+        sum += word16;
+    }
+
+    // add the IPv6 pseudo header which contains the IP source and destination addresses
+    for (i = 0; i < 16; i += 2)
+    {
+        word16 = ((saddr6[i] << 8) & 0xFF00) + (saddr6[i+1] & 0xFF);
+        sum += word16;
+    }
+
+    for (i = 0; i < 16; i += 2)
+    {
+        word16 = ((daddr6[i] << 8) & 0xFF00) + (daddr6[i+1] & 0xFF);
+        sum += word16;
+    }
+
+    // the next-header number and the length of the ICMPv6 packet
+    sum += (USHORT) IPPROTO_ICMPV6 + (USHORT) len_icmpv6;
+
+    // keep only the last 16 bits of the 32 bit calculated sum and add the carries
+    while (sum >> 16)
+        sum = (sum & 0xFFFF) + (sum >> 16);
+
+    // Take the one's complement of sum
+    return ((USHORT) ~sum);
+}
+
+// check IPv6 packet for "is this an IPv6 Neighbor Solicitation that
+// the tap driver needs to answer?"
+// see RFC 4861 4.3 for the different cases
+static IPV6ADDR IPV6_NS_TARGET_MCAST =
+    { 0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+      0x00, 0x00, 0x00, 0x01, 0xff, 0x00, 0x00, 0x08 };
+static IPV6ADDR IPV6_NS_TARGET_UNICAST =
+    { 0xfe, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+      0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08 };
+
+BOOLEAN
+HandleIPv6NeighborDiscovery(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in UCHAR * m_Data
+    )
+{
+    const IPV6HDR *ipv6 = (IPV6HDR *) (m_Data + sizeof (ETH_HEADER));
+    const ICMPV6_NS * icmpv6_ns = (ICMPV6_NS *) (m_Data + sizeof (ETH_HEADER) + sizeof (IPV6HDR));
+    ICMPV6_NA_PKT *na;
+    USHORT icmpv6_len, icmpv6_csum;
+
+    // we don't really care about the destination MAC address here
+    // - it's either a multicast MAC, or the userland destination MAC
+    // but since the TAP driver is point-to-point, all packets are "for us"
+
+    // IPv6 target address must be ff02::1:ff00:8 (solicited-node multicast
+    // for the initial NS) or fe80::8 (unicast for recurrent NUD)
+    if ( memcmp( ipv6->daddr, IPV6_NS_TARGET_MCAST,
+                 sizeof(IPV6ADDR) ) != 0 &&
+         memcmp( ipv6->daddr, IPV6_NS_TARGET_UNICAST,
+                 sizeof(IPV6ADDR) ) != 0 )
+    {
+        return FALSE;                // wrong target address
+    }
+
+    // IPv6 Next-Header must be ICMPv6
+    if ( ipv6->nexthdr != IPPROTO_ICMPV6 )
+    {
+        return FALSE;                // wrong next-header
+    }
+
+    // ICMPv6 type+code must be 135/0 for NS
+    if ( icmpv6_ns->type != ICMPV6_TYPE_NS ||
+         icmpv6_ns->code != ICMPV6_CODE_0 )
+    {
+        return FALSE;                // wrong ICMPv6 type
+    }
+
+    // ICMPv6 target address must be fe80::8 (magic)
+    if ( memcmp( icmpv6_ns->target_addr, IPV6_NS_TARGET_UNICAST,
+                 sizeof(IPV6ADDR) ) != 0 )
+    {
+        return FALSE;                // not for us
+    }
+
+    // packet identified, build magic response packet
+
+    na = (ICMPV6_NA_PKT *) MemAlloc (sizeof (ICMPV6_NA_PKT), TRUE);
+    if ( !na ) return FALSE;
+
+    //------------------------------------------------
+    // Initialize Neighbour Advertisement reply packet
+    //------------------------------------------------
+
+    // ethernet header
+    na->eth.proto = htons(NDIS_ETH_TYPE_IPV6);
+    ETH_COPY_NETWORK_ADDRESS(na->eth.dest, Adapter->PermanentAddress);
+    ETH_COPY_NETWORK_ADDRESS(na->eth.src, Adapter->m_TapToUser.dest);
+
+    // IPv6 header
+    na->ipv6.version_prio = ipv6->version_prio;
+    NdisMoveMemory( na->ipv6.flow_lbl, ipv6->flow_lbl,
+                    sizeof(na->ipv6.flow_lbl) );
+    icmpv6_len = sizeof(ICMPV6_NA_PKT) - sizeof(ETH_HEADER) - sizeof(IPV6HDR);
+    na->ipv6.payload_len = htons(icmpv6_len);
+    na->ipv6.nexthdr = IPPROTO_ICMPV6;
+    na->ipv6.hop_limit = 255;
+    NdisMoveMemory( na->ipv6.saddr, IPV6_NS_TARGET_UNICAST,
+                    sizeof(IPV6ADDR) );
+    NdisMoveMemory( na->ipv6.daddr, ipv6->saddr,
+                    sizeof(IPV6ADDR) );
+
+    // ICMPv6
+    na->icmpv6.type = ICMPV6_TYPE_NA;
+    na->icmpv6.code = ICMPV6_CODE_0;
+    na->icmpv6.checksum = 0;
+    na->icmpv6.rso_bits = 0x60;        // Solicited + Override
+    NdisZeroMemory( na->icmpv6.reserved, sizeof(na->icmpv6.reserved) );
+    NdisMoveMemory( na->icmpv6.target_addr, IPV6_NS_TARGET_UNICAST,
+                    sizeof(IPV6ADDR) );
+
+    // ICMPv6 option "Target Link Layer Address"
+    na->icmpv6.opt_type = ICMPV6_OPTION_TLLA;
+    na->icmpv6.opt_length = ICMPV6_LENGTH_TLLA;
+    ETH_COPY_NETWORK_ADDRESS( na->icmpv6.target_macaddr, Adapter->m_TapToUser.dest );
+
+    // calculate and set checksum
+    icmpv6_csum = icmpv6_checksum (
+                      (UCHAR*) &(na->icmpv6),
+                      icmpv6_len,
+                      na->ipv6.saddr,
+                      na->ipv6.daddr
+                      );
+
+    na->icmpv6.checksum = htons( icmpv6_csum );
+
+    DUMP_PACKET ("HandleIPv6NeighborDiscovery",
+                 (unsigned char *) na,
+                 sizeof (ICMPV6_NA_PKT));
+
+    IndicateReceivePacket (Adapter, (UCHAR *) na, sizeof (ICMPV6_NA_PKT));
+
+    MemFree (na, sizeof (ICMPV6_NA_PKT));
+
+    return TRUE;                // all fine
+}
+
+//===================================================
+// Generate an ARP reply message for specific kinds
+// of ARP queries.
+//===================================================
+BOOLEAN
+ProcessARP(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in const PARP_PACKET src,
+    __in const IPADDR adapter_ip,
+    __in const IPADDR ip_network,
+    __in const IPADDR ip_netmask,
+    __in const MACADDR mac
+    )
+{
+    //-----------------------------------------------
+    // Is this the kind of packet we are looking for?
+ //----------------------------------------------- + if (src->m_Proto == htons (NDIS_ETH_TYPE_ARP) + && MAC_EQUAL (src->m_MAC_Source, Adapter->PermanentAddress) + && MAC_EQUAL (src->m_ARP_MAC_Source, Adapter->PermanentAddress) + && ETH_IS_BROADCAST(src->m_MAC_Destination) + && src->m_ARP_Operation == htons (ARP_REQUEST) + && src->m_MAC_AddressType == htons (MAC_ADDR_TYPE) + && src->m_MAC_AddressSize == sizeof (MACADDR) + && src->m_PROTO_AddressType == htons (NDIS_ETH_TYPE_IPV4) + && src->m_PROTO_AddressSize == sizeof (IPADDR) + && src->m_ARP_IP_Source == adapter_ip + && (src->m_ARP_IP_Destination & ip_netmask) == ip_network + && src->m_ARP_IP_Destination != adapter_ip) + { + ARP_PACKET *arp = (ARP_PACKET *) MemAlloc (sizeof (ARP_PACKET), TRUE); + if (arp) + { + //---------------------------------------------- + // Initialize ARP reply fields + //---------------------------------------------- + arp->m_Proto = htons (NDIS_ETH_TYPE_ARP); + arp->m_MAC_AddressType = htons (MAC_ADDR_TYPE); + arp->m_PROTO_AddressType = htons (NDIS_ETH_TYPE_IPV4); + arp->m_MAC_AddressSize = sizeof (MACADDR); + arp->m_PROTO_AddressSize = sizeof (IPADDR); + arp->m_ARP_Operation = htons (ARP_REPLY); + + //---------------------------------------------- + // ARP addresses + //---------------------------------------------- + ETH_COPY_NETWORK_ADDRESS (arp->m_MAC_Source, mac); + ETH_COPY_NETWORK_ADDRESS (arp->m_MAC_Destination, Adapter->PermanentAddress); + ETH_COPY_NETWORK_ADDRESS (arp->m_ARP_MAC_Source, mac); + ETH_COPY_NETWORK_ADDRESS (arp->m_ARP_MAC_Destination, Adapter->PermanentAddress); + arp->m_ARP_IP_Source = src->m_ARP_IP_Destination; + arp->m_ARP_IP_Destination = adapter_ip; + + DUMP_PACKET ("ProcessARP", + (unsigned char *) arp, + sizeof (ARP_PACKET)); + + IndicateReceivePacket (Adapter, (UCHAR *) arp, sizeof (ARP_PACKET)); + + MemFree (arp, sizeof (ARP_PACKET)); + } + + return TRUE; + } + else + return FALSE; +} + +//============================================================= +// CompleteIRP is normally called with an adapter -> userspace +// network packet and an IRP (Pending I/O request) from userspace. +// +// The IRP will normally represent a queued overlapped read +// operation from userspace that is in a wait state. +// +// Use the ethernet packet to satisfy the IRP. +//============================================================= + +VOID +tapCompletePendingReadIrp( + __in PIRP Irp, + __in PTAP_PACKET TapPacket + ) +{ + int offset; + int len; + NTSTATUS status = STATUS_UNSUCCESSFUL; + + ASSERT(Irp); + ASSERT(TapPacket); + + //------------------------------------------- + // While TapPacket always contains a + // full ethernet packet, including the + // ethernet header, in point-to-point mode, + // we only want to return the IPv4 + // component. 
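+  // Seen from user mode, this is why a tun-mode read returns a bare IP
+  // datagram while a tap-mode read returns a whole Ethernet frame. A
+  // hedged sketch of the reader side (synchronous for brevity; the
+  // driver expects queued overlapped reads in practice):
+  //
+  //     UCHAR buf[2048]; DWORD n;
+  //     if (ReadFile(h, buf, sizeof(buf), &n, NULL))
+  //         handle_packet(buf, n);   // assumed user-mode helper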
+ //------------------------------------------- + + if (TapPacket->m_SizeFlags & TP_TUN) + { + offset = ETHERNET_HEADER_SIZE; + len = (int) (TapPacket->m_SizeFlags & TP_SIZE_MASK) - ETHERNET_HEADER_SIZE; + } + else + { + offset = 0; + len = (TapPacket->m_SizeFlags & TP_SIZE_MASK); + } + + if (len < 0 || (int) Irp->IoStatus.Information < len) + { + Irp->IoStatus.Information = 0; + Irp->IoStatus.Status = status = STATUS_BUFFER_OVERFLOW; + NOTE_ERROR (); + } + else + { + Irp->IoStatus.Information = len; + Irp->IoStatus.Status = status = STATUS_SUCCESS; + + // Copy packet data + NdisMoveMemory( + Irp->AssociatedIrp.SystemBuffer, + TapPacket->m_Data + offset, + len + ); + } + + // Free the TAP packet + NdisFreeMemory(TapPacket,0,0); + + // Complete the IRP + IoCompleteRequest (Irp, IO_NETWORK_INCREMENT); +} + +VOID +tapProcessSendPacketQueue( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + KIRQL irql; + + // Process the send packet queue + KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + + while(Adapter->SendPacketQueue.Count > 0 ) + { + PIRP irp; + PTAP_PACKET tapPacket; + + // Fetch a read IRP + irp = IoCsqRemoveNextIrp( + &Adapter->PendingReadIrpQueue.CsqQueue, + NULL + ); + + if( irp == NULL ) + { + // No IRP to satisfy + break; + } + + // Fetch a queued TAP send packet + tapPacket = tapPacketRemoveHeadLocked( + &Adapter->SendPacketQueue + ); + + ASSERT(tapPacket); + + // BUGBUG!!! Investigate whether release/reacquire can cause + // out-of-order IRP completion. Also, whether user-mode can + // tolerate out-of-order packets. + + // Release packet queue lock while completing the IRP + //KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); + + // Complete the read IRP from queued TAP send packet. + tapCompletePendingReadIrp(irp,tapPacket); + + // Reqcquire packet queue lock after completing the IRP + //KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + } + + KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); +} + +// Flush the pending send TAP packet queue. +VOID +tapFlushSendPacketQueue( + __in PTAP_ADAPTER_CONTEXT Adapter + ) +{ + KIRQL irql; + + // Process the send packet queue + KeAcquireSpinLock(&Adapter->SendPacketQueue.QueueLock,&irql); + + DEBUGP (("[TAP] tapFlushSendPacketQueue: Flushing %d TAP packets\n", + Adapter->SendPacketQueue.Count)); + + while(Adapter->SendPacketQueue.Count > 0 ) + { + PTAP_PACKET tapPacket; + + // Fetch a queued TAP send packet + tapPacket = tapPacketRemoveHeadLocked( + &Adapter->SendPacketQueue + ); + + ASSERT(tapPacket); + + // Free the TAP packet + NdisFreeMemory(tapPacket,0,0); + } + + KeReleaseSpinLock(&Adapter->SendPacketQueue.QueueLock,irql); +} + +VOID +tapAdapterTransmit( + __in PTAP_ADAPTER_CONTEXT Adapter, + __in PNET_BUFFER NetBuffer, + __in BOOLEAN DispatchLevel + ) +/*++ + +Routine Description: + + This routine is called to transmit an individual net buffer using a + style similar to the previous NDIS 5 AdapterTransmit function. + + In this implementation adapter state and NB length checks have already + been done before this function has been called. + + The net buffer will be completed by the calling routine after this + routine exits. So, under this design it is necessary to make a deep + copy of frame data in the net buffer. + + This routine creates a flat buffer copy of NB frame data. This is an + unnecessary performance bottleneck. However, the bottleneck is probably + not significant or measurable except for adapters running at 1Gbps or + greater speeds. 
Since this adapter is currently running at 100Mbps this + defect can be ignored. + + Runs at IRQL <= DISPATCH_LEVEL + +Arguments: + + Adapter Pointer to our adapter context + NetBuffer Pointer to the net buffer to transmit + DispatchLevel TRUE if called at IRQL == DISPATCH_LEVEL + +Return Value: + + None. + + In the Microsoft NDIS 6 architecture there is no per-packet status. + +--*/ +{ + NDIS_STATUS status; + ULONG packetLength; + PTAP_PACKET tapPacket; + PVOID packetData; + + packetLength = NET_BUFFER_DATA_LENGTH(NetBuffer); + + // Allocate TAP packet memory + tapPacket = (PTAP_PACKET )NdisAllocateMemoryWithTagPriority( + Adapter->MiniportAdapterHandle, + TAP_PACKET_SIZE (packetLength), + TAP_PACKET_TAG, + NormalPoolPriority + ); + + if(tapPacket == NULL) + { + DEBUGP (("[TAP] tapAdapterTransmit: TAP packet allocation failed\n")); + return; + } + + tapPacket->m_SizeFlags = (packetLength & TP_SIZE_MASK); + + // + // Reassemble packet contents + // -------------------------- + // NdisGetDataBuffer does most of the work. There are two cases: + // + // 1.) If the NB data was not contiguous it will copy the entire + // NB's data to m_data and return pointer to m_data. + // 2.) If the NB data was contiguous it returns a pointer to the + // first byte of the contiguous data instead of a pointer to m_Data. + // In this case the data will not have been copied to m_Data. Copy + // to m_Data will need to be done in an extra step. + // + // Case 1.) is the most likely in normal operation. + // + packetData = NdisGetDataBuffer(NetBuffer,packetLength,tapPacket->m_Data,1,0); + + if(packetData == NULL) + { + DEBUGP (("[TAP] tapAdapterTransmit: Could not get packet data\n")); + + NdisFreeMemory(tapPacket,0,0); + + return; + } + + if(packetData != tapPacket->m_Data) + { + // Packet data was contiguous and not yet copied to m_Data. + NdisMoveMemory(tapPacket->m_Data,packetData,packetLength); + } + + DUMP_PACKET ("AdapterTransmit", tapPacket->m_Data, packetLength); + + //===================================================== + // If IPv4 packet, check whether or not packet + // was truncated. + //===================================================== +#if PACKET_TRUNCATION_CHECK + IPv4PacketSizeVerify( + tapPacket->m_Data, + packetLength, + FALSE, + "TX", + &Adapter->m_TxTrunc + ); +#endif + + //===================================================== + // Are we running in DHCP server masquerade mode? + // + // If so, catch both DHCP requests and ARP queries + // to resolve the address of our virtual DHCP server. + //===================================================== + if (Adapter->m_dhcp_enabled) + { + const ETH_HEADER *eth = (ETH_HEADER *) tapPacket->m_Data; + const IPHDR *ip = (IPHDR *) (tapPacket->m_Data + sizeof (ETH_HEADER)); + const UDPHDR *udp = (UDPHDR *) (tapPacket->m_Data + sizeof (ETH_HEADER) + sizeof (IPHDR)); + + // ARP packet? + if (packetLength == sizeof (ARP_PACKET) + && eth->proto == htons (NDIS_ETH_TYPE_ARP) + && Adapter->m_dhcp_server_arp + ) + { + if (ProcessARP( + Adapter, + (PARP_PACKET) tapPacket->m_Data, + Adapter->m_dhcp_addr, + Adapter->m_dhcp_server_ip, + ~0, + Adapter->m_dhcp_server_mac) + ) + { + goto no_queue; + } + } + + // DHCP packet? 
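+      // The match below is deliberately narrow: an option-less IPv4 header
+      // (version_len 0x45), UDP destined to the bootps port, and at least
+      // one byte of DHCP options past the fixed-size DHCP header.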
+ else if (packetLength >= sizeof (ETH_HEADER) + sizeof (IPHDR) + sizeof (UDPHDR) + sizeof (DHCP) + && eth->proto == htons (NDIS_ETH_TYPE_IPV4) + && ip->version_len == 0x45 // IPv4, 20 byte header + && ip->protocol == IPPROTO_UDP + && udp->dest == htons (BOOTPS_PORT) + ) + { + const DHCP *dhcp = (DHCP *) (tapPacket->m_Data + + sizeof (ETH_HEADER) + + sizeof (IPHDR) + + sizeof (UDPHDR)); + + const int optlen = packetLength + - sizeof (ETH_HEADER) + - sizeof (IPHDR) + - sizeof (UDPHDR) + - sizeof (DHCP); + + if (optlen > 0) // we must have at least one DHCP option + { + if (ProcessDHCP (Adapter, eth, ip, udp, dhcp, optlen)) + { + goto no_queue; + } + } + else + { + goto no_queue; + } + } + } + + //=============================================== + // In Point-To-Point mode, check to see whether + // packet is ARP (handled) or IPv4 (sent to app). + // IPv6 packets are inspected for neighbour discovery + // (to be handled locally), and the rest is forwarded + // all other protocols are dropped + //=============================================== + if (Adapter->m_tun) + { + ETH_HEADER *e; + + e = (ETH_HEADER *) tapPacket->m_Data; + + switch (ntohs (e->proto)) + { + case NDIS_ETH_TYPE_ARP: + + // Make sure that packet is the right size for ARP. + if (packetLength != sizeof (ARP_PACKET)) + { + goto no_queue; + } + + ProcessARP ( + Adapter, + (PARP_PACKET) tapPacket->m_Data, + Adapter->m_localIP, + Adapter->m_remoteNetwork, + Adapter->m_remoteNetmask, + Adapter->m_TapToUser.dest + ); + + default: + goto no_queue; + + case NDIS_ETH_TYPE_IPV4: + + // Make sure that packet is large enough to be IPv4. + if (packetLength < (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE)) + { + goto no_queue; + } + + // Only accept directed packets, not broadcasts. + if (memcmp (e, &Adapter->m_TapToUser, ETHERNET_HEADER_SIZE)) + { + goto no_queue; + } + + // Packet looks like IPv4, queue it. :-) + tapPacket->m_SizeFlags |= TP_TUN; + break; + + case NDIS_ETH_TYPE_IPV6: + // Make sure that packet is large enough to be IPv6. + if (packetLength < (ETHERNET_HEADER_SIZE + IPV6_HEADER_SIZE)) + { + goto no_queue; + } + + // Broadcasts and multicasts are handled specially + // (to be implemented) + + // Neighbor discovery packets to fe80::8 are special + // OpenVPN sets this next-hop to signal "handled by tapdrv" + if ( HandleIPv6NeighborDiscovery(Adapter,tapPacket->m_Data) ) + { + goto no_queue; + } + + // Packet looks like IPv6, queue it. :-) + tapPacket->m_SizeFlags |= TP_TUN; + } + } + + //=============================================== + // Push packet onto queue to wait for read from + // userspace. + //=============================================== + if(tapAdapterReadAndWriteReady(Adapter)) + { + tapPacketQueueInsertTail(&Adapter->SendPacketQueue,tapPacket); + } + else + { + // + // Tragedy. All this work and the packet is of no use... + // + NdisFreeMemory(tapPacket,0,0); + } + + // Return after queuing or freeing TAP packet. + return; + + // Free TAP packet without queuing. 
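+  // Everything that lands here was either answered locally (ARP, IPv6
+  // neighbor discovery, DHCP masquerade) or is not forwardable in the
+  // current mode; it is freed without updating any statistics.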
+no_queue:
+    if(tapPacket != NULL )
+    {
+        NdisFreeMemory(tapPacket,0,0);
+    }
+
+    return;
+}
+
+VOID
+tapSendNetBufferListsComplete(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in NDIS_STATUS SendCompletionStatus,
+    __in BOOLEAN DispatchLevel
+    )
+{
+    PNET_BUFFER_LIST currentNbl;
+    PNET_BUFFER_LIST nextNbl = NULL;
+    ULONG sendCompleteFlags = 0;
+
+    for (
+        currentNbl = NetBufferLists;
+        currentNbl != NULL;
+        currentNbl = nextNbl
+        )
+    {
+        ULONG frameType;
+        ULONG netBufferCount;
+        ULONG byteCount;
+
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Set NBL completion status.
+        NET_BUFFER_LIST_STATUS(currentNbl) = SendCompletionStatus;
+
+        // Fetch the first NB's frame type. All linked NBs will have same type.
+        frameType = tapGetNetBufferFrameType(NET_BUFFER_LIST_FIRST_NB(currentNbl));
+
+        // Fetch statistics for all NBs linked to the NBL.
+        netBufferCount = tapGetNetBufferCountsFromNetBufferList(
+                             currentNbl,
+                             &byteCount
+                             );
+
+        // Update statistics by frame type
+        if(SendCompletionStatus == NDIS_STATUS_SUCCESS)
+        {
+            switch(frameType)
+            {
+            case NDIS_PACKET_TYPE_DIRECTED:
+                Adapter->FramesTxDirected += netBufferCount;
+                Adapter->BytesTxDirected += byteCount;
+                break;
+
+            case NDIS_PACKET_TYPE_BROADCAST:
+                Adapter->FramesTxBroadcast += netBufferCount;
+                Adapter->BytesTxBroadcast += byteCount;
+                break;
+
+            case NDIS_PACKET_TYPE_MULTICAST:
+                Adapter->FramesTxMulticast += netBufferCount;
+                Adapter->BytesTxMulticast += byteCount;
+                break;
+
+            default:
+                ASSERT(FALSE);
+                break;
+            }
+        }
+        else
+        {
+            // Transmit error.
+            Adapter->TransmitFailuresOther += netBufferCount;
+        }
+
+        currentNbl = nextNbl;
+    }
+
+    if(DispatchLevel)
+    {
+        sendCompleteFlags |= NDIS_SEND_COMPLETE_FLAGS_DISPATCH_LEVEL;
+    }
+
+    // Complete the NBLs
+    NdisMSendNetBufferListsComplete(
+        Adapter->MiniportAdapterHandle,
+        NetBufferLists,
+        sendCompleteFlags
+        );
+}
+
+BOOLEAN
+tapNetBufferListNetBufferLengthsValid(
+    __in PTAP_ADAPTER_CONTEXT Adapter,
+    __in PNET_BUFFER_LIST NetBufferLists
+    )
+/*++
+
+Routine Description:
+
+    Scan all NBLs and their linked NBs for valid lengths.
+
+    Fairly absurd to find any packets with bogus lengths, but wise
+    to check anyway. If ANY packet has a bogus length, then abort the
+    entire send.
+
+    The only time that one might see this check fail might be during
+    HCK driver testing. The HCK test might send oversize packets to
+    determine if the miniport can gracefully deal with them.
+
+    This check is fairly fast. Unlike NDIS 5 packets, fetching NDIS 6
+    packet lengths does not require any computation.
+
+Arguments:
+
+    Adapter           Pointer to our adapter context
+    NetBufferLists    Head of a list of NBLs to examine
+
+Return Value:
+
+    Returns TRUE if all NBs have reasonable lengths.
+    Otherwise, returns FALSE.
+
+--*/
+{
+    PNET_BUFFER_LIST currentNbl;
+
+    currentNbl = NetBufferLists;
+
+    while (currentNbl)
+    {
+        PNET_BUFFER_LIST nextNbl;
+        PNET_BUFFER currentNb;
+
+        // Locate next NBL
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Locate first NB (aka "packet")
+        currentNb = NET_BUFFER_LIST_FIRST_NB(currentNbl);
+
+        //
+        // Process all NBs linked to this NBL
+        //
+        while(currentNb)
+        {
+            PNET_BUFFER nextNb;
+            ULONG packetLength;
+
+            // Locate next NB
+            nextNb = NET_BUFFER_NEXT_NB(currentNb);
+
+            packetLength = NET_BUFFER_DATA_LENGTH(currentNb);
+
+            // Minimum packet size is size of Ethernet plus IPv4 headers.
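+            // The ASSERTs document the expectation on checked builds; the
+            // explicit range checks after them keep free builds safe when,
+            // for example, an HCK test deliberately sends bogus lengths.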
+            ASSERT(packetLength >= (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE));
+
+            if(packetLength < (ETHERNET_HEADER_SIZE + IP_HEADER_SIZE))
+            {
+                return FALSE;
+            }
+
+            // Maximum size should be Ethernet header size plus MTU plus modest pad for
+            // VLAN tag.
+            ASSERT( packetLength <= (ETHERNET_HEADER_SIZE + VLAN_TAG_SIZE + Adapter->MtuSize));
+
+            if(packetLength > (ETHERNET_HEADER_SIZE + VLAN_TAG_SIZE + Adapter->MtuSize))
+            {
+                return FALSE;
+            }
+
+            // Move to next NB
+            currentNb = nextNb;
+        }
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+
+    return TRUE;
+}
+
+VOID
+AdapterSendNetBufferLists(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PNET_BUFFER_LIST NetBufferLists,
+    __in NDIS_PORT_NUMBER PortNumber,
+    __in ULONG SendFlags
+    )
+/*++
+
+Routine Description:
+
+    Send Packet Array handler. Called by NDIS whenever a protocol
+    bound to our miniport sends one or more packets.
+
+    The input packet descriptor pointers have been ordered according
+    to the order in which the packets should be sent over the network
+    by the protocol driver that set up the packet array. The NDIS
+    library preserves the protocol-determined ordering when it submits
+    each packet array to MiniportSendPackets.
+
+    As a deserialized driver, we are responsible for holding incoming send
+    packets in our internal queue until they can be transmitted over the
+    network and for preserving the protocol-determined ordering of packet
+    descriptors incoming to its MiniportSendPackets function.
+    A deserialized miniport driver must complete each incoming send packet
+    with NdisMSendComplete, and it cannot call NdisMSendResourcesAvailable.
+
+    Runs at IRQL <= DISPATCH_LEVEL
+
+Arguments:
+
+    MiniportAdapterContext    Pointer to our adapter
+    NetBufferLists            Head of a list of NBLs to send
+    PortNumber                A miniport adapter port. Default is 0.
+    SendFlags                 Additional flags for the send operation
+
+Return Value:
+
+    None. Write status directly into each NBL with the NET_BUFFER_LIST_STATUS
+    macro.
+
+--*/
+{
+    NDIS_STATUS status;
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+    BOOLEAN DispatchLevel = (SendFlags & NDIS_SEND_FLAGS_DISPATCH_LEVEL);
+    PNET_BUFFER_LIST currentNbl;
+    BOOLEAN validNbLengths;
+
+    UNREFERENCED_PARAMETER(NetBufferLists);
+    UNREFERENCED_PARAMETER(PortNumber);
+    UNREFERENCED_PARAMETER(SendFlags);
+
+    ASSERT(PortNumber == 0); // Only the default port is supported
+
+    //
+    // Can't process sends if TAP device is not open.
+    // ----------------------------------------------
+    // Just perform a "lying send" and return packets as if they
+    // were successfully sent.
+    //
+    if(adapter->TapFileObject == NULL)
+    {
+        //
+        // Complete all NBLs and return if adapter not ready.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            NDIS_STATUS_SUCCESS,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Check Adapter send/receive ready state.
+    //
+    status = tapAdapterSendAndReceiveReady(adapter);
+
+    if(status != NDIS_STATUS_SUCCESS)
+    {
+        //
+        // Complete all NBLs and return if adapter not ready.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            status,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Scan all NBLs and linked packets for valid lengths.
+    // ---------------------------------------------------
+    // If _ANY_ NB length is invalid, then fail the entire send operation.
+    //
+    // BUGBUG!!! Perhaps this should be less aggressive. Fail only individual
+    // NBLs...
+    //
+    // If length check is valid, then TAP_PACKETS can be safely allocated
+    // and processed for all NBs being sent.
+    //
+    validNbLengths = tapNetBufferListNetBufferLengthsValid(
+                         adapter,
+                         NetBufferLists
+                         );
+
+    if(!validNbLengths)
+    {
+        //
+        // Complete all NBLs and return if any NB length is invalid.
+        //
+        tapSendNetBufferListsComplete(
+            adapter,
+            NetBufferLists,
+            NDIS_STATUS_INVALID_LENGTH,
+            DispatchLevel
+            );
+
+        return;
+    }
+
+    //
+    // Process each NBL individually
+    //
+    currentNbl = NetBufferLists;
+
+    while (currentNbl)
+    {
+        PNET_BUFFER_LIST nextNbl;
+        PNET_BUFFER currentNb;
+
+        // Locate next NBL
+        nextNbl = NET_BUFFER_LIST_NEXT_NBL(currentNbl);
+
+        // Locate first NB (aka "packet")
+        currentNb = NET_BUFFER_LIST_FIRST_NB(currentNbl);
+
+        // Transmit all NBs linked to this NBL
+        while(currentNb)
+        {
+            PNET_BUFFER nextNb;
+
+            // Locate next NB
+            nextNb = NET_BUFFER_NEXT_NB(currentNb);
+
+            // Transmit the NB
+            tapAdapterTransmit(adapter,currentNb,DispatchLevel);
+
+            // Move to next NB
+            currentNb = nextNb;
+        }
+
+        // Move to next NBL
+        currentNbl = nextNbl;
+    }
+
+    // Complete all NBLs
+    tapSendNetBufferListsComplete(
+        adapter,
+        NetBufferLists,
+        NDIS_STATUS_SUCCESS,
+        DispatchLevel
+        );
+
+    // Attempt to complete pending read IRPs from pending TAP
+    // send packet queue.
+    tapProcessSendPacketQueue(adapter);
+}
+
+VOID
+AdapterCancelSend(
+    __in NDIS_HANDLE MiniportAdapterContext,
+    __in PVOID CancelId
+    )
+{
+    PTAP_ADAPTER_CONTEXT adapter = (PTAP_ADAPTER_CONTEXT )MiniportAdapterContext;
+
+    //
+    // This miniport completes its sends quickly, so it isn't strictly
+    // necessary to implement MiniportCancelSend.
+    //
+    // If we did implement it, we'd have to walk the Adapter->SendWaitList
+    // and look for any NB that points to a NBL where the CancelId matches
+    // NDIS_GET_NET_BUFFER_LIST_CANCEL_ID(Nbl). For any NB that so matches,
+    // we'd remove the NB from the SendWaitList and set the NBL's status to
+    // NDIS_STATUS_SEND_ABORTED, then complete the NBL.
+    //
+}
+
+// IRP_MJ_READ callback.
+NTSTATUS
+TapDeviceRead(
+    PDEVICE_OBJECT DeviceObject,
+    PIRP Irp
+    )
+{
+    NTSTATUS ntStatus = STATUS_SUCCESS;// Assume success
+    PIO_STACK_LOCATION irpSp;// Pointer to current stack location
+    PTAP_ADAPTER_CONTEXT adapter = NULL;
+
+    PAGED_CODE();
+
+    irpSp = IoGetCurrentIrpStackLocation( Irp );
+
+    //
+    // Fetch adapter context for this device.
+    // --------------------------------------
+    // Adapter pointer was stashed in FsContext when handle was opened.
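+  // (The stash happens in the device CREATE dispatch, outside this hunk;
+  // a minimal sketch of that side, with assumed local names:
+  //
+  //     irpSp->FileObject->FsContext = (PVOID) adapter;
+  //
+  // so every later read or write on the handle can recover its adapter.)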
+ // + adapter = (PTAP_ADAPTER_CONTEXT )(irpSp->FileObject)->FsContext; + + ASSERT(adapter); + + // + // Sanity checks on state variables + // + if (!tapAdapterReadAndWriteReady(adapter)) + { + //DEBUGP (("[%s] Interface is down in IRP_MJ_READ\n", + // MINIPORT_INSTANCE_ID (adapter))); + //NOTE_ERROR(); + + Irp->IoStatus.Status = ntStatus = STATUS_CANCELLED; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // Save IRP-accessible copy of buffer length + Irp->IoStatus.Information = irpSp->Parameters.Read.Length; + + if (Irp->MdlAddress == NULL) + { + DEBUGP (("[%s] MdlAddress is NULL for IRP_MJ_READ\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INVALID_PARAMETER; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + if ((Irp->AssociatedIrp.SystemBuffer + = MmGetSystemAddressForMdlSafe( + Irp->MdlAddress, + NormalPagePriority + ) ) == NULL + ) + { + DEBUGP (("[%s] Could not map address in IRP_MJ_READ\n", + MINIPORT_INSTANCE_ID (adapter))); + + NOTE_ERROR(); + Irp->IoStatus.Status = ntStatus = STATUS_INSUFFICIENT_RESOURCES; + Irp->IoStatus.Information = 0; + IoCompleteRequest (Irp, IO_NO_INCREMENT); + + return ntStatus; + } + + // BUGBUG!!! Use RemoveLock??? + + // + // Queue the IRP and return STATUS_PENDING. + // ---------------------------------------- + // Note: IoCsqInsertIrp marks the IRP pending. + // + + // BUGBUG!!! NDIS 5 implementation has IRP_QUEUE_SIZE of 16 and + // does not queue IRP if this capacity is exceeded. + // + // Is this needed??? + // + IoCsqInsertIrp(&adapter->PendingReadIrpQueue.CsqQueue, Irp, NULL); + + // Attempt to complete pending read IRPs from pending TAP + // send packet queue. + tapProcessSendPacketQueue(adapter); + + ntStatus = STATUS_PENDING; + + return ntStatus; +} + diff --git a/installer/tap/src/src/types.h b/installer/tap/src/src/types.h new file mode 100644 index 0000000..acea175 --- /dev/null +++ b/installer/tap/src/src/types.h @@ -0,0 +1,90 @@ +/* + * TAP-Windows -- A kernel driver to provide virtual tap + * device functionality on Windows. + * + * This code was inspired by the CIPE-Win32 driver by Damion K. Wilson. + * + * This source code is Copyright (C) 2002-2014 OpenVPN Technologies, Inc., + * and is released under the GPL version 2 (see below). + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program (see the file COPYING included with this + * distribution); if not, write to the Free Software Foundation, Inc., + * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef TAP_TYPES_DEFINED +#define TAP_TYPES_DEFINED + +//typedef +//struct _Queue +//{ +// ULONG base; +// ULONG size; +// ULONG capacity; +// ULONG max_size; +// PVOID data[]; +//} Queue; + +//typedef struct _TAP_PACKET; + +//typedef struct _TapExtension +//{ +// // TAP device object and packet queues +// Queue *m_PacketQueue, *m_IrpQueue; +// PDEVICE_OBJECT m_TapDevice; +// NDIS_HANDLE m_TapDeviceHandle; +// ULONG TapFileIsOpen; +// +// // Used to lock packet queues +// NDIS_SPIN_LOCK m_QueueLock; +// BOOLEAN m_AllocatedSpinlocks; +// +// // Used to bracket open/close +// // state changes. +// MUTEX m_OpenCloseMutex; +// +// // True if device has been permanently halted +// BOOLEAN m_Halt; +// +// // TAP device name +// unsigned char *m_TapName; +// UNICODE_STRING m_UnicodeLinkName; +// BOOLEAN m_CreatedUnicodeLinkName; +// +// // Used for device status ioctl only +// const char *m_LastErrorFilename; +// int m_LastErrorLineNumber; +// LONG TapFileOpenCount; +// +// // Flags +// BOOLEAN TapDeviceCreated; +// BOOLEAN m_CalledTapDeviceFreeResources; +// +// // DPC queue for deferred packet injection +// BOOLEAN m_InjectDpcInitialized; +// KDPC m_InjectDpc; +// NDIS_SPIN_LOCK m_InjectLock; +// Queue *m_InjectQueue; +//} +//TapExtension, *TapExtensionPointer; + +typedef struct _InjectPacket + { +# define INJECT_PACKET_SIZE(data_size) (sizeof (InjectPacket) + (data_size)) +# define INJECT_PACKET_FREE(ib) NdisFreeMemory ((ib), INJECT_PACKET_SIZE ((ib)->m_Size), 0) + ULONG m_Size; + UCHAR m_Data []; // m_Data must be the last struct member + } +InjectPacket, *InjectPacketPointer; + +#endif diff --git a/installer/tap/src/version.m4 b/installer/tap/src/version.m4 new file mode 100644 index 0000000..fdd605c --- /dev/null +++ b/installer/tap/src/version.m4 @@ -0,0 +1,14 @@ +dnl define the TAP version +define([PRODUCT_NAME], [TAP-Windows]) +define([PRODUCT_PUBLISHER], [OpenVPN Technologies, Inc.]) +define([PRODUCT_VERSION], [9.21.2]) +define([PRODUCT_VERSION_RESOURCE], [9,0,0,21]) +define([PRODUCT_TAP_WIN_COMPONENT_ID], [tap0901]) +define([PRODUCT_TAP_WIN_MAJOR], [9]) +define([PRODUCT_TAP_WIN_MINOR], [21]) +define([PRODUCT_TAP_WIN_REVISION], [2]) +define([PRODUCT_TAP_WIN_BUILD], [601]) +define([PRODUCT_TAP_WIN_PROVIDER], [TAP-Windows Provider V9]) +define([PRODUCT_TAP_WIN_CHARACTERISTICS], [0x81]) +define([PRODUCT_TAP_WIN_DEVICE_DESCRIPTION], [TAP-Windows Adapter V9]) +define([PRODUCT_TAP_WIN_RELDATE], [04/08/2014]) diff --git a/installer/tap/tap-windows6.nsi b/installer/tap/tap-windows6.nsi new file mode 100644 index 0000000..1580ea6 --- /dev/null +++ b/installer/tap/tap-windows6.nsi @@ -0,0 +1,321 @@ +; **************************************************************************** +; * Copyright (C) 2002-2010 OpenVPN Technologies, Inc. * +; * Copyright (C) 2012 Alon Bar-Lev * +; * This program is free software; you can redistribute it and/or modify * +; * it under the terms of the GNU General Public License version 2 * +; * as published by the Free Software Foundation. * +; **************************************************************************** + +; TAP-Windows install script for Windows, using NSIS + +SetCompressor /SOLID lzma + +!addplugindir . 
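+; The plugins invoked later in this script (e.g. ShellLink for
+; SetRunAsAdministrator) are expected to be found via this plugin directory.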
+!include "MUI.nsh" +!include "StrFunc.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!include FileFunc.nsh +!insertmacro GetParameters +!insertmacro GetOptions + +!define PRODUCT_TAP_WIN_COMPONENT_ID "tap0901" +!define PRODUCT_NAME "TunSafe-TAP" +!define PRODUCT_VERSION "9.21.2" +!define PRODUCT_PUBLISHER "TunSafe" + +${StrLoc} + +;-------------------------------- +;Configuration + +;General + + +OutFile "TunSafe-TAP-${PRODUCT_VERSION}.exe" + +BrandingText " " +ShowInstDetails show +ShowUninstDetails show + +;-------------------------------- +;Modern UI Configuration + +Name "${PRODUCT_NAME}" + +#!define MUI_WELCOMEPAGE_TEXT "This wizard will guide you through the installation of ${PRODUCT_NAME}, a kernel driver to provide virtual tap device #functionality on Windows originally written by James Yonan.\r\n\r\nNote that ${PRODUCT_NAME} will only run on Windows Vista or later.\r\n\r\n\r\n" + +!define MUI_COMPONENTSPAGE_TEXT_TOP "Select the components to install/upgrade. Stop any ${PRODUCT_NAME} processes or the ${PRODUCT_NAME} service if it is running." + +#!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE +!define MUI_TEXT_LICENSE_TITLE "Welcome to the TunSafe-TAP installer" + +#!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "COPYING" +#!insertmacro MUI_PAGE_COMPONENTS +!define MUI_PAGE_CUSTOMFUNCTION_PRE dirPre +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +#!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +#!insertmacro MUI_UNPAGE_FINISH + +;-------------------------------- +;Languages + +!insertmacro MUI_LANGUAGE "English" + +;-------------------------------- +;Language Strings + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install/Upgrade the TAP Virtual Ethernet Adapter from OpenVPN." +LangString DESC_SecTAPUtilities ${LANG_ENGLISH} "Install the TAP Utilities." + +Function dirPre + ${GetParameters} $R0 + ${GetOptions} "$R0" "/X" $R1 + IfErrors +2 0 + Abort +FunctionEnd + +;-------------------------------- +;Installer Sections + +Section "TAP Virtual Ethernet Adapter" SecTAP + SetOverwrite on + + ${If} ${RunningX64} + DetailPrint "We are running on a 64-bit system." + + SetOutPath "$INSTDIR" + File "prebuilt\x64\tapinstall.exe" + + SetOutPath "$INSTDIR\driver" + File "prebuilt\x64\OemVista.inf" + File "prebuilt\x64\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "prebuilt\x64\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${Else} + DetailPrint "We are running on a 32-bit system." 
+ + SetOutPath "$INSTDIR" + File "prebuilt\x86\tapinstall.exe" + + SetOutPath "$INSTDIR\driver" + File "prebuilt\x86\OemVista.inf" + File "prebuilt\x86\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + File "prebuilt\x86\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + ${EndIf} +SectionEnd + +Section "TAP Utilities" SecTAPUtilities + SetOverwrite on + + # Delete previous start menu + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + FileOpen $R0 "$INSTDIR\addtap.bat" w + FileWrite $R0 "rem Add a new TAP virtual ethernet adapter$\r$\n" + FileWrite $R0 '"$INSTDIR\tapinstall.exe" install "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + FileOpen $R0 "$INSTDIR\deltapall.bat" w + FileWrite $R0 "echo WARNING: this script will delete ALL TAP virtual adapters (use the device manager to delete adapters one at a time)$\r$\n" + FileWrite $R0 "pause$\r$\n" + FileWrite $R0 '"$INSTDIR\tapinstall.exe" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}$\r$\n' + FileWrite $R0 "pause$\r$\n" + FileClose $R0 + + ; Create shortcuts + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}\Utilities" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" "$INSTDIR\addtap.bat" "" + ; set runas admin flag on the addtap link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Add a new TAP virtual ethernet adapter.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on addtap failed: status = $0" + ${Endif} + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" "$INSTDIR\deltapall.bat" "" + ; set runas admin flag on the deltapall link + ShellLink::SetRunAsAdministrator "$SMPROGRAMS\${PRODUCT_NAME}\Utilities\Delete ALL TAP virtual ethernet adapters.lnk" + Pop $0 + ${If} $0 != 0 + DetailPrint "Setting RunAsAdmin flag on deltapall failed: status = $0" + ${Endif} +SectionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors + +${IfNot} ${AtLeastWin7} + MessageBox MB_OK "TunSafe-TAP requires at least Windows 7" + SetErrorLevel 1 + Quit +${EndIf} + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} $INSTDIR == "" + StrCpy $1 "$PROGRAMFILES\TunSafe\TAP" + ${If} ${RunningX64} + SetRegView 64 + StrCpy $1 "$PROGRAMFILES64\TunSafe\TAP" + ${EndIf} + ReadRegStr $INSTDIR HKLM "SOFTWARE\${PRODUCT_NAME}" "" + StrCmp $INSTDIR "" 0 +2 + StrCpy $INSTDIR $1 + ${EndIf} +FunctionEnd + +;-------------------------------- +;Dependencies + +Function .onSelChange +# ${If} ${SectionIsSelected} ${SecTAPUtilities} +# !insertmacro SelectSection ${SecTAP} +# ${EndIf} +FunctionEnd + +;-------------------- +;Post-install section + +Section -post + + ; Store README, license, icon + SetOverwrite on + SetOutPath $INSTDIR + File "COPYING" + + ${If} ${SectionIsSelected} ${SecTAP} + ; + ; install/upgrade TAP driver if selected, using devcon + ; + ; TAP install/update was selected. + ; Should we install or update? + ; If tapinstall error occurred, $R5 will + ; be nonzero. + IntOp $R5 0 & 0 + nsExec::ExecToStack '"$INSTDIR\tapinstall.exe" hwids ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + IntOp $R5 $R5 | $R0 + DetailPrint "tapinstall.exe hwids returned: $R0" + + ; If tapinstall output string contains "${PRODUCT_TAP_WIN_COMPONENT_ID}" we assume + ; that TAP device has been previously installed, + ; therefore we will update, not install. 
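+  ; Note: the haystack StrLoc searches is the tapinstall output that
+  ; nsExec::ExecToStack left on the stack above; the two pushes below
+  ; supply the needle and the search direction.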
+ Push "${PRODUCT_TAP_WIN_COMPONENT_ID}" + Push ">" + Call StrLoc + Pop $R0 + + ${If} $R5 == 0 + ${If} $R0 == "" + StrCpy $R1 "install" + ${Else} + StrCpy $R1 "update" + ${EndIf} + DetailPrint "TAP $R1 (${PRODUCT_TAP_WIN_COMPONENT_ID}) (May require confirmation)" + nsExec::ExecToLog '"$INSTDIR\tapinstall.exe" $R1 "$INSTDIR\driver\OemVista.inf" ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + ${If} $R0 == "" + IntOp $R0 0 & 0 + SetRebootFlag true + DetailPrint "REBOOT flag set" + ${EndIf} + IntOp $R5 $R5 | $R0 + DetailPrint "tapinstall.exe returned: $R0" + ${EndIf} + + DetailPrint "tapinstall.exe cumulative status: $R5" + ${If} $R5 != 0 + MessageBox MB_OK "An error occurred installing the TAP device driver." + ${EndIf} + + ; Store install folder in registry + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + ${EndIf} + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://tunsafe.com/open-source" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://tunsafe.com" + + ${GetSize} "$INSTDIR" "/S=0K" $0 $1 $2 + IntFmt $0 "0x%08X" $0 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "EstimatedSize" "$0" + + ${GetParameters} $R0 + ${GetOptions} "$R0" "/X" $R1 + IfErrors +3 0 + SetErrorLevel 0 + Quit + +SectionEnd + +;-------------------------------- +;Descriptions + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAP} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTAPUtilities} $(DESC_SecTAPUtilities) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +;-------------------------------- +;Uninstaller Section + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + DetailPrint "TAP REMOVE" + nsExec::ExecToLog '"$INSTDIR\tapinstall.exe" remove ${PRODUCT_TAP_WIN_COMPONENT_ID}' + Pop $R0 # return value/error/timeout + DetailPrint "tapinstall.exe remove returned: $R0" + + Delete "$INSTDIR\tapinstall.exe" + Delete "$INSTDIR\addtap.bat" + Delete "$INSTDIR\deltapall.bat" + + Delete "$INSTDIR\driver\OemVista.inf" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.cat" + Delete "$INSTDIR\driver\${PRODUCT_TAP_WIN_COMPONENT_ID}.sys" + + Delete "$INSTDIR\COPYING" + Delete "$INSTDIR\Uninstall.exe" + + RMDir "$INSTDIR" + RMDir "$INSTDIR\driver" + RMDir "$INSTDIR\include" + RMDir "$INSTDIR" + RMDir /r 
"$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" +SectionEnd diff --git a/installer/tunsafe.nsi b/installer/tunsafe.nsi new file mode 100644 index 0000000..7b77322 --- /dev/null +++ b/installer/tunsafe.nsi @@ -0,0 +1,214 @@ +; **************************************************************************** +; * Copyright (C) 2018 Ludde * +; **************************************************************************** + +SetCompressor /SOLID lzma + +!addplugindir . +!include "MUI2.nsh" +!include "x64.nsh" +!define MULTIUSER_EXECUTIONLEVEL Admin +!include "MultiUser.nsh" +!insertmacro GetParameters +!insertmacro GetOptions + +!define PRODUCT_NAME "TunSafe" +!define PRODUCT_PUBLISHER "TunSafe" + +OutFile "TunSafe-${PRODUCT_VERSION}.exe" + +BrandingText " " +ShowInstDetails show +ShowUninstDetails show + +Name "${PRODUCT_NAME}" + +!define MUI_COMPONENTSPAGE_SMALLDESC +!define MUI_FINISHPAGE_NOAUTOCLOSE +!define MUI_ABORTWARNING +!define MUI_ICON "icon.ico" +!define MUI_UNICON "icon.ico" +!define MUI_HEADERIMAGE +!define MUI_HEADERIMAGE_BITMAP "tap\install-whirl.bmp" +!define MUI_UNFINISHPAGE_NOAUTOCLOSE + +!define MUI_TEXT_LICENSE_TITLE "Welcome to the TunSafe installer" + +#!insertmacro MUI_PAGE_WELCOME +!insertmacro MUI_PAGE_LICENSE "LICENSE.TXT" +!insertmacro MUI_PAGE_COMPONENTS +!insertmacro MUI_PAGE_DIRECTORY +!insertmacro MUI_PAGE_INSTFILES +#!insertmacro MUI_PAGE_FINISH + +!insertmacro MUI_UNPAGE_CONFIRM +!insertmacro MUI_UNPAGE_INSTFILES +#!insertmacro MUI_UNPAGE_FINISH + +!insertmacro MUI_LANGUAGE "English" + +LangString DESC_SecTAP ${LANG_ENGLISH} "Install the TunSafe client." +LangString DESC_SecTapAdapter ${LANG_ENGLISH} "Download and Install the TunSafe-TAP Virtual Ethernet Adapter (GPL)." + +Section "TunSafe Client" SecTunSafe + SetOverwrite on + ${If} ${RunningX64} + DetailPrint "Installing 64-bit version of TunSafe." + SetOutPath "$INSTDIR" + File "x64\TunSafe.exe" + ${Else} + DetailPrint "Installing 32-bit version of TunSafe." + SetOutPath "$INSTDIR" + File "x86\TunSafe.exe" + ${EndIf} + File "License.txt" + File "ChangeLog.txt" + CreateDirectory "$INSTDIR\Config" + SetOutPath "$INSTDIR\Config" + File "TunSafe.conf" + CreateDirectory "$SMPROGRAMS\${PRODUCT_NAME}" + CreateShortCut "$SMPROGRAMS\${PRODUCT_NAME}\TunSafe.lnk" "$INSTDIR\TunSafe.exe" "" +SectionEnd + +Section "TunSafe-TAP Ethernet Adapter (GPL)" SecTapAdapter + SetOverwrite on + + Delete "$INSTDIR\tunsafe-tap-installer.exe" + NSISdl::download http://tunsafe.com/downloads/TunSafe-TAP-auto.exe "$INSTDIR\TunSafe-TAP Installer.exe" + Pop $R0 ;Get the return value + ${Unless} $R0 == "success" + MessageBox MB_ICONEXCLAMATION "An error occurred while downloading the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + NSISdl::download http://tunsafe.com/downloads/TunSafe-TAP-auto.exe.sig "$INSTDIR\TunSafe-TAP Installer.exe.sig" + ${Unless} $R0 == "success" + Delete "$INSTDIR\TunSafe-TAP Installer.exe.sig" + MessageBox MB_ICONEXCLAMATION "An error occurred while downloading the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + SignPlugin::myFunction "$INSTDIR\TunSafe-TAP Installer.exe" + Pop $R1 ;Get the return value + + Delete "$INSTDIR\TunSafe-TAP Installer.exe.sig" + + ${Unless} $R1 = 0 + MessageBox MB_ICONEXCLAMATION "The TunSafe-TAP installer that was downloaded is broken (error $R1). 
The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + + HideWindow + # Launch TunSafe-TAP installer + ExecWait '"$INSTDIR\TunSafe-TAP Installer.exe" /X /D=$INSTDIR\TAP' $1 + ShowWindow $HWNDPARENT ${SW_SHOW} + ${Unless} $1 = 0 + MessageBox MB_ICONEXCLAMATION "An error occurred while installing the TunSafe-TAP Virtual Ethernet Adapter. The installer will now abort." + SetErrorLevel 1 + Quit + ${EndUnless} + + BringToFront +SectionEnd + +Function CloseTunsafe +again: + FindWindow $0 "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90" + IntCmp $0 0 done + MessageBox MB_ICONEXCLAMATION|MB_OKCANCEL "TunSafe is currently started. The installer will close TunSafe and proceed with the installation." IDOK proceed + Quit + proceed: + SendMessage $0 1034 1 0 $1 + IntCmp $1 31337 proceed2 + MessageBox MB_ICONEXCLAMATION|MB_OKCANCEL "Unable to close TunSafe. Please close it and press OK to continue." IDOK again + Quit + proceed2: + Sleep 500 + Goto again + done: +FunctionEnd + +Function .onInit + ${GetParameters} $R0 + ClearErrors +${IfNot} ${AtLeastWin7} + MessageBox MB_OK "TunSafe requires at least Windows 7" + SetErrorLevel 1 + Quit +${EndIf} + Call CloseTunsafe + + !insertmacro MULTIUSER_INIT + SetShellVarContext all + + ${If} $INSTDIR == "" + StrCpy $1 "$PROGRAMFILES\TunSafe" + ${If} ${RunningX64} + SetRegView 64 + StrCpy $1 "$PROGRAMFILES64\TunSafe" + ${EndIf} + ReadRegStr $INSTDIR HKLM "SOFTWARE\${PRODUCT_NAME}" "" + StrCmp $INSTDIR "" 0 +2 + StrCpy $INSTDIR $1 + ${EndIf} +FunctionEnd + +Section -post + SetOverwrite on + SetOutPath $INSTDIR + + WriteRegStr HKLM SOFTWARE\${PRODUCT_NAME} "" $INSTDIR + + ; Create uninstaller + WriteUninstaller "$INSTDIR\Uninstall.exe" + + ; Show up in Add/Remove programs + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayName" "${PRODUCT_NAME} ${PRODUCT_VERSION}" + WriteRegExpandStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "UninstallString" "$INSTDIR\Uninstall.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayIcon" "$INSTDIR\TunSafe.exe" + WriteRegStr HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "DisplayVersion" "${PRODUCT_VERSION}" + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoModify" 1 + WriteRegDWORD HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "NoRepair" 1 + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "Publisher" "${PRODUCT_PUBLISHER}" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "HelpLink" "https://tunsafe.com" + WriteRegStr HKLM "SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" "URLInfoAbout" "https://tunsafe.com" + +SectionEnd + +Function .onInstSuccess + ExecShell "" "$INSTDIR\TunSafe.exe" +FunctionEnd + +!insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN +!insertmacro MUI_DESCRIPTION_TEXT ${SecTunSafe} $(DESC_SecTAP) +!insertmacro MUI_DESCRIPTION_TEXT ${SecTapAdapter} $(DESC_SecTapAdapter) +!insertmacro MUI_FUNCTION_DESCRIPTION_END + +Function un.onInit + ClearErrors + !insertmacro MULTIUSER_UNINIT + SetShellVarContext all + ${If} ${RunningX64} + SetRegView 64 + ${EndIf} +FunctionEnd + +Section "Uninstall" + Delete "$INSTDIR\TunSafe.exe" + Delete "$INSTDIR\License.txt" + Delete "$INSTDIR\ChangeLog.txt" + Delete "$INSTDIR\Config\TunSafe.conf" + Delete "$INSTDIR\Uninstall.exe" + Delete 
"$INSTDIR\TunSafe-TAP Installer.exe" + + RMDir "$INSTDIR" + RMDir "$INSTDIR\Config" + RMDir /r "$SMPROGRAMS\${PRODUCT_NAME}" + + DeleteRegKey HKLM "SOFTWARE\${PRODUCT_NAME}" + DeleteRegKey HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\${PRODUCT_NAME}" +SectionEnd diff --git a/ipzip2/ipzip2.cpp b/ipzip2/ipzip2.cpp new file mode 100644 index 0000000..1b23962 --- /dev/null +++ b/ipzip2/ipzip2.cpp @@ -0,0 +1 @@ +// this is a placeholder for a packet compression algorithm not yet released. \ No newline at end of file diff --git a/netapi.h b/netapi.h new file mode 100644 index 0000000..56af4f6 --- /dev/null +++ b/netapi.h @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#ifndef TINYVPN_NETAPI_H_ +#define TINYVPN_NETAPI_H_ + +#include "stdafx.h" +#include "tunsafe_types.h" + +#include +#include + +#if !defined(OS_WIN) +#include +#include +#include +#include +#endif + +#pragma warning (disable: 4200) + +void OsGetRandomBytes(uint8 *dst, size_t dst_size); +uint64 OsGetMilliseconds(); +void OsGetTimestampTAI64N(uint8 dst[12]); +void OsInterruptibleSleep(int millis); + +union IpAddr { + sockaddr_in sin; + sockaddr_in6 sin6; +}; + +struct WgCidrAddr { + uint8 addr[16]; + uint8 size; + uint8 cidr; +}; + +struct Packet { + union { + Packet *next; +#if defined(OS_WIN) + SLIST_ENTRY list_entry; +#endif + }; + unsigned int post_target, size; + byte *data; + +#if defined(OS_WIN) + OVERLAPPED overlapped; // For Windows overlapped IO +#endif + + IpAddr addr; // Optionally set to target/source of the packet + int sin_size; + + byte data_pre[4]; + byte data_buf[0]; + + enum { + // there's always this much data before data_ptr + HEADROOM_BEFORE = 64, + }; +}; + +enum { + kPacketAllocSize = 2048 - 16, + kPacketCapacity = kPacketAllocSize - sizeof(Packet) - Packet::HEADROOM_BEFORE, +}; + +void FreePacket(Packet *packet); +void FreePackets(Packet *packet, Packet **end, int count); +Packet *AllocPacket(); +void FreeAllPackets(); + +class TunInterface { +public: + struct PrePostCommands { + std::vector pre_up; + std::vector post_up; + std::vector pre_down; + std::vector post_down; + }; + + + struct TunConfig { + // IP address and netmask of the tun device + in_addr_t ip; + uint8 cidr; + + bool block_dns_on_adapters; + + // no, yes(firewall), yes(route), yes(both), 255(default) + uint8 internet_blocking; + + // Set this to configure a default route for ipv4 + bool use_ipv4_default_route; + + // Set this to configure a default route for ipv6 + bool use_ipv6_default_route; + + // DHCP settings + const byte *dhcp_options; + size_t dhcp_options_size; + + // This holds the address of the vpn endpoint, so those get routed to the old iface. + uint32 default_route_endpoint_v4; + + // Set mtu + int mtu; + + // Set ipv6 address? + uint8 ipv6_address[16]; + uint8 ipv6_cidr; + + bool set_ipv6_dns; + + // Set this to configure DNS server. + uint8 dns_server_v6[16]; + + // This holds the address of the vpn endpoint, so those get routed to the old iface. 
+ uint8 default_route_endpoint_v6[16]; + + // This holds all cidr addresses to add as additional routing entries + std::vector extra_routes; + + // This holds the pre/post commands + PrePostCommands pre_post_commands; + }; + + struct TunConfigOut { + bool enable_neighbor_discovery_spoofing; + uint8 neighbor_discovery_spoofing_mac[6]; + }; + + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) = 0; + virtual void WriteTunPacket(Packet *packet) = 0; +}; + +class UdpInterface { +public: + virtual bool Initialize(int listen_port) = 0; + virtual void WriteUdpPacket(Packet *packet) = 0; +}; + +extern bool g_allow_pre_post; + +#endif // TINYVPN_NETAPI_H_ diff --git a/network_bsd.cpp b/network_bsd.cpp new file mode 100644 index 0000000..b617835 --- /dev/null +++ b/network_bsd.cpp @@ -0,0 +1,898 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "netapi.h" +#include "wireguard.h" +#include "wireguard_config.h" +#include "tunsafe_endian.h" +#include "util.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#if defined(OS_MACOSX) +#include +#include +#include +#include +#include +#include +#elif defined(OS_FREEBSD) +#include +#include +#elif defined(OS_LINUX) +#include +#include +#endif + +static Packet *freelist; + +void FreePacket(Packet *packet) { + packet->next = freelist; + freelist = packet; +} + +Packet *AllocPacket() { + Packet *p = freelist; + if (p) { + freelist = p->next; + } else { + p = (Packet*)malloc(kPacketAllocSize); + if (p == NULL) { + RERROR("Allocation failure"); + abort(); + } + } + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + p->size = 0; + return p; +} + +void FreePackets() { + Packet *p; + while ( (p = freelist ) != NULL) { + freelist = p->next; + free(p); + } +} + + +#if defined(OS_MACOSX) +static mach_timebase_info_data_t timebase = { 0, 0 }; +static uint64_t initclock; + +void InitOsxGetMilliseconds() { + if (mach_timebase_info(&timebase) != 0) + abort(); + initclock = mach_absolute_time(); + + timebase.denom *= 1000000; +} + +uint64 OsGetMilliseconds() +{ + uint64_t clock = mach_absolute_time() - initclock; + return clock * (uint64_t)timebase.numer / (uint64_t)timebase.denom; +} + +#else // defined(OS_MACOSX) +uint64 OsGetMilliseconds() { + struct timespec ts; + if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { + //error + fprintf(stderr, "clock_gettime failed\n"); + exit(1); + } + return (uint64)ts.tv_sec * 1000 + (ts.tv_nsec / 1000000); +} +#endif + +void OsGetTimestampTAI64N(uint8 dst[12]) { + struct timeval tv; + gettimeofday(&tv, NULL); + uint64 secs_since_epoch = tv.tv_sec + 0x400000000000000a; + uint32 nanos = tv.tv_usec * 1000; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + +void OsGetRandomBytes(uint8 *data, size_t data_size) { + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void OsInterruptibleSleep(int millis) { + usleep((useconds_t)millis * 1000); +} + +#if defined(OS_MACOSX) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + struct sockaddr_ctl sc; + struct ctl_info ctlinfo = {0}; + int fd; + + memcpy(ctlinfo.ctl_name, UTUN_CONTROL_NAME, sizeof(UTUN_CONTROL_NAME)); + + for(int i = 0; i < 256; i++) { + fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL); + if (fd < 0) { + 
RERROR("socket(SYSPROTO_CONTROL) failed"); + return -1; + } + + if (ioctl(fd, CTLIOCGINFO, &ctlinfo) == -1) { + RERROR("ioctl(CTLIOCGINFO) failed: %d", errno); + close(fd); + return -1; + } + sc.sc_id = ctlinfo.ctl_id; + sc.sc_len = sizeof(sc); + sc.sc_family = AF_SYSTEM; + sc.ss_sysaddr = AF_SYS_CONTROL; + sc.sc_unit = i + 1; + if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) == 0) { + socklen_t devname_size2 = devname_size; + if (getsockopt(fd, SYSPROTO_CONTROL, UTUN_OPT_IFNAME, devname, &devname_size2)) { + RERROR("getsockopt(UTUN_OPT_IFNAME) failed"); + close(fd); + return -1; + } + + + return fd; + } + close(fd); + } + return -1; +} + +#elif defined(OS_FREEBSD) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + char buf[32]; + int tun_fd; + // First open an existing tun device + for(int i = 0; i < 256; i++) { + sprintf(buf, "/dev/tun%d", i); + tun_fd = open(buf, O_RDWR); + if (tun_fd >= 0) goto did_open; + } + tun_fd = open("/dev/tun", O_RDWR); + if (tun_fd < 0) + return tun_fd; +did_open: + if (!fdevname_r(tun_fd, devname, devname_size)) { + RERROR("Unable to get name of tun device"); + close(tun_fd); + return -1; + } + int flags = IFF_POINTOPOINT | IFF_MULTICAST; + if (ioctl(tun_fd, TUNSIFMODE, &flags) < 0) { + RERROR("ioctl(TUNSIFMODE) failed"); + close(tun_fd); + return -1; + + } + flags = 1; + if (ioctl(tun_fd, TUNSIFHEAD, &flags) < 0) { + RERROR("ioctl(TUNSIFHEAD) failed"); + close(tun_fd); + return -1; + } + return tun_fd; +} + +#elif defined(OS_LINUX) +#define TUN_PREFIX_BYTES 0 +int open_tun(char *devname, size_t devname_size) { + int fd, err; + struct ifreq ifr; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) + return fd; + + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TUN | IFF_NO_PI; + + if ((err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0) { + close(fd); + return err; + } + strcpy(devname, ifr.ifr_name); + return fd; +} +#endif + +int open_udp(int listen_on_port) { + int udp_fd = socket(AF_INET, SOCK_DGRAM, 0); + if (udp_fd < 0) return udp_fd; + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(udp_fd, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + close(udp_fd); + return -1; + } + return udp_fd; +} + +struct RouteInfo { + uint8 family; + uint8 cidr; + uint8 ip[16]; + uint8 gw[16]; +}; + +class TunsafeBackendBsd : public TunInterface, public UdpInterface { +public: + TunsafeBackendBsd(); + void RunLoop(); + void Cleanup(); + + void SetProcessor(WireguardProcessor *wg) { processor_ = wg; } + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + + // -- from UdpInterface + virtual bool Initialize(int listen_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + + + void HandleSigAlrm() { got_sig_alarm_ = true; } + void HandleExit() { exit_ = true; } + +private: + bool ReadFromUdp(); + bool ReadFromTun(); + bool WriteToUdp(); + bool WriteToTun(); + + + void SetUdpFd(int fd); + void SetTunFd(int fd); + + void AddRoute(uint32 ip, uint32 cidr, uint32 gw); + void DelRoute(const RouteInfo &cd); + bool AddRoute(int family, const void *dest, int dest_prefix, const void *gateway); + + + inline void RecomputeMaxFd() { max_fd_ = ((tun_fd_>udp_fd_) ? 
tun_fd_ : udp_fd_) + 1; } + + WireguardProcessor *processor_; + + int tun_fd_, udp_fd_, max_fd_; + bool got_sig_alarm_; + bool exit_; + + bool tun_readable_, tun_writable_; + bool udp_readable_, udp_writable_; + + Packet *tun_queue_, **tun_queue_end_; + Packet *udp_queue_, **udp_queue_end_; + + Packet *read_packet_; + + std::vector cleanup_commands_; + + fd_set readfds_, writefds_; + + +}; + +TunsafeBackendBsd::TunsafeBackendBsd() + : processor_(NULL), + tun_fd_(-1), + udp_fd_(-1), + tun_readable_(false), + tun_writable_(false), + udp_readable_(false), + udp_writable_(false), + got_sig_alarm_(false), + exit_(false), + tun_queue_(NULL), + tun_queue_end_(&tun_queue_), + udp_queue_(NULL), + udp_queue_end_(&udp_queue_), + read_packet_(NULL) { + RecomputeMaxFd(); + + FD_ZERO(&readfds_); + FD_ZERO(&writefds_); + read_packet_ = AllocPacket(); +} + +void TunsafeBackendBsd::SetUdpFd(int fd) { + udp_fd_ = fd; + RecomputeMaxFd(); + udp_writable_ = true; +} + +void TunsafeBackendBsd::SetTunFd(int fd) { + tun_fd_ = fd; + RecomputeMaxFd(); + tun_writable_ = true; +} + + +bool TunsafeBackendBsd::ReadFromUdp() { + socklen_t sin_len; + sin_len = sizeof(read_packet_->addr.sin); + int r = recvfrom(udp_fd_, read_packet_->data, kPacketCapacity, 0, + (sockaddr*)&read_packet_->addr.sin, &sin_len); + if (r >= 0) { +// printf("Read %d bytes from UDP\n", r); + read_packet_->sin_size = sin_len; + read_packet_->size = r; + if (processor_) { + processor_->HandleUdpPacket(read_packet_, false); + read_packet_ = AllocPacket(); + } + return true; + } else { + if (errno != EAGAIN) { + fprintf(stderr, "Read from UDP failed\n"); + } + udp_readable_ = false; + return false; + } +} + +bool TunsafeBackendBsd::WriteToUdp() { + assert(udp_writable_); +// RINFO("Send %d bytes to %s", (int)udp_queue_->size, inet_ntoa(udp_queue_->sin.sin_addr)); + int r = sendto(udp_fd_, udp_queue_->data, udp_queue_->size, 0, + (sockaddr*)&udp_queue_->addr.sin, sizeof(udp_queue_->addr.sin)); + if (r < 0) { + if (errno == EAGAIN) { + udp_writable_ = false; + return false; + } + perror("Write to UDP failed"); + } else { + if (r != udp_queue_->size) + perror("Write to udp incomplete!"); +// else +// RINFO("Wrote %d bytes to UDP", r); + } + Packet *next = udp_queue_->next; + FreePacket(udp_queue_); + if ((udp_queue_ = next) != NULL) return true; + udp_queue_end_ = &udp_queue_; + return false; +} + +static inline bool IsCompatibleProto(uint32 v) { + return v == AF_INET || v == AF_INET6; +} + +bool TunsafeBackendBsd::ReadFromTun() { + assert(tun_readable_); + Packet *packet = read_packet_; + int r = read(tun_fd_, packet->data - TUN_PREFIX_BYTES, kPacketCapacity + TUN_PREFIX_BYTES); + if (r >= 0) { +// printf("Read %d bytes from TUN\n", r); + packet->size = r - TUN_PREFIX_BYTES; + if (r >= TUN_PREFIX_BYTES && (!TUN_PREFIX_BYTES || IsCompatibleProto(ReadBE32(packet->data - TUN_PREFIX_BYTES))) && processor_) { +// printf("%X %X %X %X %X %X %X %X\n", +// read_packet_->data[0], read_packet_->data[1], read_packet_->data[2], read_packet_->data[3], +// read_packet_->data[4], read_packet_->data[5], read_packet_->data[6], read_packet_->data[7]); + read_packet_ = AllocPacket(); + processor_->HandleTunPacket(packet); + } + return true; + } else { + if (errno != EAGAIN) { + fprintf(stderr, "Read from tun failed\n"); + } + tun_readable_ = false; + return false; + } +} + +static uint32 GetProtoFromPacket(const uint8 *data, size_t size) { + return size < 1 || (data[0] >> 4) != 6 ? 
AF_INET : AF_INET6; +} + +bool TunsafeBackendBsd::WriteToTun() { + assert(tun_writable_); + if (TUN_PREFIX_BYTES) { + WriteBE32(tun_queue_->data - TUN_PREFIX_BYTES, GetProtoFromPacket(tun_queue_->data, tun_queue_->size)); + } + int r = write(tun_fd_, tun_queue_->data - TUN_PREFIX_BYTES, tun_queue_->size + TUN_PREFIX_BYTES); + if (r < 0) { + if (errno == EAGAIN) { + tun_writable_ = false; + return false; + } + RERROR("Write to tun failed"); + } else { + r -= TUN_PREFIX_BYTES; + if (r != tun_queue_->size) + RERROR("Write to tun incomplete!"); +// else +// RINFO("Wrote %d bytes to TUN", r); + } + Packet *next = tun_queue_->next; + FreePacket(tun_queue_); + if ((tun_queue_ = next) != NULL) return true; + tun_queue_end_ = &tun_queue_; + return false; +} + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 0xffffffff : 0xffffffff << (32 - cidr); +} + +#if defined(OS_MACOSX) || defined(OS_FREEBSD) +struct MyRouteMsg { + struct rt_msghdr hdr; + uint32 pad; + struct sockaddr_in target; + struct sockaddr_in netmask; +}; + +struct MyRouteReply { + struct rt_msghdr hdr; + uint8 buf[512]; +}; + +// Zero gets rounded up +#if defined(OS_MACOSX) +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(uint32_t) - 1)) + 1) : sizeof(uint32_t)) +#else +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(long) - 1)) + 1) : sizeof(long)) +#endif + + +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + int fd, pid, len; + + union { + MyRouteMsg rt; + MyRouteReply rep; + }; + + fd = socket(PF_ROUTE, SOCK_RAW, AF_INET); + if (fd < 0) + return false; + + memset(&rt, 0, sizeof(rt)); + + rt.hdr.rtm_type = RTM_GET; + rt.hdr.rtm_flags = RTF_UP | RTF_GATEWAY; + rt.hdr.rtm_version = RTM_VERSION; + rt.hdr.rtm_seq = 0; + rt.hdr.rtm_addrs = RTA_DST | RTA_NETMASK | RTA_IFP; + + rt.target.sin_family = AF_INET; + rt.netmask.sin_family = AF_INET; + + rt.target.sin_len = sizeof(struct sockaddr_in); + rt.netmask.sin_len = sizeof(struct sockaddr_in); + + rt.hdr.rtm_msglen = sizeof(rt); + + if (write(fd, (char*)&rt, sizeof(rt)) != sizeof(rt)) { + RERROR("PF_ROUTE write failed."); + close(fd); + return false; + } + + pid = getpid(); + do { + len = read(fd, (char *)&rep, sizeof(rep)); + if (len <= 0) { + RERROR("PF_ROUTE read failed."); + close(fd); + return false; + } + } while (rep.hdr.rtm_seq != 0 || rep.hdr.rtm_pid != pid); + close(fd); + + const struct sockaddr_dl *ifp = NULL; + const struct sockaddr_in *gw = NULL; + + uint8 *pos = rep.buf; + for(int i = 1; i && i < rep.hdr.rtm_addrs; i <<= 1) { + if (rep.hdr.rtm_addrs & i) { + if (1 > rep.buf + 512 - pos) + break; // invalid + size_t len = RTMSG_ROUNDUP(((struct sockaddr*)pos)->sa_len); + if (len > rep.buf + 512 - pos) + break; // invalid +// RINFO("rtm %d %d", i, ((struct sockaddr*)pos)->sa_len); + if (i == RTA_IFP && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_dl)) { + ifp = (struct sockaddr_dl *)pos; + } else if (i == RTA_GATEWAY && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_in)) { + gw = (struct sockaddr_in *)pos; + + } + pos += len; + } + } + + if (ifp && ifp->sdl_nlen && ifp->sdl_nlen < iface_size) { + iface[ifp->sdl_nlen] = 0; + memcpy(iface, ifp->sdl_data, ifp->sdl_nlen); + if (gw && gw->sin_family == AF_INET) { + *gw_addr = ReadBE32(&gw->sin_addr); + return true; + } + + } +// RINFO("Read %d %d %d", len, rep.hdr.rtm_addrs, (int)sizeof(struct rt_msghdr )); + return false; +} +#endif // defined(OS_MACOSX) || defined(OS_FREEBSD) + +#if defined(OS_LINUX) +static bool GetDefaultRoute(char *iface, 
size_t iface_size, uint32 *gw_addr) { + return false; +} +#endif // defined(OS_LINUX) + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. + size_t n = (ipv6_cidr + 7) >> 3; + memset(&default_route_v6[n], 0, 16 - n); + if (n == 0) + return; + // adjust the final byte + default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7)); + // set the very last byte to something + default_route_v6[15] |= 1; + // ensure it doesn't collide + if (memcmp(default_route_v6, ipv6_address, 16) == 0) + default_route_v6[15] ^= 3; +} + +void TunsafeBackendBsd::AddRoute(uint32 ip, uint32 cidr, uint32 gw) { + uint32 ip_be, gw_be; + WriteBE32(&ip_be, ip); + WriteBE32(&gw_be, gw); + AddRoute(AF_INET, &ip_be, cidr, &gw_be); +} + +static void AddOrRemoveRoute(const RouteInfo &cd, bool remove) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + print_ip_prefix(buf1, cd.family, cd.ip, cd.cidr); + print_ip_prefix(buf2, cd.family, cd.gw, -1); + +#if defined(OS_LINUX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route %s -net %s gw %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route %s -net inet6 %s gw %s", cmd, buf1, buf2); + } +#elif defined(OS_MACOSX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route -q %s %s %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route -q %s -inet6 %s %s", cmd, buf1, buf2); + } +#endif +} + +bool TunsafeBackendBsd::AddRoute(int family, const void *dest, int dest_prefix, const void *gateway) { + RouteInfo c; + + c.family = family; + size_t len = (family == AF_INET) ? 
4 : 16; + memcpy(c.ip, dest, len); + memcpy(c.gw, gateway, len); + c.cidr = dest_prefix; + cleanup_commands_.push_back(c); + AddOrRemoveRoute(c, false); + return true; +} + +void TunsafeBackendBsd::DelRoute(const RouteInfo &cd) { + AddOrRemoveRoute(cd, true); +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + +// Called to initialize tun +bool TunsafeBackendBsd::Initialize(const TunConfig &&config, TunConfigOut *out) override { + char devname[12]; + char def_iface[12]; + char buf[kSizeOfAddress]; + + Cleanup(); + + out->enable_neighbor_discovery_spoofing = false; + + int tun_fd = open_tun(devname, sizeof(devname)); + if (tun_fd < 0) { RERROR("Error opening tun device"); return false; } + + fcntl(tun_fd, F_SETFD, FD_CLOEXEC); + fcntl(tun_fd, F_SETFL, O_NONBLOCK); + + SetTunFd(tun_fd); + + uint32 netmask = CidrToNetmaskV4(config.cidr); + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + + RunCommand("/sbin/ifconfig %s %A mtu %d %A netmask %A up", devname, config.ip, config.mtu, config.ip, netmask); + AddRoute(config.ip & netmask, config.cidr, config.ip); + + if (config.use_ipv4_default_route) { + if (config.default_route_endpoint_v4) { + uint32 gw; + if (!GetDefaultRoute(def_iface, sizeof(def_iface), &gw)) { + RERROR("Unable to determine default interface."); + return false; + } + AddRoute(config.default_route_endpoint_v4, 32, gw); + + } + AddRoute(0x00000000, 1, default_route_v4); + AddRoute(0x80000000, 1, default_route_v4); + } + + uint8 default_route_v6[16]; + + if (config.ipv6_cidr) { + static const uint8 matchall_1_route[17] = {0x80, 0, 0, 0}; + + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + RunCommand("/sbin/ifconfig %s inet6 %s", devname, print_ip_prefix(buf, AF_INET6, config.ipv6_address, config.ipv6_cidr)); + + if (config.use_ipv6_default_route) { + if (IsIpv6AddressSet(config.default_route_endpoint_v6)) { + RERROR("default_route_endpoint_v6 not supported"); + } + AddRoute(AF_INET6, matchall_1_route + 1, 1, default_route_v6); + AddRoute(AF_INET6, matchall_1_route + 0, 1, default_route_v6); + } + } + + // Add all the extra routes + for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) { + if (it->size == 32) { + AddRoute(ReadBE32(it->addr), it->cidr, default_route_v4); + } else if (it->size == 128 && config.ipv6_cidr) { + AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6); + } + } + + return true; +} + +void TunsafeBackendBsd::Cleanup() { + for(auto it = cleanup_commands_.begin(); it != cleanup_commands_.end(); ++it) + DelRoute(*it); + cleanup_commands_.clear(); +} + +void TunsafeBackendBsd::WriteTunPacket(Packet *packet) override { + assert(tun_fd_ >= 0); + Packet *queue_is_used = tun_queue_; + *tun_queue_end_ = packet; + tun_queue_end_ = &packet->next; + packet->next = NULL; + if (!queue_is_used) + WriteToTun(); +} + +// Called to initialize udp +bool TunsafeBackendBsd::Initialize(int listen_port) override { + int udp_fd = open_udp(listen_port); + if (udp_fd < 0) { RERROR("Error opening udp"); return false; } + fcntl(udp_fd, F_SETFD, FD_CLOEXEC); + fcntl(udp_fd, F_SETFL, O_NONBLOCK); + SetUdpFd(udp_fd); + return true; +} + +void TunsafeBackendBsd::WriteUdpPacket(Packet *packet) override { + assert(udp_fd_ >= 0); + Packet *queue_is_used = udp_queue_; + *udp_queue_end_ = packet; + udp_queue_end_ = &packet->next; + packet->next = NULL; + if (!queue_is_used) + WriteToUdp(); +} + +static TunsafeBackendBsd *g_socket_loop; + +static 
void SigAlrm(int sig) { + if (g_socket_loop) + g_socket_loop->HandleSigAlrm(); +} + +static bool did_ctrlc; + +void SigInt(int sig) { + if (did_ctrlc) + exit(1); + did_ctrlc = true; + write(1, "Ctrl-C detected. Exiting. Press again to force quit.\n", sizeof("Ctrl-C detected. Exiting. Press again to force quit.\n")-1); + + if (g_socket_loop) + g_socket_loop->HandleExit(); +} + +void TunsafeBackendBsd::RunLoop() { + int free_packet_interval = 10; + + assert(!g_socket_loop); + assert(processor_); + + g_socket_loop = this; + // We want an alarm signal every second. + { + struct sigaction act = {0}; + act.sa_handler = SigAlrm; + if (sigaction(SIGALRM, &act, NULL) < 0) { + RERROR("Unable to install SIGALRM handler."); + return; + } + } + + { + struct sigaction act = {0}; + act.sa_handler = SigInt; + if (sigaction(SIGINT, &act, NULL) < 0) { + RERROR("Unable to install SIGINT handler."); + return; + } + } + +#if defined(OS_LINUX) || defined(OS_FREEBSD) + { + struct itimerspec tv = {0}; + struct sigevent sev; + timer_t timer_id; + + tv.it_interval.tv_sec = 1; + tv.it_value.tv_sec = 1; + + sev.sigev_notify = SIGEV_SIGNAL; + sev.sigev_signo = SIGALRM; + sev.sigev_value.sival_ptr = NULL; + + if (timer_create(CLOCK_MONOTONIC, &sev, &timer_id) < 0) { + RERROR("timer_create failed"); + return; + } + + if (timer_settime(timer_id, 0, &tv, NULL) < 0) { + RERROR("timer_settime failed"); + return; + } + } +#elif defined(OS_MACOSX) + ualarm(1000000, 1000000); +#endif + + while (!exit_) { + int n = -1; + +// printf("entering sleep %d,%d,%d %d\n", udp_fd_, tun_fd_, max_fd_, FD_ISSET(tun_fd_, &readfds_)); + // Wait for sockets to become usable + if (!got_sig_alarm_) { + + if (tun_fd_ >= 0) { + FD_SET(tun_fd_, &readfds_); + if (tun_writable_) FD_CLR(tun_fd_, &writefds_); else FD_SET(tun_fd_, &writefds_); + } + + if (udp_fd_ >= 0) { + FD_SET(udp_fd_, &readfds_); + if (udp_writable_) FD_CLR(udp_fd_, &writefds_); else FD_SET(udp_fd_, &writefds_); + } + + n = select(max_fd_, &readfds_, &writefds_, NULL, NULL); + if (n == -1) { + if (errno != EINTR) { + fprintf(stderr, "select failed\n"); + break; + } + } + } + // This is not fully signal safe. 
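+    // (If SIGALRM is delivered after the got_sig_alarm_ check above but
+    //  before select() enters, select() blocks without EINTR and the tick is
+    //  only handled on the next wakeup. A fully robust loop would use the
+    //  classic self-pipe trick -- hypothetical sketch, assuming an
+    //  int pipe_fds_[2] member set up with pipe():
+    //    in SigAlrm():    write(pipe_fds_[1], "x", 1);
+    //    before select(): FD_SET(pipe_fds_[0], &readfds_);
+    //  so a pending signal always wakes the wait.)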
+ if (got_sig_alarm_) { + got_sig_alarm_ = false; + processor_->SecondLoop(); + if (free_packet_interval == 0) { + FreePackets(); + free_packet_interval = 10; + } + free_packet_interval--; + } + if (n < 0) continue; + + if (tun_fd_ >= 0) { + tun_readable_ = (FD_ISSET(tun_fd_, &readfds_) != 0); + tun_writable_ |= (FD_ISSET(tun_fd_, &writefds_) != 0); + } + if (udp_fd_ >= 0) { + udp_readable_ = (FD_ISSET(udp_fd_, &readfds_) != 0); + udp_writable_ |= (FD_ISSET(udp_fd_, &writefds_) != 0); + } + + for(int loop = 0; loop < 256; loop++) { + bool more_work = false; + if (tun_queue_ != NULL && tun_writable_) more_work |= WriteToTun(); + if (udp_queue_ != NULL && udp_writable_) more_work |= WriteToUdp(); + if (tun_readable_) more_work |= ReadFromTun(); + if (udp_readable_) more_work |= ReadFromUdp(); + if (!more_work) + break; + } + } + + g_socket_loop = NULL; +} + +void InitCpuFeatures(); +void Benchmark(); + +int main(int argc, char **argv) { + bool exit_flag = false; + + InitCpuFeatures(); + + if (argc == 2 && strcmp(argv[1], "--benchmark") == 0) { + Benchmark(); + return 0; + } + + if (argc < 2) { + fprintf(stderr, "Syntax: tunsafe file.conf\n"); + return 1; + } + +#if defined(OS_MACOSX) + InitOsxGetMilliseconds(); +#endif + + TunsafeBackendBsd socket_loop; + WireguardProcessor wg(&socket_loop, &socket_loop, NULL); + socket_loop.SetProcessor(&wg); + + if (!ParseWireGuardConfigFile(&wg, argv[1], &exit_flag)) return 1; + if (!wg.Start()) return 1; + + socket_loop.RunLoop(); + socket_loop.Cleanup(); + return 0; +} diff --git a/network_bsd_mt.cpp b/network_bsd_mt.cpp new file mode 100644 index 0000000..3f1a043 --- /dev/null +++ b/network_bsd_mt.cpp @@ -0,0 +1,1251 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "netapi.h" +#include "wireguard.h" +#include "wireguard_config.h" +#include "tunsafe_endian.h" +#include "tunsafe_config.h" +#include "util.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include + +#if defined(OS_MACOSX) +#include +#include +#include +#include +#include +#include +#elif defined(OS_FREEBSD) +#include +#include +#elif defined(OS_LINUX) +#include +#include +#include +#endif + + + +static Packet *freelist; + +void SetThreadName(const char *name) { +#if defined(OS_LINUX) + prctl(PR_SET_NAME, name, 0, 0, 0); +#endif // defined(OS_LINUX) +} + +void FreePacket(Packet *packet) { + free(packet); +// packet->next = freelist; +// freelist = packet; +} + +Packet *AllocPacket() { + Packet *p = NULL;// freelist; + if (p) { + freelist = p->next; + } else { + p = (Packet*)malloc(kPacketAllocSize); + if (p == NULL) { + RERROR("Allocation failure"); + abort(); + } + } + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + p->size = 0; + return p; +} + +void FreePackets() { + Packet *p; + while ( (p = freelist ) != NULL) { + freelist = p->next; + free(p); + } +} + +#if defined(OS_MACOSX) || defined(OS_FREEBSD) +struct MyRouteMsg { + struct rt_msghdr hdr; + uint32 pad; + struct sockaddr_in target; + struct sockaddr_in netmask; +}; + +struct MyRouteReply { + struct rt_msghdr hdr; + uint8 buf[512]; +}; + +// Zero gets rounded up +#if defined(OS_MACOSX) +#define RTMSG_ROUNDUP(a) ((a) ? ((((a) - 1) | (sizeof(uint32_t) - 1)) + 1) : sizeof(uint32_t)) +#else +#define RTMSG_ROUNDUP(a) ((a) ? 
((((a) - 1) | (sizeof(long) - 1)) + 1) : sizeof(long)) +#endif + + +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + int fd, pid, len; + + union { + MyRouteMsg rt; + MyRouteReply rep; + }; + + fd = socket(PF_ROUTE, SOCK_RAW, AF_INET); + if (fd < 0) + return false; + + memset(&rt, 0, sizeof(rt)); + + rt.hdr.rtm_type = RTM_GET; + rt.hdr.rtm_flags = RTF_UP | RTF_GATEWAY; + rt.hdr.rtm_version = RTM_VERSION; + rt.hdr.rtm_seq = 0; + rt.hdr.rtm_addrs = RTA_DST | RTA_NETMASK | RTA_IFP; + + rt.target.sin_family = AF_INET; + rt.netmask.sin_family = AF_INET; + + rt.target.sin_len = sizeof(struct sockaddr_in); + rt.netmask.sin_len = sizeof(struct sockaddr_in); + + rt.hdr.rtm_msglen = sizeof(rt); + + if (write(fd, (char*)&rt, sizeof(rt)) != sizeof(rt)) { + RERROR("PF_ROUTE write failed."); + close(fd); + return false; + } + + pid = getpid(); + do { + len = read(fd, (char *)&rep, sizeof(rep)); + if (len <= 0) { + RERROR("PF_ROUTE read failed."); + close(fd); + return false; + } + } while (rep.hdr.rtm_seq != 0 || rep.hdr.rtm_pid != pid); + close(fd); + + const struct sockaddr_dl *ifp = NULL; + const struct sockaddr_in *gw = NULL; + + uint8 *pos = rep.buf; + for (int i = 1; i && i < rep.hdr.rtm_addrs; i <<= 1) { + if (rep.hdr.rtm_addrs & i) { + if (1 > rep.buf + 512 - pos) + break; // invalid + size_t len = RTMSG_ROUNDUP(((struct sockaddr*)pos)->sa_len); + if (len > rep.buf + 512 - pos) + break; // invalid + // RINFO("rtm %d %d", i, ((struct sockaddr*)pos)->sa_len); + if (i == RTA_IFP && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_dl)) { + ifp = (struct sockaddr_dl *)pos; + } else if (i == RTA_GATEWAY && ((struct sockaddr*)pos)->sa_len == sizeof(struct sockaddr_in)) { + gw = (struct sockaddr_in *)pos; + + } + pos += len; + } + } + + if (ifp && ifp->sdl_nlen && ifp->sdl_nlen < iface_size) { + iface[ifp->sdl_nlen] = 0; + memcpy(iface, ifp->sdl_data, ifp->sdl_nlen); + if (gw && gw->sin_family == AF_INET) { + *gw_addr = ReadBE32(&gw->sin_addr); + return true; + } + + } + // RINFO("Read %d %d %d", len, rep.hdr.rtm_addrs, (int)sizeof(struct rt_msghdr )); + return false; +} +#endif // defined(OS_MACOSX) || defined(OS_FREEBSD) + +#if defined(OS_LINUX) +static bool GetDefaultRoute(char *iface, size_t iface_size, uint32 *gw_addr) { + return false; +} +#endif // defined(OS_LINUX) + + +#if defined(OS_MACOSX) +static mach_timebase_info_data_t timebase = { 0, 0 }; +static uint64_t initclock; + +void InitOsxGetMilliseconds() { + if (mach_timebase_info(&timebase) != 0) + abort(); + initclock = mach_absolute_time(); + + timebase.denom *= 1000000; +} + +uint64 OsGetMilliseconds() +{ + uint64_t clock = mach_absolute_time() - initclock; + return clock * (uint64_t)timebase.numer / (uint64_t)timebase.denom; +} + +#else // defined(OS_MACOSX) +uint64 OsGetMilliseconds() { + struct timespec ts; + if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { + //error + fprintf(stderr, "clock_gettime failed\n"); + exit(1); + } + return (uint64)ts.tv_sec * 1000 + (ts.tv_nsec / 1000000); +} +#endif + +void OsGetTimestampTAI64N(uint8 dst[12]) { + struct timeval tv; + gettimeofday(&tv, NULL); + uint64 secs_since_epoch = tv.tv_sec + 0x400000000000000a; + uint32 nanos = tv.tv_usec * 1000; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + +void OsGetRandomBytes(uint8 *data, size_t data_size) { + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void 
OsInterruptibleSleep(int millis) { + usleep((useconds_t)millis * 1000); +} + +#if defined(OS_MACOSX) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + struct sockaddr_ctl sc; + struct ctl_info ctlinfo = {0}; + int fd; + + memcpy(ctlinfo.ctl_name, UTUN_CONTROL_NAME, sizeof(UTUN_CONTROL_NAME)); + + for(int i = 0; i < 256; i++) { + fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL); + if (fd < 0) { + RERROR("socket(SYSPROTO_CONTROL) failed"); + return -1; + } + + if (ioctl(fd, CTLIOCGINFO, &ctlinfo) == -1) { + RERROR("ioctl(CTLIOCGINFO) failed: %d", errno); + close(fd); + return -1; + } + sc.sc_id = ctlinfo.ctl_id; + sc.sc_len = sizeof(sc); + sc.sc_family = AF_SYSTEM; + sc.ss_sysaddr = AF_SYS_CONTROL; + sc.sc_unit = i + 1; + if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) == 0) { + socklen_t devname_size2 = devname_size; + if (getsockopt(fd, SYSPROTO_CONTROL, UTUN_OPT_IFNAME, devname, &devname_size2)) { + RERROR("getsockopt(UTUN_OPT_IFNAME) failed"); + close(fd); + return -1; + } + + + return fd; + } + close(fd); + } + return -1; +} + +#elif defined(OS_FREEBSD) +#define TUN_PREFIX_BYTES 4 +int open_tun(char *devname, size_t devname_size) { + char buf[32]; + int tun_fd; + // First open an existing tun device + for(int i = 0; i < 256; i++) { + sprintf(buf, "/dev/tun%d", i); + tun_fd = open(buf, O_RDWR); + if (tun_fd >= 0) goto did_open; + } + tun_fd = open("/dev/tun", O_RDWR); + if (tun_fd < 0) + return tun_fd; +did_open: + if (!fdevname_r(tun_fd, devname, devname_size)) { + RERROR("Unable to get name of tun device"); + close(tun_fd); + return -1; + } + int flags = IFF_POINTOPOINT | IFF_MULTICAST; + if (ioctl(tun_fd, TUNSIFMODE, &flags) < 0) { + RERROR("ioctl(TUNSIFMODE) failed"); + close(tun_fd); + return -1; + + } + flags = 1; + if (ioctl(tun_fd, TUNSIFHEAD, &flags) < 0) { + RERROR("ioctl(TUNSIFHEAD) failed"); + close(tun_fd); + return -1; + } + return tun_fd; +} + +#elif defined(OS_LINUX) +#define TUN_PREFIX_BYTES 0 +int open_tun(char *devname, size_t devname_size) { + int fd, err; + struct ifreq ifr; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) + return fd; + + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TUN | IFF_NO_PI; + + if ((err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0) { + close(fd); + return err; + } + strcpy(devname, ifr.ifr_name); + return fd; +} +#endif + +int open_udp(int listen_on_port) { + int udp_fd = socket(AF_INET, SOCK_DGRAM, 0); + if (udp_fd < 0) return udp_fd; + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(udp_fd, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + close(udp_fd); + return -1; + } + return udp_fd; +} + +class WorkerLoop { +public: + WorkerLoop(); + ~WorkerLoop(); + + bool Initialize(WireguardProcessor *processor); + + void *ThreadMain(); + void StartThread(); + + void StopThread(); + + void NotifyStop(); + + enum { + TARGET_UDP, TARGET_TUN + }; + + void HandleUdpPacket(Packet *packet) { + HandlePacket(packet, TARGET_UDP); + } + void HandleTunPacket(Packet *packet) { + HandlePacket(packet, TARGET_TUN); + } + + void HandleSigAlrm() { + got_sig_alarm_ = true; + } + +private: + static void *ThreadMainStatic(void *x); + void HandlePacket(Packet *packet, int target); + + WireguardProcessor *processor_; + pthread_t tid_; + Packet *queue_, **queue_end_; + bool shutting_down_; + bool got_sig_alarm_; + + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +// Handles the threads that read/write to the udp socket. 
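+// Both UdpLoop and TunLoop below follow the same pattern: a reader thread
+// blocks in the kernel (recvfrom/read) and hands each packet to the
+// WorkerLoop, while a writer thread drains a mutex-protected linked list.
+// Producers append under the lock and signal the condition variable only on
+// the empty -> non-empty transition, so a writer that is already draining
+// the list is not signalled once per packet.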
+class UdpLoop { +public: + UdpLoop(); + ~UdpLoop(); + + bool Initialize(int listen_port, WorkerLoop *worker); + void Start(); + void Stop(); + + void WriteUdpPacket(Packet *packet); +private: + static void *ReaderMainStatic(void *x); + static void *WriterMainStatic(void *x); + void *ReaderMain(); + void *WriterMain(); + + int fd_; + WorkerLoop *worker_; + pthread_t read_tid_, write_tid_; + + Packet *queue_, **queue_end_; + + bool shutting_down_; + + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +// Handles the threads that read/write to the tun socket. +class TunLoop { +public: + TunLoop(); + ~TunLoop(); + + bool Initialize(WorkerLoop *worker); + void Start(); + void Stop(); + + void WriteTunPacket(Packet *packet); + + char *devname() { return devname_; } +private: + static void *ReaderMainStatic(void *x); + static void *WriterMainStatic(void *x); + void *ReaderMain(); + void *WriterMain(); + + int fd_; + bool shutting_down_; + char devname_[16]; + + WorkerLoop *worker_; + pthread_t read_tid_, write_tid_; + Packet *queue_, **queue_end_; + pthread_mutex_t lock_; + pthread_cond_t cond_; +}; + +WorkerLoop::WorkerLoop() { + queue_end_ = &queue_; + queue_ = NULL; + tid_ = 0; + shutting_down_ = false; + got_sig_alarm_ = false; + processor_ = NULL; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +WorkerLoop::~WorkerLoop() { + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool WorkerLoop::Initialize(WireguardProcessor *processor) { + processor_ = processor; + return true; +} + +void WorkerLoop::StartThread() { + assert(tid_ == 0); + pthread_create(&tid_, NULL, &ThreadMainStatic, this); +} + +void WorkerLoop::StopThread() { + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + + if (tid_) { + void *x; + pthread_join(tid_, &x); + tid_ = 0; + } +} + + +// This is called from signal handler so cannot block etc. 
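+// Locking lock_ here could deadlock, since the signal may have interrupted
+// a thread that already holds it; so only the flag is written, and the
+// worker notices it on its next pass through the loop. (Strictly, a
+// volatile sig_atomic_t or std::atomic<bool> flag would make the
+// cross-thread store well-defined; a plain bool is used here.)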
+void WorkerLoop::NotifyStop() { + shutting_down_ = true; +} + +void WorkerLoop::HandlePacket(Packet *packet, int target) { +// RINFO("WorkerLoop::HandlePacket"); + packet->post_target = target; + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; + *queue_end_ = packet; + queue_end_ = &packet->next; + packet->next = NULL; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +void *WorkerLoop::ThreadMainStatic(void *x) { + return ((WorkerLoop*)x)->ThreadMain(); +} + +void *WorkerLoop::ThreadMain() { + Packet *packet_queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + for (;;) { + while (got_sig_alarm_) { + got_sig_alarm_ = false; + pthread_mutex_unlock(&lock_); + processor_->SecondLoop(); + pthread_mutex_lock(&lock_); + } + if (shutting_down_ || queue_ != NULL) + break; + pthread_cond_wait(&cond_, &lock_); + } + if (shutting_down_) + break; + packet_queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + + pthread_mutex_unlock(&lock_); + // And send all items in the list + while (packet_queue != NULL) { + Packet *next = packet_queue->next; + if (packet_queue->post_target == TARGET_TUN) { + processor_->HandleTunPacket(packet_queue); + } else { + processor_->HandleUdpPacket(packet_queue, false); + } + packet_queue = next; + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + + + +UdpLoop::UdpLoop() { + fd_ = -1; + read_tid_ = 0; + write_tid_ = 0; + shutting_down_ = false; + worker_ = NULL; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +UdpLoop::~UdpLoop() { + if (fd_ != -1) + close(fd_); + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool UdpLoop::Initialize(int listen_port, WorkerLoop *worker) { + int fd = open_udp(listen_port); + if (fd < 0) { RERROR("Error opening udp"); return false; } + fcntl(fd, F_SETFD, FD_CLOEXEC); + fd_ = fd; + worker_ = worker; + return true; +} + +void UdpLoop::Start() { + pthread_create(&read_tid_, NULL, &ReaderMainStatic, this); + pthread_create(&write_tid_, NULL, &WriterMainStatic, this); +} + +void UdpLoop::Stop() { + void *x; + + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + + pthread_kill(read_tid_, SIGUSR1); + pthread_kill(write_tid_, SIGUSR1); + + pthread_join(read_tid_, &x); + pthread_join(write_tid_, &x); + + read_tid_ = 0; + write_tid_ = 0; +} + +void *UdpLoop::ReaderMainStatic(void *x) { + SetThreadName("tunsafe-ur"); + return ((UdpLoop*)x)->ReaderMain(); +} + +void *UdpLoop::WriterMainStatic(void *x) { + SetThreadName("tunsafe-uw"); + return ((UdpLoop*)x)->WriterMain(); +} + +void *UdpLoop::ReaderMain() { + Packet *packet; + socklen_t sin_len; + int r; + + while (!shutting_down_) { + packet = AllocPacket(); + sin_len = sizeof(packet->addr.sin); + r = recvfrom(fd_, packet->data, kPacketCapacity, 0, (sockaddr*)&packet->addr.sin, &sin_len); + if (r < 0) { + FreePacket(packet); + if (shutting_down_) + break; + + RERROR("ReadMain failed %d", errno); + + } else { + packet->size = r; + worker_->HandleUdpPacket(packet); + } + } + return NULL; +} + +void *UdpLoop::WriterMain() { + Packet *queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + while (!shutting_down_ && queue_ == NULL) + pthread_cond_wait(&cond_, &lock_); + if (shutting_down_) + break; + queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + 
pthread_mutex_unlock(&lock_); + // And send all items in the list + while (queue != NULL) { + int r = sendto(fd_, queue->data, queue->size, 0, + (sockaddr*)&queue->addr.sin, sizeof(queue->addr.sin)); + if (r != queue->size) { + if (errno != ENOBUFS) + RERROR("WriterMain failed: %d", errno); + } else { +// RINFO("WRote udp packet!"); + } + Packet *to_free = queue; + queue = queue->next; + FreePacket(to_free); + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + +void UdpLoop::WriteUdpPacket(Packet *packet) { +// RINFO("write udp packet to queue!"); + packet->next = NULL; + + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; + *queue_end_ = packet; + queue_end_ = &packet->next; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +TunLoop::TunLoop() { + fd_ = -1; + shutting_down_ = false; + worker_ = NULL; + read_tid_ = 0; + write_tid_ = 0; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_init(&lock_, NULL); + pthread_cond_init(&cond_, NULL); +} + +TunLoop::~TunLoop() { + if (fd_ != -1) + close(fd_); + pthread_mutex_destroy(&lock_); + pthread_cond_destroy(&cond_); +} + +bool TunLoop::Initialize(WorkerLoop *worker) { + int fd = open_tun(devname_, sizeof(devname_)); + if (fd < 0) { RERROR("Error opening tun"); return false; } + fcntl(fd, F_SETFD, FD_CLOEXEC); + fd_ = fd; + worker_ = worker; + return true; +} + +void TunLoop::Start() { + pthread_create(&read_tid_, NULL, &ReaderMainStatic, this); + pthread_create(&write_tid_, NULL, &WriterMainStatic, this); +} + +void TunLoop::Stop() { + void *x; + + pthread_mutex_lock(&lock_); + shutting_down_ = true; + pthread_mutex_unlock(&lock_); + + pthread_kill(read_tid_, SIGUSR1); + pthread_kill(write_tid_, SIGUSR1); + pthread_join(read_tid_, &x); + pthread_join(write_tid_, &x); + + read_tid_ = 0; + write_tid_ = 0; +} + +void *TunLoop::ReaderMainStatic(void *x) { + SetThreadName("tunsafe-tr"); + return ((TunLoop*)x)->ReaderMain(); +} + +void *TunLoop::WriterMainStatic(void *x) { + SetThreadName("tunsafe-tw"); + return ((TunLoop*)x)->WriterMain(); +} + +void *TunLoop::ReaderMain() { + Packet *packet = AllocPacket(); + while (!shutting_down_) { + int r = read(fd_, packet->data - TUN_PREFIX_BYTES, kPacketCapacity + TUN_PREFIX_BYTES); + if (r >= 0) { + packet->size = r - TUN_PREFIX_BYTES; + if (r >= TUN_PREFIX_BYTES && (!TUN_PREFIX_BYTES || ReadBE32(packet->data - TUN_PREFIX_BYTES) == AF_INET)) { + worker_->HandleTunPacket(packet); + packet = AllocPacket(); + } + } + } + return NULL; +} + +void *TunLoop::WriterMain() { + Packet *queue; + + pthread_mutex_lock(&lock_); + for (;;) { + // Grab the whole list + while (!shutting_down_ && queue_ == NULL) { + pthread_cond_wait(&cond_, &lock_); + } + if (shutting_down_) + break; + queue = queue_; + queue_ = NULL; + queue_end_ = &queue_; + pthread_mutex_unlock(&lock_); + // And send all items in the list + while (queue != NULL) { + if (TUN_PREFIX_BYTES) + WriteBE32(queue->data - TUN_PREFIX_BYTES, AF_INET); + int r = write(fd_, queue->data - TUN_PREFIX_BYTES, queue->size + TUN_PREFIX_BYTES); + if (r != queue->size + TUN_PREFIX_BYTES) { + RERROR("WriterMain failed: %d", errno); + break; + } + Packet *to_free = queue; + queue = queue->next; + FreePacket(to_free); + } + pthread_mutex_lock(&lock_); + } + pthread_mutex_unlock(&lock_); + return NULL; +} + +void TunLoop::WriteTunPacket(Packet *packet) { + packet->next = NULL; + + pthread_mutex_lock(&lock_); + Packet *old_queue = queue_; 
+ *queue_end_ = packet; + queue_end_ = &packet->next; + if (old_queue == NULL) { + pthread_mutex_unlock(&lock_); + pthread_cond_signal(&cond_); + } else { + pthread_mutex_unlock(&lock_); + } +} + +struct RouteInfo { + uint8 family; + uint8 cidr; + uint8 ip[16]; + uint8 gw[16]; +}; + +class TunsafeBackendBsd : public TunInterface, public UdpInterface { +public: + TunsafeBackendBsd(); + ~TunsafeBackendBsd(); + + void RunLoop(); + void CleanupRoutes(); + + void SetProcessor(WireguardProcessor *wg) { processor_ = wg; } + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + + // -- from UdpInterface + virtual bool Initialize(int listen_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + + void HandleSigAlrm() { worker_.HandleSigAlrm(); } + void HandleExit() { worker_.NotifyStop(); } + +private: + void AddRoute(uint32 ip, uint32 cidr, uint32 gw); + void DelRoute(const RouteInfo &cd); + bool AddRoute(int family, const void *dest, int dest_prefix, const void *gateway); + bool RunPrePostCommand(const std::vector &vec); + + + WireguardProcessor *processor_; + + bool got_sig_alarm_; + bool exit_; + + uint32 added_route_addr_, added_route_gw_; + + WorkerLoop worker_; + UdpLoop udp_; + TunLoop tun_; + + std::vector cleanup_commands_; + + std::vector pre_down_, post_down_; +}; + +TunsafeBackendBsd::TunsafeBackendBsd() + : processor_(NULL), + got_sig_alarm_(false), + exit_(false) { +} + +TunsafeBackendBsd::~TunsafeBackendBsd() { +} + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 0xffffffff : 0xffffffff << (32 - cidr); +} + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. + size_t n = (ipv6_cidr + 7) >> 3; + memset(&default_route_v6[n], 0, 16 - n); + if (n == 0) + return; + // adjust the final byte + default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7)); + // set the very last byte to something + default_route_v6[15] |= 1; + // ensure it doesn't collide + if (memcmp(default_route_v6, ipv6_address, 16) == 0) + default_route_v6[15] ^= 3; +} + +void TunsafeBackendBsd::AddRoute(uint32 ip, uint32 cidr, uint32 gw) { + uint32 ip_be, gw_be; + WriteBE32(&ip_be, ip); + WriteBE32(&gw_be, gw); + AddRoute(AF_INET, &ip_be, cidr, &gw_be); +} + +static void AddOrRemoveRoute(const RouteInfo &cd, bool remove) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + print_ip_prefix(buf1, cd.family, cd.ip, cd.cidr); + print_ip_prefix(buf2, cd.family, cd.gw, -1); + +#if defined(OS_LINUX) + const char *cmd = remove ? "delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route %s -net %s gw %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route %s -net inet6 %s gw %s", cmd, buf1, buf2); + } +#elif defined(OS_MACOSX) + const char *cmd = remove ? 
"delete" : "add"; + if (cd.family == AF_INET) { + RunCommand("/sbin/route -q %s %s %s", cmd, buf1, buf2); + } else { + RunCommand("/sbin/route -q %s -inet6 %s %s", cmd, buf1, buf2); + } +#endif +} + +bool TunsafeBackendBsd::AddRoute(int family, const void *dest, int dest_prefix, const void *gateway) { + RouteInfo c; + + c.family = family; + size_t len = (family == AF_INET) ? 4 : 16; + memcpy(c.ip, dest, len); + memcpy(c.gw, gateway, len); + c.cidr = dest_prefix; + cleanup_commands_.push_back(c); + AddOrRemoveRoute(c, false); + return true; +} + +void TunsafeBackendBsd::DelRoute(const RouteInfo &cd) { + AddOrRemoveRoute(cd, true); +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + +// Called to initialize tun +bool TunsafeBackendBsd::Initialize(const TunConfig &&config, TunConfigOut *out) override { + char def_iface[12]; + + if (!RunPrePostCommand(config.pre_post_commands.pre_up)) { + RERROR("Pre command failed!"); + return false; + } + + out->enable_neighbor_discovery_spoofing = false; + + if (!tun_.Initialize(&worker_)) + return false; + + if (config.ipv6_cidr) + RERROR("IPv6 not supported"); + + uint32 netmask = CidrToNetmaskV4(config.cidr); + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + + RunCommand("/sbin/ifconfig %s %A mtu %d %A netmask %A up", tun_.devname(), config.ip, config.mtu, config.ip, netmask); + AddRoute(config.ip & netmask, config.cidr, config.ip); + + if (config.use_ipv4_default_route) { + if (config.default_route_endpoint_v4) { + uint32 gw; + if (!GetDefaultRoute(def_iface, sizeof(def_iface), &gw)) { + RERROR("Unable to determine default interface."); + return false; + } + AddRoute(config.default_route_endpoint_v4, 32, gw); + + } + AddRoute(0x00000000, 1, default_route_v4); + AddRoute(0x80000000, 1, default_route_v4); + } + + uint8 default_route_v6[16]; + + if (config.ipv6_cidr) { + static const uint8 matchall_1_route[17] = {0x80, 0, 0, 0}; + char buf[kSizeOfAddress]; + + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + RunCommand("/sbin/ifconfig %s inet6 %s", tun_.devname(), print_ip_prefix(buf, AF_INET6, config.ipv6_address, config.ipv6_cidr)); + + if (config.use_ipv6_default_route) { + if (IsIpv6AddressSet(config.default_route_endpoint_v6)) { + RERROR("default_route_endpoint_v6 not supported"); + } + AddRoute(AF_INET6, matchall_1_route + 1, 1, default_route_v6); + AddRoute(AF_INET6, matchall_1_route + 0, 1, default_route_v6); + } + } + + // Add all the extra routes + for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) { + if (it->size == 32) { + AddRoute(ReadBE32(it->addr), it->cidr, default_route_v4); + } else if (it->size == 128 && config.ipv6_cidr) { + AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6); + } + } + + RunPrePostCommand(config.pre_post_commands.post_up); + + pre_down_ = std::move(config.pre_post_commands.pre_down); + post_down_ = std::move(config.pre_post_commands.post_down); + + return true; +} + +void TunsafeBackendBsd::CleanupRoutes() { + RunPrePostCommand(pre_down_); + + for(auto it = cleanup_commands_.begin(); it != cleanup_commands_.end(); ++it) + DelRoute(*it); + cleanup_commands_.clear(); + + RunPrePostCommand(post_down_); + + pre_down_.clear(); + post_down_.clear(); +} + +static bool RunOneCommand(const std::string &cmd) { + RINFO("Run: %s", cmd.c_str()); + int exit_code = system(cmd.c_str()); + if (exit_code) { + RERROR("Run Failed (%d) : %s", exit_code, cmd.c_str()); + return false; 
+  }
+  return true;
+}
+
+bool TunsafeBackendBsd::RunPrePostCommand(const std::vector<std::string> &vec) {
+  bool success = true;
+  for (auto it = vec.begin(); it != vec.end(); ++it) {
+    success &= RunOneCommand(*it);
+  }
+  return success;
+}
+
+void TunsafeBackendBsd::WriteTunPacket(Packet *packet) {
+  tun_.WriteTunPacket(packet);
+}
+
+// Called to initialize udp
+bool TunsafeBackendBsd::Initialize(int listen_port) {
+  return udp_.Initialize(listen_port, &worker_);
+}
+
+void TunsafeBackendBsd::WriteUdpPacket(Packet *packet) {
+  udp_.WriteUdpPacket(packet);
+}
+
+static TunsafeBackendBsd *g_tunsafe_backend_bsd;
+
+static void SigAlrm(int sig) {
+  if (g_tunsafe_backend_bsd)
+    g_tunsafe_backend_bsd->HandleSigAlrm();
+}
+
+static void SigUsr1(int sig) {
+}
+
+static bool did_ctrlc;
+
+void SigInt(int sig) {
+  if (did_ctrlc)
+    exit(1);
+  did_ctrlc = true;
+  write(1, "Ctrl-C detected. Exiting. Press again to force quit.\n", sizeof("Ctrl-C detected. Exiting. Press again to force quit.\n") - 1);
+
+  if (g_tunsafe_backend_bsd)
+    g_tunsafe_backend_bsd->HandleExit();
+}
+
+void TunsafeBackendBsd::RunLoop() {
+  int free_packet_interval = 10;
+
+  assert(!g_tunsafe_backend_bsd);
+  assert(processor_);
+
+  g_tunsafe_backend_bsd = this;
+  // We want an alarm signal every second.
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigAlrm;
+    if (sigaction(SIGALRM, &act, NULL) < 0) {
+      RERROR("Unable to install SIGALRM handler.");
+      return;
+    }
+  }
+
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigInt;
+    if (sigaction(SIGINT, &act, NULL) < 0) {
+      RERROR("Unable to install SIGINT handler.");
+      return;
+    }
+  }
+
+  {
+    struct sigaction act = {0};
+    act.sa_handler = SigUsr1;
+    if (sigaction(SIGUSR1, &act, NULL) < 0) {
+      RERROR("Unable to install SIGUSR1 handler.");
+      return;
+    }
+  }
+
+#if defined(OS_LINUX) || defined(OS_FREEBSD)
+  {
+    struct itimerspec tv = {0};
+    struct sigevent sev;
+    timer_t timer_id;
+
+    tv.it_interval.tv_sec = 1;
+    tv.it_value.tv_sec = 1;
+
+    sev.sigev_notify = SIGEV_SIGNAL;
+    sev.sigev_signo = SIGALRM;
+    sev.sigev_value.sival_ptr = NULL;
+
+    if (timer_create(CLOCK_MONOTONIC, &sev, &timer_id) < 0) {
+      RERROR("timer_create failed");
+      return;
+    }
+
+    if (timer_settime(timer_id, 0, &tv, NULL) < 0) {
+      RERROR("timer_settime failed");
+      return;
+    }
+  }
+#elif defined(OS_MACOSX)
+  ualarm(1000000, 1000000);
+#endif
+
+  worker_.Initialize(processor_);
+
+  // Start the processing threads
+  udp_.Start();
+  tun_.Start();
+
+  worker_.ThreadMain();
+
+  tun_.Stop();
+  udp_.Stop();
+
+  g_tunsafe_backend_bsd = NULL;
+}
+
+void InitCpuFeatures();
+void Benchmark();
+
+uint32 g_ui_ip;
+
+const char *print_ip(char buf[kSizeOfAddress], in_addr_t ip) {
+  snprintf(buf, kSizeOfAddress, "%d.%d.%d.%d", (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, (ip >> 0) & 0xff);
+  return buf;
+}
+
+class MyProcessorDelegate : public ProcessorDelegate {
+public:
+  virtual void OnConnected(in_addr_t my_ip) {
+    if (my_ip != g_ui_ip) {
+      if (my_ip) {
+        char buf[kSizeOfAddress];
+        print_ip(buf, my_ip);
+        RINFO("Connection established.
IP %s", buf); + } + g_ui_ip = my_ip; + } + } + virtual void OnDisconnected() { + MyProcessorDelegate::OnConnected(0); + } +}; + + + + +int main(int argc, char **argv) { + bool exit_flag = false; + + InitCpuFeatures(); + + if (argc == 2 && strcmp(argv[1], "--benchmark") == 0) { + Benchmark(); + return 0; + } + + fprintf(stderr, "%s\n", TUNSAFE_VERSION_STRING); + + if (argc < 2) { + fprintf(stderr, "Syntax: tunsafe file.conf\n"); + return 1; + } + +#if defined(OS_MACOSX) + InitOsxGetMilliseconds(); +#endif + + SetThreadName("tunsafe-m"); + + + MyProcessorDelegate my_procdel; + TunsafeBackendBsd socket_loop; + WireguardProcessor wg(&socket_loop, &socket_loop, &my_procdel); + socket_loop.SetProcessor(&wg); + + if (!ParseWireGuardConfigFile(&wg, argv[1], &exit_flag)) return 1; + if (!wg.Start()) return 1; + + socket_loop.RunLoop(); + socket_loop.CleanupRoutes(); + + return 0; +} diff --git a/network_win32.cpp b/network_win32.cpp new file mode 100644 index 0000000..beb1b39 --- /dev/null +++ b/network_win32.cpp @@ -0,0 +1,1956 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "stdafx.h" +#include "network_win32.h" +#include "wireguard_config.h" +#include "netapi.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "tunsafe_endian.h" +#include "wireguard.h" +#include "util.h" +#include +#include "network_win32_dnsblock.h" + +enum { + HARD_MAXIMUM_QUEUE_SIZE = 102400, + MAX_BYTES_IN_UDP_OUT_QUEUE = 256 * 1024, + MAX_BYTES_IN_UDP_OUT_QUEUE_SMALL = (256 + 64) * 1024, +}; + +enum { + ROUTE_BLOCK_UNKNOWN = 0, + ROUTE_BLOCK_OFF = 1, + ROUTE_BLOCK_ON = 2, + ROUTE_BLOCK_PENDING = 3, +}; +static uint8 internet_route_blocking_state; +static SLIST_HEADER freelist_head; + +bool g_allow_pre_post; + +Packet *AllocPacket() { + Packet *packet = (Packet*)InterlockedPopEntrySList(&freelist_head); + if (packet == NULL) + packet = (Packet *)_aligned_malloc(kPacketAllocSize, 16); + packet->data = packet->data_buf + Packet::HEADROOM_BEFORE; + packet->size = 0; + return packet; +} + +void FreePacket(Packet *packet) { + InterlockedPushEntrySList(&freelist_head, &packet->list_entry); +} + +extern "C" +PSLIST_ENTRY __fastcall InterlockedPushListSList( + IN PSLIST_HEADER ListHead, + IN PSLIST_ENTRY List, + IN PSLIST_ENTRY ListEnd, + IN ULONG Count +); + +void FreePackets(Packet *packet, Packet **end, int count) { + InterlockedPushListSList(&freelist_head, &packet->list_entry, (PSLIST_ENTRY)end, count); +} + +void FreeAllPackets() { + Packet *p; + p = (Packet*)InterlockedFlushSList(&freelist_head); + while (Packet *r = p) { + p = p->next; + _aligned_free(r); + } +} + +void InitPacketMutexes() { + static bool mutex_inited; + if (!mutex_inited) { + mutex_inited = true; + InitializeSListHead(&freelist_head); + } +} + + +void CallbackUpdateUI(); +void CallbackTriggerReconnect(); +void CallbackSetPublicKey(const uint8 public_key[32]); + +int tpq_last_qsize; +int g_tun_reads, g_tun_writes; + +struct { + uint32 pad1[3]; + uint32 udp_qsize1; + uint32 pad2[3]; + uint32 udp_qsize2; +} qs; + + +#define kConcurrentReadUdp 16 +#define kConcurrentWriteUdp 16 +#define kConcurrentReadTap 16 +#define kConcurrentWriteTap 16 + +#define kAdapterKeyName "SYSTEM\\CurrentControlSet\\Control\\Class\\{4D36E972-E325-11CE-BFC1-08002BE10318}" +#define kTapComponentId "tap0901" + +#define TAP_CONTROL_CODE(request,method) \ + CTL_CODE (FILE_DEVICE_UNKNOWN, request, method, FILE_ANY_ACCESS) + +#define TAP_IOCTL_GET_MAC 
TAP_CONTROL_CODE(1, METHOD_BUFFERED) +#define TAP_IOCTL_GET_VERSION TAP_CONTROL_CODE(2, METHOD_BUFFERED) +#define TAP_IOCTL_GET_MTU TAP_CONTROL_CODE(3, METHOD_BUFFERED) +#define TAP_IOCTL_GET_INFO TAP_CONTROL_CODE(4, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_POINT_TO_POINT TAP_CONTROL_CODE(5, METHOD_BUFFERED) +#define TAP_IOCTL_SET_MEDIA_STATUS TAP_CONTROL_CODE(6, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_DHCP_MASQ TAP_CONTROL_CODE(7, METHOD_BUFFERED) +#define TAP_IOCTL_GET_LOG_LINE TAP_CONTROL_CODE(8, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_DHCP_SET_OPT TAP_CONTROL_CODE(9, METHOD_BUFFERED) +#define TAP_IOCTL_CONFIG_TUN TAP_CONTROL_CODE(10, METHOD_BUFFERED) + +static bool RunNetsh(const char *cmdline) { + wchar_t path[MAX_PATH + 20]; + size_t size = GetSystemDirectoryW(path, MAX_PATH); + bool result = false; + if (!size) { + RERROR("GetSystemDirectory failed"); + return false; + } + memcpy(path + size, L"\\netsh.exe", 11 * sizeof(path[0])); + + size_t cmdline_size = strlen(cmdline); + wchar_t *cmdlinew = new wchar_t[cmdline_size + 1]; + for (size_t i = 0; i <= cmdline_size; i++) + cmdlinew[i] = cmdline[i]; + + STARTUPINFOW si = {0}; + PROCESS_INFORMATION pi = {0}; + + GetStartupInfoW(&si); + si.dwFlags = STARTF_USESHOWWINDOW; + si.wShowWindow = SW_HIDE; + if (CreateProcessW(path, cmdlinew, NULL, NULL, FALSE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi)) { + DWORD exit_code = -1; + WaitForSingleObject(pi.hProcess, INFINITE); + GetExitCodeProcess(pi.hProcess, &exit_code); + if (exit_code != 0) + RERROR("Netsh failed (%d) : %s", exit_code, cmdline); + else { + RINFO("Run: %s", cmdline); + result = true; + } + CloseHandle(pi.hThread); + CloseHandle(pi.hProcess); + } else { + RERROR("CreateProcess failed: %s", cmdline); + } + delete[]cmdlinew; + return result; +} + +// Retrieve the device path to the TAP adapter. 
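+// The adapter is located via the registry rather than a fixed device path:
+// the code below enumerates the subkeys of the network-adapters class key
+// (kAdapterKeyName), looks for a device whose ComponentId is "tap0901", and
+// reads its NetCfgInstanceId GUID (e.g. "{A1B2C3D4-...}", illustrative value
+// only). OpenTunAdapter then opens the device node \\.\Global\<GUID>.tap.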
+static bool GetTapAdapterGuid(char guid[64]) {
+  LONG err;
+  HKEY adapter_key, device_key;
+  bool retval = false;
+  err = RegOpenKeyEx(HKEY_LOCAL_MACHINE, kAdapterKeyName, 0, KEY_READ, &adapter_key);
+  if (err != ERROR_SUCCESS) {
+    RERROR("GetTapAdapterGuid: RegOpenKeyEx failed: 0x%X", err);
+    return false;
+  }
+  for (int i = 0; !retval; i++) {
+    char keyname[64 + sizeof(kAdapterKeyName) + 1];
+    char value[64];
+    DWORD len = sizeof(value), type;
+    err = RegEnumKeyEx(adapter_key, i, value, &len, NULL, NULL, NULL, NULL);
+    if (err == ERROR_NO_MORE_ITEMS)
+      break;
+    if (err != ERROR_SUCCESS) {
+      RERROR("GetTapAdapterGuid: RegEnumKeyEx failed: 0x%X", err);
+      RegCloseKey(adapter_key);
+      return false;
+    }
+    snprintf(keyname, sizeof(keyname), "%s\\%s", kAdapterKeyName, value);
+    err = RegOpenKeyEx(HKEY_LOCAL_MACHINE, keyname, 0, KEY_READ, &device_key);
+    if (err == ERROR_SUCCESS) {
+      len = sizeof(value);
+      err = RegQueryValueEx(device_key, "ComponentId", NULL, &type, (LPBYTE)value, &len);
+      if (err == ERROR_SUCCESS && type == REG_SZ && !memcmp(value, kTapComponentId, sizeof(kTapComponentId))) {
+        len = 64;
+        err = RegQueryValueEx(device_key, "NetCfgInstanceId", NULL, &type, (LPBYTE)guid, &len);
+        if (err == ERROR_SUCCESS && type == REG_SZ) {
+          guid[63] = 0;
+          retval = true;
+        }
+      }
+      RegCloseKey(device_key);
+    }
+  }
+  RegCloseKey(adapter_key);
+  return retval;
+}
+
+// Open the TAP adapter
+static HANDLE OpenTunAdapter(char guid[64], int retry_count, bool *exit_thread, DWORD open_flags) {
+  char path[128];
+  HANDLE h;
+  int retries = 0;
+  if (!GetTapAdapterGuid(guid)) {
+    RERROR("Unable to find ID of TAP adapter");
+    RERROR(" Please ensure that TunSafe-TAP is properly installed.");
+    return NULL;
+  }
+  snprintf(path, sizeof(path), "\\\\.\\Global\\%s.tap", guid);
+RETRY:
+  h = CreateFile(path, GENERIC_READ | GENERIC_WRITE, 0, 0, OPEN_EXISTING,
+                 FILE_ATTRIBUTE_SYSTEM | open_flags, 0);
+  if (h == INVALID_HANDLE_VALUE) {
+    int error_code = GetLastError();
+
+    // Sometimes if you close the device right before, it will fail to open with errorcode 31.
+    // When resuming from sleep in my VM, the error code is ERROR_FILE_NOT_FOUND
+    if ((error_code == ERROR_FILE_NOT_FOUND || error_code == ERROR_GEN_FAILURE) && retry_count != 0 && !*exit_thread) {
+      RERROR("OpenTapAdapter: CreateFile failed: 0x%X... retrying", error_code);
retrying", error_code); + retry_count--; + Sleep(250 * ++retries); + goto RETRY; + } + + RERROR("OpenTapAdapter: CreateFile failed: 0x%X", error_code); + if (error_code == ERROR_FILE_NOT_FOUND) { + RERROR(" Please ensure that TunSafe-TAP is properly installed."); + } else if (error_code == 0x1f) { + RERROR(" Please ensure that the TAP device is not in use."); + } + return NULL; + } + return h; +} + +static bool AddRoute(int family, + const void *dest, int dest_prefix, + const void *gateway, const NET_LUID *interface_luid, + std::vector *undo_array = NULL) { + MIB_IPFORWARD_ROW2 row = {0}; + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + + row.InterfaceLuid = *interface_luid; + row.DestinationPrefix.PrefixLength = dest_prefix; + row.DestinationPrefix.Prefix.si_family = family; + row.NextHop.si_family = family; + if (family == AF_INET) { + memcpy(&row.DestinationPrefix.Prefix.Ipv4.sin_addr, dest, 4); + memcpy(&row.NextHop.Ipv4.sin_addr, gateway, 4); + } else if (family == AF_INET6) { + memcpy(&row.DestinationPrefix.Prefix.Ipv6.sin6_addr, dest, 16); + memcpy(&row.NextHop.Ipv6.sin6_addr, gateway, 16); + } else { + return false; + } + row.ValidLifetime = 0xffffffff; + row.PreferredLifetime = 0xffffffff; + row.Metric = 100; + row.Protocol = MIB_IPPROTO_NETMGMT; + + if (undo_array) + undo_array->push_back(row); + + DWORD error = CreateIpForwardEntry2(&row); + if (error == NO_ERROR || error == ERROR_OBJECT_ALREADY_EXISTS) { + RINFO("Added Route %s => %s", print_ip_prefix(buf1, family, dest, dest_prefix), + print_ip_prefix(buf2, family, gateway, -1)); + return true; + } + RINFO("AddRoute failed (%d) %s => %s", error, print_ip_prefix(buf1, family, dest, dest_prefix), + print_ip_prefix(buf2, family, gateway, -1)); + return false; +} + +static bool DeleteRoute(MIB_IPFORWARD_ROW2 *row) { + char buf1[kSizeOfAddress], buf2[kSizeOfAddress]; + DWORD error = DeleteIpForwardEntry2(row); + + print_ip_prefix(buf1, row->DestinationPrefix.Prefix.si_family, + (row->DestinationPrefix.Prefix.si_family == AF_INET) ? (uint8*) &row->DestinationPrefix.Prefix.Ipv4.sin_addr : (uint8*) &row->DestinationPrefix.Prefix.Ipv6.sin6_addr, row->DestinationPrefix.PrefixLength); + + print_ip_prefix(buf2, row->NextHop.si_family, + (row->NextHop.si_family == AF_INET) ? (uint8*)&row->NextHop.Ipv4.sin_addr : (uint8*)&row->NextHop.Ipv6.sin6_addr, -1); + + if (error == NO_ERROR) { + RINFO("Deleted Route %s => %s", buf1, buf2); + return true; + } + RINFO("DeleteRoute failed (%d) %s => %s", error, buf1, buf2); + return false; +} + + +static uint32 CidrToNetmaskV4(int cidr) { + return cidr == 32 ? 
0xffffffff : 0xffffffff << (32 - cidr); +} + +struct RouteInfo { + uint8 default_gw[16]; + NET_LUID default_adapter; + bool found_default_adapter; + uint8 found_null_routes; +}; + +static inline bool IsRouteOriginatingFromNullRoute(MIB_IPFORWARD_ROW2 *row) { + if (!(row->InterfaceLuid.Info.IfType == 24 && row->Protocol == MIB_IPPROTO_NETMGMT && row->DestinationPrefix.PrefixLength == 1)) + return false; + if (row->NextHop.si_family == AF_INET) { + return (row->NextHop.Ipv4.sin_addr.S_un.S_addr == 0); + } else if (row->NextHop.si_family == AF_INET6) { + static const uint32 nulladdr[4]; + return memcmp(&row->NextHop.Ipv6.sin6_addr, nulladdr, 16) == 0; + } + return false; +} + +static inline bool IsRouteTheAddressOfTheServer(int family, MIB_IPFORWARD_ROW2 *row, uint8 *old_endpoint_to_delete) { + if (!(row->Protocol == MIB_IPPROTO_NETMGMT && row->DestinationPrefix.Prefix.si_family == family)) + return false; + if (family == AF_INET) { + return (row->DestinationPrefix.PrefixLength == 32 && memcmp(&row->DestinationPrefix.Prefix.Ipv4.sin_addr, old_endpoint_to_delete, 4) == 0); + } else if (family == AF_INET6) { + return (row->DestinationPrefix.PrefixLength == 128 && memcmp(&row->DestinationPrefix.Prefix.Ipv6.sin6_addr, old_endpoint_to_delete, 16) == 0); + } + return false; +} + +static void DeleteRouteOrPrintErr(MIB_IPFORWARD_ROW2 *row) { + char buf1[kSizeOfAddress]; + UINT32 r = DeleteIpForwardEntry2(row); + if (r) + RERROR("Unable to delete old route (%d): %s", r, + print_ip_prefix(buf1, row->DestinationPrefix.Prefix.si_family, row->DestinationPrefix.Prefix.si_family == AF_INET ? + (void*)&row->DestinationPrefix.Prefix.Ipv4.sin_addr : + (void*)&row->DestinationPrefix.Prefix.Ipv6.sin6_addr, row->DestinationPrefix.PrefixLength)); +} + +static bool GetDefaultRouteAndDeleteOldRoutes(int family, const NET_LUID *InterfaceLuid, bool keep_null_routes, uint8 *old_endpoint_to_delete, RouteInfo *ri) { + MIB_IPFORWARD_TABLE2 *table = NULL; + + assert(family == AF_INET || family == AF_INET6); + + if (GetIpForwardTable2(family, &table)) + return false; + DWORD rv = 0; + DWORD gw_metric = 0xffffffff; + ri->found_default_adapter = false; + ri->found_null_routes = 0; + for (unsigned i = 0; i < table->NumEntries; i++) { + MIB_IPFORWARD_ROW2 *row = &table->Table[i]; + if (InterfaceLuid && memcmp(&row->InterfaceLuid, InterfaceLuid, sizeof(NET_LUID)) == 0) { + if (row->Protocol == MIB_IPPROTO_NETMGMT) + DeleteRouteOrPrintErr(row); + } else if (IsRouteOriginatingFromNullRoute(row)) { + ri->found_null_routes++; + if (!keep_null_routes) + DeleteRouteOrPrintErr(row); + } else if (row->DestinationPrefix.PrefixLength == 0 && row->Metric < gw_metric) { + gw_metric = row->Metric; + if (family == AF_INET) { + memcpy(&ri->default_gw, &row->NextHop.Ipv4.sin_addr, 4); + } else { + memcpy(&ri->default_gw, &row->NextHop.Ipv6.sin6_addr, 16); + } + ri->default_adapter = row->InterfaceLuid; + ri->found_default_adapter = true; + } + } + + if (old_endpoint_to_delete && ri->found_default_adapter) { + for (unsigned i = 0; i < table->NumEntries; i++) { + MIB_IPFORWARD_ROW2 *row = &table->Table[i]; + if (memcmp(&row->InterfaceLuid, &ri->default_adapter, sizeof(NET_LUID)) == 0) { + if (IsRouteTheAddressOfTheServer(family, row, old_endpoint_to_delete)) + DeleteRouteOrPrintErr(row); + } + } + } + + FreeMibTable(table); + return (rv == 0); +} + +static inline bool NoMoreAllocationRetry(volatile bool *exit_flag) { + if (*exit_flag) + return true; + Sleep(1000); + return *exit_flag; +} + +static inline bool AllocPacketFrom(Packet **list, int 
*counter, bool *exit_flag, Packet **res) { + Packet *p; + if (p = *list) { + *list = p->next; + (*counter)--; + p->data = p->data_buf + Packet::HEADROOM_BEFORE; + } else { + while ((p = AllocPacket()) == NULL) { + if (NoMoreAllocationRetry(exit_flag)) + return false; + } + } + *res = p; + return true; +} + +static void FreePacketList(Packet *pp) { + while (Packet *p = pp) { + pp = p->next; + FreePacket(p); + } +} + +UdpSocketWin32::UdpSocketWin32() { + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + exit_thread_ = false; + socket_ = INVALID_SOCKET; + thread_ = NULL; + socket_ipv6_ = INVALID_SOCKET; + completion_port_handle_ = NULL; + + InitializeCriticalSectionAndSpinCount(&mutex_, 1024); +} + +UdpSocketWin32::~UdpSocketWin32() { + assert(thread_ == NULL); + closesocket(socket_); + closesocket(socket_ipv6_); + CloseHandle(completion_port_handle_); + FreePacketList(wqueue_); + DeleteCriticalSection(&mutex_); +} + +bool UdpSocketWin32::Initialize(int listen_on_port) { + SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED); + if (s == INVALID_SOCKET) { + RERROR("UdpSocketWin32::Initialize WSASocket failed"); + return false; + } + completion_port_handle_ = CreateIoCompletionPort((HANDLE)s, NULL, NULL, 0); + if (!completion_port_handle_) { + closesocket(s); + return false; + } + socket_ = s; + + sockaddr_in sin = {0}; + sin.sin_family = AF_INET; + sin.sin_port = htons(listen_on_port); + if (bind(s, (struct sockaddr*)&sin, sizeof(sin)) != 0) { + RERROR("UdpSocketWin32::Initialize bind failed"); + return false; + } + + // Also open up a socket for ipv6 + s = WSASocket(AF_INET6, SOCK_DGRAM, 0, NULL, 0, WSA_FLAG_OVERLAPPED); + if (s != INVALID_SOCKET) { + if (!CreateIoCompletionPort((HANDLE)s, completion_port_handle_, 1, 0)) { + RERROR("IPv6 Socket completion port failed."); + closesocket(s); + } else { + socket_ipv6_ = s; + sockaddr_in6 sin6 = {0}; + sin6.sin6_family = AF_INET6; + sin6.sin6_port = htons(listen_on_port); + if (bind(s, (struct sockaddr*)&sin6, sizeof(sin6)) != 0) { + RERROR("UdpSocketWin32::Initialize bind failed IPv6"); + } + } + } else { + RERROR("IPv6 Socket creation failed."); + } + return true; +} + +enum { + kUdpGetQueuedCompletionStatusSize = kConcurrentWriteTap + kConcurrentReadTap + 1 +}; + +static inline void ClearOverlapped(OVERLAPPED *o) { + memset(o, 0, sizeof(*o)); +} + +#ifndef STATUS_PORT_UNREACHABLE +#define STATUS_PORT_UNREACHABLE 0xC000023F +#endif + +static inline bool IsIgnoredUdpError(DWORD err) { + return err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET || err == STATUS_PORT_UNREACHABLE; +} + +void UdpSocketWin32::ThreadMain() { + OVERLAPPED_ENTRY entries[kUdpGetQueuedCompletionStatusSize]; + Packet *pending_writes = NULL; + int num_reads[2] = {0,0}, num_writes = 0; + enum { IPV4, IPV6 }; + Packet *finished_reads = NULL, **finished_reads_end = &finished_reads; + Packet *freed_packets = NULL, **freed_packets_end = &freed_packets; + int freed_packets_count = 0; + int max_read_ipv6 = socket_ipv6_ != INVALID_SOCKET ? 1 : 0; + + while (!exit_thread_) { + // Listen with multiple ipv6 packets only if we ever sent an ipv6 packet. 
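+    // Both sockets share one completion port; the IPv4 socket was registered
+    // with completion key 0 and the IPv6 socket with key 1, which is how
+    // completions are matched back to the right num_reads[] slot below.
+    // |max_read_ipv6| starts at a single outstanding read and is raised to
+    // kConcurrentReadTap the first time an IPv6 packet is actually sent.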
+ for (int i = num_reads[IPV6]; i < max_read_ipv6; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; +restart_read_udp6: + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_UDP; + WSABUF wsabuf = {(ULONG)kPacketCapacity, (char*)p->data}; + DWORD flags = 0; + p->sin_size = sizeof(p->addr.sin6); + if (WSARecvFrom(socket_ipv6_, &wsabuf, 1, NULL, &flags, (struct sockaddr*)&p->addr, &p->sin_size, &p->overlapped, NULL) != 0) { + DWORD err = WSAGetLastError(); + if (err != WSA_IO_PENDING) { + if (err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET) + goto restart_read_udp6; + RERROR("UdpSocketWin32:WSARecvFrom failed 0x%X", err); + FreePacket(p); + break; + } + } + num_reads[IPV6]++; + } + + // Initiate more reads, reusing the Packet structures in |finished_writes|. + for (int i = num_reads[IPV4]; i < kConcurrentReadTap; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; +restart_read_udp: + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_UDP; + WSABUF wsabuf = {(ULONG)kPacketCapacity, (char*)p->data}; + DWORD flags = 0; + p->sin_size = sizeof(p->addr.sin); + if (WSARecvFrom(socket_, &wsabuf, 1, NULL, &flags, (struct sockaddr*)&p->addr, &p->sin_size, &p->overlapped, NULL) != 0) { + DWORD err = WSAGetLastError(); + if (err != WSA_IO_PENDING) { + if (err == WSAEMSGSIZE || err == WSAECONNRESET || err == WSAENETRESET) + goto restart_read_udp; + RERROR("UdpSocketWin32:WSARecvFrom failed 0x%X", err); + FreePacket(p); + break; + } + } + num_reads[IPV4]++; + } + + assert(freed_packets_count >= 0); + if (freed_packets_count >= 32) { + FreePackets(freed_packets, freed_packets_end, freed_packets_count); + freed_packets_count = 0; + freed_packets_end = &freed_packets; + } else if (freed_packets == NULL) { + assert(freed_packets_count == 0); + freed_packets_end = &freed_packets; + } + + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, kUdpGetQueuedCompletionStatusSize, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + finished_reads_end = &finished_reads; + + int finished_reads_count = 0; + // Go through the finished entries and determine which ones are reads, and which ones are writes. 
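+    // Each OVERLAPPED is embedded in its Packet, so the owning Packet is
+    // recovered from lpOverlapped with offsetof(). |overlapped.Internal| holds
+    // the NTSTATUS of the finished operation and |InternalHigh| the byte
+    // count; |post_target|, set when the I/O was issued, tells reads and
+    // writes apart.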
+ for (ULONG i = 0; i < num_entries; i++) { + if (!entries[i].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[i].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_UDP) { + num_reads[entries[i].lpCompletionKey]--; + if ((DWORD)p->overlapped.Internal != 0) { + if (!IsIgnoredUdpError((DWORD)p->overlapped.Internal)) + RERROR("UdpSocketWin32::Read error 0x%X", (DWORD)p->overlapped.Internal); + FreePacket(p); + continue; + } + p->size = (int)p->overlapped.InternalHigh; + *finished_reads_end = p; + finished_reads_end = &p->next; + finished_reads_count++; + } else { + num_writes--; + if ((DWORD)p->overlapped.Internal != 0) { + RERROR("UdpSocketWin32::Write error 0x%X", (DWORD)p->overlapped.Internal); + FreePacket(p); + continue; + } + *freed_packets_end = p; + freed_packets_end = &p->next; + freed_packets_count++; + } + } + *finished_reads_end = NULL; + *freed_packets_end = NULL; + assert(num_writes >= 0); + + // Push all the finished reads to the packet handler + if (finished_reads != NULL) { + packet_handler_->Post(finished_reads, finished_reads_end, finished_reads_count); + } + // Initiate more writes from |wqueue_| + while (num_writes < kConcurrentWriteTap) { + // Refill from queue if empty, avoid taking the mutex if it looks empty + if (!pending_writes) { + if (!wqueue_) + break; + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + if (!pending_writes) + break; + } + + qs.udp_qsize1+= pending_writes->size; + + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + ClearOverlapped(&p->overlapped); + p->post_target = ThreadedPacketQueue::TARGET_UDP_DEVICE; + WSABUF wsabuf = {(ULONG)p->size, (char*)p->data}; + + int rv; + if (p->addr.sin.sin_family == AF_INET) { + rv = WSASendTo(socket_, &wsabuf, 1, NULL, 0, (struct sockaddr*)&p->addr.sin, sizeof(p->addr.sin), &p->overlapped, NULL); + } else { + if (socket_ipv6_ == INVALID_SOCKET) { + RERROR("UdpSocketWin32: unavailable ipv6 socket"); + FreePacket(p); + continue; + } + max_read_ipv6 = kConcurrentReadTap; + rv = WSASendTo(socket_ipv6_, &wsabuf, 1, NULL, 0, (struct sockaddr*)&p->addr.sin6, sizeof(p->addr.sin6), &p->overlapped, NULL); + } + if (rv != 0) { + DWORD err = WSAGetLastError(); + if (err != ERROR_IO_PENDING) { + RERROR("UdpSocketWin32: WSASendTo failed 0x%X", err); + FreePacket(p); + continue; + } + } + num_writes++; + } + } + FreePacketList(freed_packets); + FreePacketList(pending_writes); + + // Cancel all IO and wait for all completions + CancelIo((HANDLE)socket_); + CancelIo((HANDLE)socket_ipv6_); + + while (num_reads[IPV4] + num_reads[IPV6] + num_writes) { + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, 1, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + if (!entries[0].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[0].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_UDP) { + num_reads[entries[0].lpCompletionKey]--; + } else { + num_writes--; + } + FreePacket(p); + } +} + + + +// Called on another thread to queue up a udp packet +void UdpSocketWin32::WriteUdpPacket(Packet *packet) { + if (qs.udp_qsize2 - qs.udp_qsize1 >= (unsigned)(packet->size < 576 ? 
MAX_BYTES_IN_UDP_OUT_QUEUE_SMALL : MAX_BYTES_IN_UDP_OUT_QUEUE)) {
+    FreePacket(packet);
+    return;
+  }
+  packet->next = NULL;
+  qs.udp_qsize2 += packet->size;
+
+  EnterCriticalSection(&mutex_);
+  Packet *was_empty = wqueue_;
+  *wqueue_end_ = packet;
+  wqueue_end_ = &packet->next;
+  LeaveCriticalSection(&mutex_);
+
+  if (was_empty == NULL) {
+    // Notify the worker thread that it should attempt more writes
+    PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  }
+}
+
+DWORD WINAPI UdpSocketWin32::UdpThread(void *x) {
+  UdpSocketWin32 *udp = (UdpSocketWin32 *)x;
+  udp->ThreadMain();
+  return 0;
+}
+
+void UdpSocketWin32::StartThread() {
+  DWORD thread_id;
+  thread_ = CreateThread(NULL, 0, &UdpThread, this, 0, &thread_id);
+  SetThreadPriority(thread_, THREAD_PRIORITY_ABOVE_NORMAL);
+}
+
+void UdpSocketWin32::StopThread() {
+  exit_thread_ = true;
+  PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  WaitForSingleObject(thread_, INFINITE);
+  CloseHandle(thread_);
+  thread_ = NULL;
+}
+
+ThreadedPacketQueue::ThreadedPacketQueue(WireguardProcessor *wg, NetworkStats *stats) {
+  wg_ = wg;
+  stats_ = stats;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  event_ = CreateEvent(NULL, FALSE, FALSE, NULL);
+
+  last_ptr_ = &first_;
+  first_ = NULL;
+  handle_ = NULL;
+  timer_handle_ = NULL;
+  exit_flag_ = false;
+  timer_interrupt_ = false;
+  packets_in_queue_ = 0;
+  need_notify_ = 0;
+}
+
+ThreadedPacketQueue::~ThreadedPacketQueue() {
+  assert(handle_ == NULL);
+  assert(timer_handle_ == NULL);
+  first_ = NULL;
+  last_ptr_ = &first_;
+  DeleteCriticalSection(&mutex_);
+  CloseHandle(event_);
+}
+
+DWORD WINAPI ThreadedPacketQueue::ThreadedPacketQueueLauncher(VOID *x) {
+  ThreadedPacketQueue *pq = (ThreadedPacketQueue *)x;
+  return pq->ThreadMain();
+}
+
+DWORD ThreadedPacketQueue::ThreadMain() {
+  int free_packets_ctr = 0;
+  int overload = 0;
+
+  EnterCriticalSection(&mutex_);
+  while (!exit_flag_) {
+    if (timer_interrupt_) {
+      timer_interrupt_ = false;
+      need_notify_ = 0;
+      LeaveCriticalSection(&mutex_);
+      wg_->SecondLoop();
+      EnterCriticalSection(&stats_->mutex);
+      if (stats_->reset_stats) {
+        stats_->reset_stats = false;
+        wg_->ResetStats();
+      }
+      stats_->packet_stats = wg_->GetStats();
+      LeaveCriticalSection(&stats_->mutex);
+
+      CallbackUpdateUI();
+
+      // Conserve memory every 10s
+      if (free_packets_ctr++ == 10) {
+        free_packets_ctr = 0;
+        FreeAllPackets();
+      }
+      if (overload)
+        overload -= 1;
+      EnterCriticalSection(&mutex_);
+      continue;
+    }
+
+    // Grab the elements of the queue
+    Packet *packet = first_;
+    if (packet == NULL) {
+      need_notify_ = 1;
+      LeaveCriticalSection(&mutex_);
+      WaitForSingleObject(event_, INFINITE);
+      EnterCriticalSection(&mutex_);
+
+      //SleepConditionVariableCS(&cv_, &mutex, INFINITE);
+      continue;
+    }
+    // Steal the whole work queue
+    first_ = NULL;
+    last_ptr_ = &first_;
+    int packets_in_queue = packets_in_queue_;
+    packets_in_queue_ = 0;
+    need_notify_ = 0;
+    LeaveCriticalSection(&mutex_);
+
+    tpq_last_qsize = packets_in_queue;
+    if (packets_in_queue >= 1024)
+      overload = 2;
+    bool is_overload = (overload != 0);
+
+    WireguardProcessor *procint = wg_;
+    do {
+      Packet *next = packet->next;
+      if (packet->post_target == TARGET_PROCESSOR_UDP)
+        procint->HandleUdpPacket(packet, is_overload);
+      else
+        procint->HandleTunPacket(packet);
+      packet = next;
+    } while (packet);
+    EnterCriticalSection(&mutex_);
+  }
+  LeaveCriticalSection(&mutex_);
+  return 0;
+}
+
+void ThreadedPacketQueue::Start() {
+  if (handle_ == NULL) {
+    exit_flag_ = false;
DWORD thread_id; + handle_ = CreateThread(NULL, 0, &ThreadedPacketQueueLauncher, this, 0, &thread_id); + } + + assert(timer_handle_ == NULL); + timer_handle_ = CreateWaitableTimer(NULL, FALSE, NULL); + long long due_time = 10000000; + SetWaitableTimer(timer_handle_, (LARGE_INTEGER*)&due_time, 1000, &TimerRoutine, this, FALSE); +} + +void ThreadedPacketQueue::Stop() { + EnterCriticalSection(&mutex_); + exit_flag_ = true; + LeaveCriticalSection(&mutex_); + + SetEvent(event_); + + if (timer_handle_ != NULL) { + // Not sure if just CloseHandle will close any outstanding APCs + CancelWaitableTimer(timer_handle_); + CloseHandle(timer_handle_); + timer_handle_ = NULL; + } + + if (handle_ != NULL) { + WaitForSingleObject(handle_, INFINITE); + CloseHandle(handle_); + handle_ = NULL; + } + +} + +void ThreadedPacketQueue::AbortingDriver() { + EnterCriticalSection(&mutex_); + exit_flag_ = true; + LeaveCriticalSection(&mutex_); +} + +void ThreadedPacketQueue::Post(Packet *packet, Packet **end, int count) { + EnterCriticalSection(&mutex_); + if (packets_in_queue_ >= HARD_MAXIMUM_QUEUE_SIZE) { + LeaveCriticalSection(&mutex_); + FreePackets(packet, end, count); + return; + } + assert(packet != NULL); + if (!first_) { + assert(last_ptr_ == &first_); + } + packets_in_queue_ += count; + *last_ptr_ = packet; + last_ptr_ = end; + if (!first_) { + assert(last_ptr_ == &first_); + } + if (need_notify_) { + need_notify_ = 0; + LeaveCriticalSection(&mutex_); + SetEvent(event_); + return; + } + LeaveCriticalSection(&mutex_); +} + +void CALLBACK ThreadedPacketQueue::TimerRoutine(LPVOID lpArgToCompletionRoutine, DWORD dwTimerLowValue, DWORD dwTimerHighValue) { + ((ThreadedPacketQueue*)lpArgToCompletionRoutine)->PostTimerInterrupt(); +} + +void ThreadedPacketQueue::PostTimerInterrupt() { + EnterCriticalSection(&mutex_); + timer_interrupt_ = true; + if (need_notify_) { + need_notify_ = 0; + LeaveCriticalSection(&mutex_); + SetEvent(event_); + return; + } + LeaveCriticalSection(&mutex_); +} + +bool GetNetLuidFromGuid(const char *adapter_guid, NET_LUID *luid) { + char buffer[64]; + UUID uuid; + size_t len = strlen(adapter_guid); + if (adapter_guid[0] != '{' || adapter_guid[len - 1] != '}' || len >= 64) return false; + buffer[len - 2] = 0; + memcpy(buffer, adapter_guid + 1, len - 2); + RPC_STATUS status = UuidFromStringA((RPC_CSTR)buffer, &uuid); + if (status != 0) + return false; + return ConvertInterfaceGuidToLuid((GUID*)&uuid, luid) == 0; +} + +DWORD SetMtuOnNetworkAdapter(NET_LUID *InterfaceLuid, ADDRESS_FAMILY family, int new_mtu) { + MIB_IPINTERFACE_ROW row; + DWORD err; + InitializeIpInterfaceEntry(&row); + row.Family = family; + row.InterfaceLuid = *InterfaceLuid; + if ((err = GetIpInterfaceEntry(&row)) == 0) { + row.NlMtu = new_mtu; + if (row.Family == AF_INET) + row.SitePrefixLength = 0; + err = SetIpInterfaceEntry(&row); + } + return err; +} + +DWORD SetMetricOnNetworkAdapter(NET_LUID *InterfaceLuid, ADDRESS_FAMILY family, int new_metric) { + MIB_IPINTERFACE_ROW row; + DWORD err; + InitializeIpInterfaceEntry(&row); + row.Family = family; + row.InterfaceLuid = *InterfaceLuid; + if ((err = GetIpInterfaceEntry(&row)) == 0) { + row.Metric = new_metric; + row.UseAutomaticMetric = (new_metric == 0); + if (row.Family == AF_INET) + row.SitePrefixLength = 0; + err = SetIpInterfaceEntry(&row); + } + return err; +} + +static const char *PrintIPV6(const uint8 new_address[16]) { + sockaddr_in6 sin6 = {0}; + static char buf[100]; + if (!inet_ntop(PF_INET6, new_address, buf, 100)) + memcpy(buf, "unknown", 8); + return buf; +} 
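+// Note: PrintIPV6 formats into a static buffer, so the returned pointer is
+// only valid until the next call and the helper is not thread safe; it is
+// only used for one-off log messages here.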
+ +static bool SetIPV6AddressOnInterface(NET_LUID *InterfaceLuid, const uint8 new_address[16], int new_cidr) { + NETIO_STATUS Status; + PMIB_UNICASTIPADDRESS_TABLE table = NULL; + Status = GetUnicastIpAddressTable(AF_INET6, &table); + if (Status != 0) { + RERROR("GetUnicastAddressTable Failed. Error %d\n", Status); + return false; + } + + bool found_row = false; + for (int i = 0; i < (int)table->NumEntries; i++) { + MIB_UNICASTIPADDRESS_ROW *row = &table->Table[i]; + if (!memcmp(&row->InterfaceLuid, InterfaceLuid, sizeof(NET_LUID))) { + if (row->PrefixOrigin == 1 && row->SuffixOrigin == 1) { + if (row->OnLinkPrefixLength == new_cidr && !memcmp(&row->Address.Ipv6.sin6_addr, new_address, 16)) { + found_row = true; + continue; + } + Status = DeleteUnicastIpAddressEntry(row); + if (Status) + RERROR("Error %d deleting IPv6 address: %s/%d", Status, PrintIPV6((uint8*)&row->Address.Ipv6.sin6_addr), row->OnLinkPrefixLength); + else + RINFO("Deleted IPv6 address: %s/%d", PrintIPV6((uint8*)&row->Address.Ipv6.sin6_addr), row->OnLinkPrefixLength); + } + } + } + FreeMibTable(table); + + if (found_row) { + RINFO("Using IPv6 address: %s/%d", PrintIPV6(new_address), new_cidr); + return true; + } + + MIB_UNICASTIPADDRESS_ROW Row; + InitializeUnicastIpAddressEntry(&Row); + Row.OnLinkPrefixLength = new_cidr; + Row.Address.si_family = AF_INET6; + memcpy(&Row.Address.Ipv6.sin6_addr, new_address, 16); + Row.InterfaceLuid = *InterfaceLuid; + Status = CreateUnicastIpAddressEntry(&Row); + if (Status != 0) { + RERROR("Error %d setting IPv6 address: %s/%d", Status, PrintIPV6(new_address), new_cidr); + return false; + } + RINFO("Set IPV6 Address to: %s/%d", PrintIPV6(new_address), new_cidr); + return true; +} + +static bool IsIpv6AddressSet(const void *p) { + return (ReadLE64(p) | ReadLE64((char*)p + 8)) != 0; +} + + +static bool SetIPV6DnsOnInterface(NET_LUID *InterfaceLuid, const uint8 new_address[16]) { + char buf[128]; + char ipv6[128]; + NET_IFINDEX InterfaceIndex; + if (ConvertInterfaceLuidToIndex(InterfaceLuid, &InterfaceIndex)) + return false; + if (IsIpv6AddressSet(new_address)) { + if (!inet_ntop(AF_INET6, new_address, ipv6, sizeof(ipv6))) + return false; + + snprintf(buf, sizeof(buf), "netsh interface ipv6 set dns name=%d static %s validate=no", InterfaceIndex, ipv6); + } else { + snprintf(buf, sizeof(buf), "netsh interface ipv6 delete dns name=%d all", InterfaceIndex); + } + return RunNetsh(buf); +} + +static uint32 ComputeIpv4DefaultRoute(uint32 ip, uint32 netmask) { + uint32 default_route_v4 = (ip & netmask) | 1; + if (default_route_v4 == ip) + default_route_v4++; + return default_route_v4; +} + +static void ComputeIpv6DefaultRoute(const uint8 *ipv6_address, uint8 ipv6_cidr, uint8 *default_route_v6) { + memcpy(default_route_v6, ipv6_address, 16); + // clear the last bits of the ipv6 address to match the cidr. 
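+  // The bytes fully covered by the prefix are kept, the rest are zeroed, and a
+  // partial byte on the boundary keeps only its prefix bits; forcing the last
+  // byte nonzero below yields an address inside the tunnel subnet that can be
+  // used as the next hop for routes pointing at the adapter.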
+  size_t n = (ipv6_cidr + 7) >> 3;
+  memset(&default_route_v6[n], 0, 16 - n);
+  if (n == 0)
+    return;
+  // adjust the final byte (only when the cidr does not fall on a byte
+  // boundary; otherwise that byte is already fully part of the prefix)
+  if (ipv6_cidr & 7)
+    default_route_v6[n - 1] &= ~(0xff >> (ipv6_cidr & 7));
+  // set the very last byte to something
+  default_route_v6[15] |= 1;
+  // ensure it doesn't collide
+  if (memcmp(default_route_v6, ipv6_address, 16) == 0)
+    default_route_v6[15] ^= 3;
+}
+
+static bool AddMultipleCatchallRoutes(int inet, int bits, const uint8 *target, const NET_LUID &luid) {
+  uint8 tmp[16] = {0};
+  bool success = true;
+  for (int i = 0; i < (1 << bits); i++) {
+    tmp[0] = i << (8 - bits);
+    success &= AddRoute(inet, tmp, bits, target, &luid);
+  }
+  return success;
+}
+
+static uint8 GetInternetRouteBlockingState() {
+  if (internet_route_blocking_state == ROUTE_BLOCK_UNKNOWN) {
+    RouteInfo ri;
+    internet_route_blocking_state =
+        (GetDefaultRouteAndDeleteOldRoutes(AF_INET, NULL, TRUE, NULL, &ri) && ri.found_null_routes == 2) + ROUTE_BLOCK_OFF;
+  }
+  return internet_route_blocking_state;
+}
+
+static void SetInternetRouteBlockingState(bool want) {
+  if (want) {
+    internet_route_blocking_state = ROUTE_BLOCK_PENDING;
+  } else if (internet_route_blocking_state != ROUTE_BLOCK_OFF) {
+    RouteInfo ri;
+    GetDefaultRouteAndDeleteOldRoutes(AF_INET, NULL, FALSE, NULL, &ri);
+    GetDefaultRouteAndDeleteOldRoutes(AF_INET6, NULL, FALSE, NULL, &ri);
+    internet_route_blocking_state = ROUTE_BLOCK_OFF;
+  }
+}
+
+InternetBlockState GetInternetBlockState(bool *is_activated) {
+  int a = GetInternetRouteBlockingState();
+  int b = GetInternetFwBlockingState();
+
+  if (is_activated)
+    *is_activated = (a == ROUTE_BLOCK_ON || b == IBS_ACTIVE);
+
+  return (InternetBlockState)(
+      (a >= ROUTE_BLOCK_ON) * kBlockInternet_Route +
+      (b >= IBS_ACTIVE) * kBlockInternet_Firewall);
+}
+
+void SetInternetBlockState(InternetBlockState s) {
+  SetInternetRouteBlockingState((s & kBlockInternet_Route) != 0);
+  SetInternetFwBlockingState((s & kBlockInternet_Firewall) != 0);
+}
+
+TunWin32Adapter::TunWin32Adapter() {
+  handle_ = NULL;
+  current_dns_block_ = NULL;
+}
+
+TunWin32Adapter::~TunWin32Adapter() {
+}
+
+bool TunWin32Adapter::OpenAdapter(bool *exit_thread, DWORD open_flags) {
+  int retry_count = 10;
+  handle_ = OpenTunAdapter(guid_, retry_count, exit_thread, open_flags);
+  return (handle_ != NULL);
+}
+
+bool TunWin32Adapter::InitAdapter(const TunInterface::TunConfig &&config, TunInterface::TunConfigOut *out) {
+  ULONG info[3];
+  DWORD len;
+  out->enable_neighbor_discovery_spoofing = false;
+
+  if (!RunPrePostCommand(config.pre_post_commands.pre_up)) {
+    RERROR("Pre command failed!");
+    return false;
+  }
+
+  memset(info, 0, sizeof(info));
+  if (DeviceIoControl(handle_, TAP_IOCTL_GET_VERSION, &info, sizeof(info),
+                      &info, sizeof(info), &len, NULL)) {
+    RINFO("TAP Driver Version %d.%d %s", (int)info[0], (int)info[1], (info[2] ? "(DEBUG)" : ""));
+  }
+
+  if (info[0] < 9 || (info[0] == 9 && info[1] <= 8)) {
+    RERROR("TAP is too old.
Go to https://tunsafe.com/download to upgrade the driver"); + return false; + } + + // ULONG mtu = 0; + // if (DeviceIoControl(handle_, TAP_IOCTL_GET_MTU, &mtu, sizeof(mtu), &mtu, sizeof(mtu), &len, NULL)) + // RINFO("TAP-Win32 MTU=%d", (int)mtu); + // mtu_ = mtu; + + uint32 netmask = CidrToNetmaskV4(config.cidr); + + // Set TAP-Windows TUN subnet mode + if (1) { + uint32 v[3]; + + v[0] = htonl(config.ip); + v[1] = htonl(config.ip & netmask); + v[2] = htonl(netmask); + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_TUN, v, sizeof(v), v, sizeof(v), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_TUN) failed"); + return false; + } + } + + // Set DHCP IP/netmask + { + uint32 v[4]; + v[0] = htonl(config.ip); + v[1] = htonl(netmask); + v[2] = htonl((config.ip | ~netmask) - 1); // x.x.x.254 + v[3] = 31536000; // One year + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_DHCP_MASQ, v, sizeof(v), v, sizeof(v), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_DHCP_MASQ) failed"); + return false; + } + } + + bool has_dns_setting = false; + + // Set DHCP config string + if (config.dhcp_options_size != 0) { + byte output[10]; + if (!DeviceIoControl(handle_, TAP_IOCTL_CONFIG_DHCP_SET_OPT, + (void*)config.dhcp_options, (DWORD)config.dhcp_options_size, output, sizeof(output), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_CONFIG_DHCP_SET_OPT) failed"); + return false; + } + has_dns_setting = true; + } + + // Get device MAC address + if (!DeviceIoControl(handle_, TAP_IOCTL_GET_MAC, mac_adress_, 6, mac_adress_, sizeof(mac_adress_), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_GET_MAC) failed"); + } else { + out->enable_neighbor_discovery_spoofing = true; + memcpy(out->neighbor_discovery_spoofing_mac, mac_adress_, sizeof(out->neighbor_discovery_spoofing_mac)); + } + + // Set driver media status to 'connected' + ULONG status = TRUE; + if (!DeviceIoControl(handle_, TAP_IOCTL_SET_MEDIA_STATUS, &status, sizeof(status), + &status, sizeof(status), &len, NULL)) { + RERROR("DeviceIoControl(TAP_IOCTL_SET_MEDIA_STATUS) failed"); + return false; + } + + NET_LUID InterfaceLuid = {0}; + bool has_interface_luid = GetNetLuidFromGuid(guid_, &InterfaceLuid); + + if (!has_interface_luid) { + RERROR("Unable to determine interface luid for %s.", guid_); + return false; + } + + DWORD err; + + if (config.mtu) { + err = SetMtuOnNetworkAdapter(&InterfaceLuid, AF_INET, config.mtu); + if (err) + RERROR("SetMtuOnNetworkAdapter IPv4 failed: %d", err); + if (config.ipv6_cidr) { + err = SetMtuOnNetworkAdapter(&InterfaceLuid, AF_INET6, config.mtu); + if (err) + RERROR("SetMtuOnNetworkAdapter IPv6 failed: %d", err); + } + } + + if (config.ipv6_cidr) { + SetIPV6AddressOnInterface(&InterfaceLuid, config.ipv6_address, config.ipv6_cidr); + if (config.set_ipv6_dns) { + has_dns_setting |= IsIpv6AddressSet(config.dns_server_v6); + if (!SetIPV6DnsOnInterface(&InterfaceLuid, config.dns_server_v6)) { + RERROR("SetIPV6DnsOnInterface: failed"); + } + } + } + + if (has_dns_setting && config.block_dns_on_adapters) { + RINFO("Blocking standard DNS on all adapters"); + current_dns_block_ = BlockDnsExceptOnAdapter(InterfaceLuid, config.ipv6_cidr != 0); + + err = SetMetricOnNetworkAdapter(&InterfaceLuid, AF_INET, 2); + if (err) + RERROR("SetMetricOnNetworkAdapter IPv4 failed: %d", err); + + if (config.ipv6_cidr) { + err = SetMetricOnNetworkAdapter(&InterfaceLuid, AF_INET6, 2); + if (err) + RERROR("SetMetricOnNetworkAdapter IPv6 failed: %d", err); + } + } + + uint8 ibs = config.internet_blocking; + if (ibs == 
kBlockInternet_Default || ibs == kBlockInternet_DefaultOn) { + uint8 new_ibs = GetInternetBlockState(NULL); + ibs = (new_ibs == kBlockInternet_Off && ibs == kBlockInternet_DefaultOn) ? kBlockInternet_Firewall : new_ibs; + } + + bool block_all_traffic_route = (ibs & kBlockInternet_Route) != 0; + + RouteInfo ri, ri6; + + uint32 default_route_endpoint_v4 = ToBE32(config.default_route_endpoint_v4); + + // Delete any current /1 default routes and read some stuff from the routing table. + if (!GetDefaultRouteAndDeleteOldRoutes(AF_INET, &InterfaceLuid, block_all_traffic_route, config.use_ipv4_default_route ? (uint8*)&default_route_endpoint_v4 : NULL, &ri)) { + RERROR("Unable to read old default gateway and delete old default routes."); + return false; + } + + if (config.ipv6_cidr) { + // Delete any current /1 default routes and read some stuff from the routing table. + if (!GetDefaultRouteAndDeleteOldRoutes(AF_INET6, &InterfaceLuid, block_all_traffic_route, config.use_ipv6_default_route ? (uint8*)config.default_route_endpoint_v6 : NULL, &ri6)) { + RERROR("Unable to read old default gateway and delete old default routes for IPv6."); + return false; + } + } + + uint32 default_route_v4 = ComputeIpv4DefaultRoute(config.ip, netmask); + uint8 default_route_v6[16]; + + if (block_all_traffic_route) { + RINFO("Blocking all regular Internet traffic using routing rules"); + NET_LUID localhost_luid; + if (ConvertInterfaceIndexToLuid(1, &localhost_luid) || localhost_luid.Info.IfType != 24) { + RERROR("Unable to get localhost luid - while adding route based blocking."); + } else { + uint32 dst[4] = {0}; + if (!AddMultipleCatchallRoutes(AF_INET, 1, (uint8*)&dst, localhost_luid)) + RERROR("Unable to add routes for route based blocking."); + if (config.ipv6_cidr) { + if (!AddMultipleCatchallRoutes(AF_INET6, 1, (uint8*)&dst, localhost_luid)) + RERROR("Unable to add IPv6 routes for route based blocking."); + } + } + } + + internet_route_blocking_state = block_all_traffic_route + ROUTE_BLOCK_OFF; + + if (ibs & kBlockInternet_Firewall) { + RINFO("Blocking all regular Internet traffic%s", ri.found_default_adapter ? " (except DHCP)" : ""); + AddPersistentInternetBlocking(ri.found_default_adapter ? &ri.default_adapter : NULL, InterfaceLuid, config.ipv6_cidr != 0); + } else { + SetInternetFwBlockingState(false); + } + + // Configure default route? + if (config.use_ipv4_default_route) { + // Add a bypass route to the original gateway? + if (config.default_route_endpoint_v4 != 0) { + if (!ri.found_default_adapter) { + RERROR("Unable to read old ipv4 default gateway"); + return false; + } + if (!AddRoute(AF_INET, &default_route_endpoint_v4, 32, ri.default_gw, &ri.default_adapter, &routes_to_undo_)) { + RERROR("Unable to add ipv4 gateway bypass route."); + return false; + } + } + // Either add 4 routes or 2 routes, depending on if we use route blocking. + uint32 be = ToBE32(default_route_v4); + if (!AddMultipleCatchallRoutes(AF_INET, block_all_traffic_route ? 2 : 1, (uint8*)&be, InterfaceLuid)) + RERROR("Unable to add new default ipv4 route."); + } + + if (config.ipv6_cidr) { + ComputeIpv6DefaultRoute(config.ipv6_address, config.ipv6_cidr, default_route_v6); + + // Configure default route? 
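+    // As in the IPv4 case above, the existing default route is overridden
+    // rather than replaced: AddMultipleCatchallRoutes() installs 2^bits routes
+    // covering the whole address space (::/1 and 8000::/1 for bits == 1),
+    // which are more specific than ::/0 and therefore win; bits == 2 is used
+    // when route-based blocking already occupies the /1 slots.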
+    if (config.use_ipv6_default_route) {
+      if (IsIpv6AddressSet(config.default_route_endpoint_v6)) {
+        if (!ri6.found_default_adapter) {
+          RERROR("Unable to read old ipv6 default gateway");
+          return false;
+        }
+        if (!AddRoute(AF_INET6, config.default_route_endpoint_v6, 128, ri6.default_gw, &ri6.default_adapter, &routes_to_undo_)) {
+          RERROR("Unable to add ipv6 gateway bypass route.");
+          return false;
+        }
+      }
+      if (!AddMultipleCatchallRoutes(AF_INET6, block_all_traffic_route ? 2 : 1, default_route_v6, InterfaceLuid))
+        RERROR("Unable to add new default ipv6 route.");
+    }
+  }
+
+  // Add all the extra routes
+  for (auto it = config.extra_routes.begin(); it != config.extra_routes.end(); ++it) {
+    if (it->size == 32) {
+      uint32 be = ToBE32(default_route_v4);
+      AddRoute(AF_INET, it->addr, it->cidr, &be, &InterfaceLuid);
+    } else if (it->size == 128 && config.ipv6_cidr) {
+      AddRoute(AF_INET6, it->addr, it->cidr, default_route_v6, &InterfaceLuid);
+    }
+  }
+
+  NET_IFINDEX InterfaceIndex;
+  if (ConvertInterfaceLuidToIndex(&InterfaceLuid, &InterfaceIndex)) {
+    RERROR("Unable to get index of adapter");
+    return false;
+  }
+  if ((err = FlushIpNetTable2(AF_INET, InterfaceIndex)) != NO_ERROR) {
+    RERROR("FlushIpNetTable failed: 0x%X", err);
+    return false;
+  }
+  if (config.ipv6_cidr) {
+    if ((err = FlushIpNetTable2(AF_INET6, InterfaceIndex)) != NO_ERROR) {
+      RERROR("FlushIpNetTable failed: 0x%X", err);
+      return false;
+    }
+  }
+
+  RunPrePostCommand(config.pre_post_commands.post_up);
+
+  pre_down_ = std::move(config.pre_post_commands.pre_down);
+  post_down_ = std::move(config.pre_post_commands.post_down);
+
+  return true;
+}
+
+void TunWin32Adapter::CloseAdapter() {
+  RunPrePostCommand(pre_down_);
+
+  if (handle_ != NULL) {
+    ULONG status = FALSE;
+    DWORD len;
+    DeviceIoControl(handle_, TAP_IOCTL_SET_MEDIA_STATUS, &status, sizeof(status),
+                    &status, sizeof(status), &len, NULL);
+    CloseHandle(handle_);
+    handle_ = NULL;
+  }
+
+  for (auto it = routes_to_undo_.begin(); it != routes_to_undo_.end(); ++it)
+    DeleteRoute(&*it);
+  routes_to_undo_.clear();
+
+  RestoreDnsExceptOnAdapter(current_dns_block_);
+  current_dns_block_ = NULL;
+
+  RunPrePostCommand(post_down_);
+}
+
+static bool RunOneCommand(const std::string &cmd) {
+  std::string command = "cmd.exe /C " + cmd;
+
+  STARTUPINFOA si = {0};
+  PROCESS_INFORMATION pi = {0};
+
+  HANDLE hstdout_wr = NULL, hstdout_rd = NULL;
+  HANDLE hstdin_wr = NULL, hstdin_rd = NULL;
+
+  bool result = false;
+
+  SECURITY_ATTRIBUTES saAttr;
+  saAttr.nLength = sizeof(SECURITY_ATTRIBUTES);
+  saAttr.bInheritHandle = TRUE;
+  saAttr.lpSecurityDescriptor = NULL;
+
+  if (!CreatePipe(&hstdout_rd, &hstdout_wr, &saAttr, 0) ||
+      !CreatePipe(&hstdin_rd, &hstdin_wr, &saAttr, 0) ||
+      !SetHandleInformation(hstdout_rd, HANDLE_FLAG_INHERIT, 0) ||
+      !SetHandleInformation(hstdin_wr, HANDLE_FLAG_INHERIT, 0)) {
+    goto out;
+  }
+
+  CloseHandle(hstdin_wr);
+  hstdin_wr = NULL;
+
+  si.cb = sizeof(si);
+  si.dwFlags = STARTF_USESTDHANDLES;
+  si.hStdError = hstdout_wr;
+  si.hStdOutput = hstdout_wr;
+  si.hStdInput = hstdin_rd;
+
+  RINFO("Run: %s", cmd.c_str());
+  if (CreateProcessA(NULL, &command[0], NULL, NULL, TRUE, CREATE_NO_WINDOW, NULL, NULL, &si, &pi)) {
+    DWORD exit_code = -1;
+    char buf[1024];
+    DWORD bufend = 0, bufstart = 0;
+
+    CloseHandle(hstdout_wr);
+    hstdout_wr = NULL;
+
+    for (;;) {
+      DWORD bytes_read = 0;
+      bool foundeof = (!ReadFile(hstdout_rd, buf + bufend, sizeof(buf) - bufend, &bytes_read, NULL) || bytes_read == 0);
+      bufend += bytes_read;
+      for (;;) {
+        char *nl = (char*)memchr(buf + bufstart, '\n', bufend - bufstart);
+        if (!nl)
+          break;
+        char *st = buf + bufstart;
+        char *nl2 = nl;
+        if (nl != buf + bufstart && nl[-1] == '\r')
+          nl--;
+        bufstart = nl2 - buf + 1;
+        RINFO("%.*s", (int)(nl - st), st);
+      }
+      if (bufend - bufstart == sizeof(buf) || foundeof) {
+        if (bufend - bufstart)
+          RINFO("%.*s", (int)(bufend - bufstart), buf + bufstart);
+        bufstart = bufend = 0;
+      }
+      if (foundeof)
+        break;
+      if (bufstart) {
+        bufend -= bufstart;
+        memmove(buf, buf + bufstart, bufend);
+        bufstart = 0;
+      }
+    }
+    WaitForSingleObject(pi.hProcess, INFINITE);
+    GetExitCodeProcess(pi.hProcess, &exit_code);
+    CloseHandle(pi.hThread);
+    CloseHandle(pi.hProcess);
+    if (exit_code != 0) {
+      RERROR("Command line failed (%d) : %s", exit_code, cmd.c_str());
+    } else {
+      result = true;
+    }
+  } else {
+    RERROR("CreateProcess failed: %s", cmd.c_str());
+  }
+out:
+  CloseHandle(hstdout_rd);
+  CloseHandle(hstdout_wr);
+  CloseHandle(hstdin_rd);
+  CloseHandle(hstdin_wr);
+  return result;
+}
+
+bool TunWin32Adapter::RunPrePostCommand(const std::vector<std::string> &vec) {
+  bool success = true;
+  for (auto it = vec.begin(); it != vec.end(); ++it) {
+    if (!g_allow_pre_post) {
+      RERROR("Pre/Post commands are disabled. Ignoring: %s", it->c_str());
+    } else {
+      success &= RunOneCommand(*it);
+    }
+  }
+  return success;
+}
+
+//////////////////////////////////////////////////////////////////////////////
+
+TunWin32Iocp::TunWin32Iocp() {
+  wqueue_end_ = &wqueue_;
+  wqueue_ = NULL;
+
+  thread_ = NULL;
+  completion_port_handle_ = NULL;
+  packet_handler_ = NULL;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  exit_thread_ = false;
+}
+
+TunWin32Iocp::~TunWin32Iocp() {
+  //assert(num_reads_ == 0 && num_writes_ == 0);
+  assert(thread_ == NULL);
+  CloseTun();
+  DeleteCriticalSection(&mutex_);
+}
+
+bool TunWin32Iocp::Initialize(const TunConfig &&config, TunConfigOut *out) {
+  CloseTun();
+
+  if (!adapter_.OpenAdapter(&exit_thread_, FILE_FLAG_OVERLAPPED))
+    return false;
+
+  completion_port_handle_ = CreateIoCompletionPort(adapter_.handle(), NULL, NULL, 0);
+  if (completion_port_handle_ == NULL)
+    return false;
+
+  return adapter_.InitAdapter(std::move(config), out);
+}
+
+void TunWin32Iocp::CloseTun() {
+  assert(thread_ == NULL);
+
+  adapter_.CloseAdapter();
+
+  if (completion_port_handle_) {
+    CloseHandle(completion_port_handle_);
+    completion_port_handle_ = NULL;
+  }
+
+  FreePacketList(wqueue_);
+  wqueue_ = NULL;
+  wqueue_end_ = &wqueue_;
+}
+
+enum {
+  kTunGetQueuedCompletionStatusSize = kConcurrentWriteTap + kConcurrentReadTap + 1
+};
+
+void TunWin32Iocp::ThreadMain() {
+  OVERLAPPED_ENTRY entries[kTunGetQueuedCompletionStatusSize];
+  Packet *pending_writes = NULL;
+  int num_reads = 0, num_writes = 0;
+  Packet *finished_reads = NULL, **finished_reads_end = &finished_reads;
+  Packet *freed_packets = NULL, **freed_packets_end = &freed_packets;
+  int freed_packets_count = 0;
+  DWORD err;
+
+  while (!exit_thread_) {
+    // Initiate more reads, reusing the Packet structures in |finished_writes|.
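+    // Keep up to kConcurrentReadTap overlapped ReadFile calls outstanding on
+    // the TAP handle; Packet buffers from completed writes are recycled
+    // through |freed_packets| so the steady state needs no allocation.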
+ for (int i = num_reads; i < kConcurrentReadTap; i++) { + Packet *p; + if (!AllocPacketFrom(&freed_packets, &freed_packets_count, &exit_thread_, &p)) + break; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_TUN; + if (!ReadFile(adapter_.handle(), p->data, kPacketCapacity, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + FreePacket(p); + + RERROR("TunWin32: ReadFile failed 0x%X", err); + + if (err == ERROR_OPERATION_ABORTED) { + packet_handler_->AbortingDriver(); + RERROR("TAP driver stopped communicating. Attempting to restart.", err); + // This can happen if we reinstall the TAP driver while there's an active connection. Wait a bit, then attempt to + // restart. + Sleep(1000); + CallbackTriggerReconnect(); + goto EXIT; + } + } else { + num_reads++; + } + } + g_tun_reads = num_reads; + + assert(freed_packets_count >= 0); + if (freed_packets_count >= 32) { + FreePackets(freed_packets, freed_packets_end, freed_packets_count); + freed_packets_count = 0; + freed_packets_end = &freed_packets; + } else if (freed_packets == NULL) { + assert(freed_packets_count == 0); + freed_packets_end = &freed_packets; + } + + ULONG num_entries = 0; + if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, kTunGetQueuedCompletionStatusSize, &num_entries, INFINITE, FALSE)) { + RINFO("GetQueuedCompletionStatusEx failed."); + break; + } + finished_reads_end = &finished_reads; + int finished_reads_count = 0; + + // Go through the finished entries and determine which ones are reads, and which ones are writes. + for (ULONG i = 0; i < num_entries; i++) { + if (!entries[i].lpOverlapped) + continue; // This is the dummy entry from |PostQueuedCompletionStatus| + Packet *p = (Packet*)((byte*)entries[i].lpOverlapped - offsetof(Packet, overlapped)); + if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_TUN) { + num_reads--; + if ((int)p->overlapped.Internal != 0) { + RERROR("TunWin32::ReadComplete error 0x%X", (int)p->overlapped.Internal); + FreePacket(p); + continue; + } + p->size = (int)p->overlapped.InternalHigh; + + *finished_reads_end = p; + finished_reads_end = &p->next; + finished_reads_count++; + } else { + num_writes--; + if ((int)p->overlapped.Internal != 0) { + RERROR("TunWin32::WriteComplete error 0x%X", (int)p->overlapped.Internal); + FreePacket(p); + continue; + } + freed_packets_count++; + *freed_packets_end = p; + freed_packets_end = &p->next; + } + } + *finished_reads_end = NULL; + *freed_packets_end = NULL; + + if (finished_reads != NULL) + packet_handler_->Post(finished_reads, finished_reads_end, finished_reads_count); + + // Initiate more writes from |wqueue_| + while (num_writes < kConcurrentWriteTap) { + // Refill from queue if empty, avoid taking the mutex if it looks empty + if (!pending_writes) { + if (!wqueue_) + break; + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + if (!pending_writes) + break; + } + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->post_target = ThreadedPacketQueue::TARGET_TUN_DEVICE; + if (!WriteFile(adapter_.handle(), p->data, p->size, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + RERROR("TunWin32: WriteFile failed 0x%X", err); + FreePacket(p); + } else { + num_writes++; + } + } + g_tun_writes = num_writes; + } + +EXIT: + // Cancel all IO and wait for all completions + 
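+  // CancelIo() forces the remaining overlapped operations to complete with an
+  // error, but the kernel owns each OVERLAPPED until its completion has been
+  // dequeued, so every in-flight Packet is collected from the port before
+  // being freed and letting the thread exit.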
CancelIo(adapter_.handle());
+  while (num_reads + num_writes) {
+    ULONG num_entries = 0;
+    if (!GetQueuedCompletionStatusEx(completion_port_handle_, entries, 1, &num_entries, INFINITE, FALSE)) {
+      RINFO("GetQueuedCompletionStatusEx failed.");
+      break;
+    }
+    if (!entries[0].lpOverlapped)
+      continue;  // This is the dummy entry from |PostQueuedCompletionStatus|
+    Packet *p = (Packet*)((byte*)entries[0].lpOverlapped - offsetof(Packet, overlapped));
+    if (p->post_target == ThreadedPacketQueue::TARGET_PROCESSOR_TUN) {
+      num_reads--;
+    } else {
+      num_writes--;
+    }
+    FreePacket(p);
+  }
+
+  FreePacketList(freed_packets);
+  FreePacketList(pending_writes);
+}
+
+DWORD WINAPI TunWin32Iocp::TunThread(void *x) {
+  TunWin32Iocp *xx = (TunWin32Iocp *)x;
+  xx->ThreadMain();
+  return 0;
+}
+
+void TunWin32Iocp::StartThread() {
+  DWORD thread_id;
+  thread_ = CreateThread(NULL, 0, &TunThread, this, 0, &thread_id);
+  SetThreadPriority(thread_, THREAD_PRIORITY_ABOVE_NORMAL);
+}
+
+void TunWin32Iocp::StopThread() {
+  exit_thread_ = true;
+  PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  WaitForSingleObject(thread_, INFINITE);
+  CloseHandle(thread_);
+  thread_ = NULL;
+}
+
+void TunWin32Iocp::WriteTunPacket(Packet *packet) {
+  packet->next = NULL;
+  EnterCriticalSection(&mutex_);
+  Packet *was_empty = wqueue_;
+  *wqueue_end_ = packet;
+  wqueue_end_ = &packet->next;
+  LeaveCriticalSection(&mutex_);
+  if (was_empty == NULL) {
+    // Notify the worker thread that it should attempt more writes
+    PostQueuedCompletionStatus(completion_port_handle_, NULL, NULL, NULL);
+  }
+}
+
+//////////////////////////////////////////////////////////////////////////////
+
+TunWin32Overlapped::TunWin32Overlapped() {
+  wqueue_end_ = &wqueue_;
+  wqueue_ = NULL;
+
+  thread_ = NULL;
+
+  read_event_ = CreateEvent(NULL, TRUE, FALSE, NULL);
+  write_event_ = CreateEvent(NULL, TRUE, FALSE, NULL);
+  wake_event_ = CreateEvent(NULL, FALSE, FALSE, NULL);
+
+  packet_handler_ = NULL;
+  InitializeCriticalSectionAndSpinCount(&mutex_, 1024);
+  exit_thread_ = false;
+}
+
+TunWin32Overlapped::~TunWin32Overlapped() {
+  CloseTun();
+  DeleteCriticalSection(&mutex_);
+  CloseHandle(read_event_);
+  CloseHandle(write_event_);
+  CloseHandle(wake_event_);
+}
+
+bool TunWin32Overlapped::Initialize(const TunConfig &&config, TunConfigOut *out) {
+  CloseTun();
+  return adapter_.OpenAdapter(&exit_thread_, FILE_FLAG_OVERLAPPED) &&
+         adapter_.InitAdapter(std::move(config), out);
+}
+
+void TunWin32Overlapped::CloseTun() {
+  assert(thread_ == NULL);
+  adapter_.CloseAdapter();
+  FreePacketList(wqueue_);
+  wqueue_ = NULL;
+  wqueue_end_ = &wqueue_;
+}
+
+void TunWin32Overlapped::ThreadMain() {
+  Packet *pending_writes = NULL;
+  DWORD err;
+  Packet *read_packet = NULL, *write_packet = NULL;
+
+  HANDLE h[3];
+  while (!exit_thread_) {
+    if (read_packet == NULL) {
+      Packet *p = AllocPacket();
+      memset(&p->overlapped, 0, sizeof(p->overlapped));
+      p->overlapped.hEvent = read_event_;
+      p->post_target = ThreadedPacketQueue::TARGET_PROCESSOR_TUN;
+      if (!ReadFile(adapter_.handle(), p->data, kPacketCapacity, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) {
+        FreePacket(p);
+        RERROR("TunWin32: ReadFile failed 0x%X", err);
+      } else {
+        read_packet = p;
+      }
+    }
+
+    int n = 0;
+    if (write_packet)
+      h[n++] = write_event_;
+    if (read_packet != NULL)
+      h[n++] = read_event_;
+    h[n++] = wake_event_;
+
+    DWORD res = WaitForMultipleObjects(n, h, FALSE, INFINITE);
+
+    if (res >= WAIT_OBJECT_0 && res <= WAIT_OBJECT_0 + 2) {
+      HANDLE hx = h[res - WAIT_OBJECT_0];
WAIT_OBJECT_0]; + if (hx == read_event_) { + read_packet->size = (int)read_packet->overlapped.InternalHigh; + read_packet->next = NULL; + packet_handler_->Post(read_packet, &read_packet->next, 1); + read_packet = NULL; + } else if (hx == write_event_) { + FreePacket(write_packet); + write_packet = NULL; + } + } else { + RERROR("Wait said %d", res); + } + + if (write_packet == NULL) { + if (!pending_writes) { + EnterCriticalSection(&mutex_); + pending_writes = wqueue_; + wqueue_end_ = &wqueue_; + wqueue_ = NULL; + LeaveCriticalSection(&mutex_); + } + if (pending_writes) { + // Then issue writes + Packet *p = pending_writes; + pending_writes = p->next; + memset(&p->overlapped, 0, sizeof(p->overlapped)); + p->overlapped.hEvent = write_event_; + p->post_target = ThreadedPacketQueue::TARGET_TUN_DEVICE; + if (!WriteFile(adapter_.handle(), p->data, p->size, NULL, &p->overlapped) && (err = GetLastError()) != ERROR_IO_PENDING) { + RERROR("TunWin32: WriteFile failed 0x%X", err); + FreePacket(p); + } else { + write_packet = p; + } + } + } + } + + // TODO: Free memory + CancelIo(adapter_.handle()); + FreePacketList(pending_writes); +} + +DWORD WINAPI TunWin32Overlapped::TunThread(void *x) { + TunWin32Overlapped *xx = (TunWin32Overlapped *)x; + xx->ThreadMain(); + return 0; +} + +void TunWin32Overlapped::StartThread() { + DWORD thread_id; + thread_ = CreateThread(NULL, 0, &TunThread, this, 0, &thread_id); + SetThreadPriority(thread_, ABOVE_NORMAL_PRIORITY_CLASS); +} + +void TunWin32Overlapped::StopThread() { + exit_thread_ = true; + SetEvent(wake_event_); + WaitForSingleObject(thread_, INFINITE); + CloseHandle(thread_); + thread_ = NULL; +} + +void TunWin32Overlapped::WriteTunPacket(Packet *packet) { + packet->next = NULL; + EnterCriticalSection(&mutex_); + Packet *was_empty = wqueue_; + *wqueue_end_ = packet; + wqueue_end_ = &packet->next; + LeaveCriticalSection(&mutex_); + if (was_empty == NULL) + SetEvent(wake_event_); +} + + + + + +DWORD WINAPI TunsafeBackendWin32::WorkerThread(void *bk) { + TunsafeBackendWin32 *backend = (TunsafeBackendWin32*)bk; + + TunWin32Iocp tun; + UdpSocketWin32 udp; + WireguardProcessor wg_proc(&udp, &tun, backend->procdel_); + + ThreadedPacketQueue queues_for_processor(&wg_proc, &backend->stats_); + + qs.udp_qsize1 = qs.udp_qsize2 = 0; + + udp.SetPacketHandler(&queues_for_processor); + tun.SetPacketHandler(&queues_for_processor); + + if (!ParseWireGuardConfigFile(&wg_proc, backend->config_file_, &backend->exit_flag_)) + goto getout; + + if (!wg_proc.Start()) + goto getout; + + queues_for_processor.Start(); + udp.StartThread(); + tun.StartThread(); + + CallbackSetPublicKey(wg_proc.dev().public_key()); + + while (!backend->exit_flag_) { + SleepEx(INFINITE, TRUE); + } + + udp.StopThread(); + tun.StopThread(); + queues_for_processor.Stop(); + + FreeAllPackets(); +getout: + return 0; +} + +static void WINAPI ExitServiceAPC(ULONG_PTR a) { + *(bool*)a = true; +} + +TunsafeBackendWin32::TunsafeBackendWin32() { + memset(&stats_, 0, sizeof(stats_)); + InitPacketMutexes(); + InitializeCriticalSectionAndSpinCount(&stats_.mutex, 1024); + worker_thread_ = NULL; +} + +TunsafeBackendWin32::~TunsafeBackendWin32() { + DeleteCriticalSection(&stats_.mutex); +} + +ProcessorStats TunsafeBackendWin32::GetStats() { + EnterCriticalSection(&stats_.mutex); + ProcessorStats stats = stats_.packet_stats; + LeaveCriticalSection(&stats_.mutex); + return stats; +} + +void TunsafeBackendWin32::Start(ProcessorDelegate *procdel, const char *config_file) { + Stop(); + procdel_ = procdel; + exit_flag_ = false; 
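+  // Editor's note (sketch, not in the original source): Stop(), defined below,
+  // tears the worker down with a user-mode APC rather than an event. The worker
+  // parks in an alertable SleepEx, so the APC runs ExitServiceAPC on the
+  // worker's own stack and flips |exit_flag_| before the wait returns:
+  //
+  //   QueueUserAPC(&ExitServiceAPC, worker_thread_, (ULONG_PTR)&exit_flag_);
+  //   // worker side: while (!exit_flag_) SleepEx(INFINITE, TRUE);  // alertable
+  //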
+ DWORD thread_id; + config_file_ = _strdup(config_file); + worker_thread_ = CreateThread(NULL, 0, &WorkerThread, this, 0, &thread_id); + SetThreadPriority(worker_thread_, THREAD_PRIORITY_ABOVE_NORMAL); +} + +void TunsafeBackendWin32::Stop() { + if (worker_thread_) { + QueueUserAPC(&ExitServiceAPC, worker_thread_, (ULONG_PTR)&exit_flag_); + WaitForSingleObject(worker_thread_, INFINITE); + CloseHandle(worker_thread_); + worker_thread_ = NULL; + free(config_file_); + config_file_ = NULL; + } +} + diff --git a/network_win32.h b/network_win32.h new file mode 100644 index 0000000..a67f226 --- /dev/null +++ b/network_win32.h @@ -0,0 +1,179 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once + +#include "stdafx.h" +#include "tunsafe_types.h" +#include "netapi.h" +#include "network_win32_api.h" + +struct Packet; +class WireguardProcessor; + + +class ThreadedPacketQueue { +public: + explicit ThreadedPacketQueue(WireguardProcessor *wg, NetworkStats *stats); + ~ThreadedPacketQueue(); + + enum { + TARGET_PROCESSOR_UDP = 0, + TARGET_PROCESSOR_TUN = 1, + TARGET_UDP_DEVICE = 2, + TARGET_TUN_DEVICE = 3, + }; + + void Start(); + void Stop(); + + void Post(Packet *packet, Packet **end, int count); + void AbortingDriver(); + +private: + void PostTimerInterrupt(); + static void CALLBACK TimerRoutine(LPVOID lpArgToCompletionRoutine, DWORD dwTimerLowValue, DWORD dwTimerHighValue); + + DWORD ThreadMain(); + static DWORD WINAPI ThreadedPacketQueueLauncher(VOID *x); + Packet *first_; + Packet **last_ptr_; + uint32 packets_in_queue_; + uint32 need_notify_; + CRITICAL_SECTION mutex_; + HANDLE event_; + + HANDLE timer_handle_; + HANDLE handle_; + WireguardProcessor *wg_; + bool exit_flag_; + bool timer_interrupt_; + NetworkStats *stats_; +}; + +// Encapsulates a UDP socket, optionally listening for incoming packets +// on a specific port. +class UdpSocketWin32 : public UdpInterface { +public: + explicit UdpSocketWin32(); + ~UdpSocketWin32(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from UdpInterface + virtual bool Initialize(int listen_on_port) override; + virtual void WriteUdpPacket(Packet *packet) override; + +private: + + void ThreadMain(); + static DWORD WINAPI UdpThread(void *x); + + // All packets queued for writing. 
Locked by |mutex_| + Packet *wqueue_, **wqueue_end_; + + CRITICAL_SECTION mutex_; + + ThreadedPacketQueue *packet_handler_; + SOCKET socket_; + SOCKET socket_ipv6_; + HANDLE completion_port_handle_; + HANDLE thread_; + + bool exit_thread_; +}; + +class TunWin32Adapter { +public: + TunWin32Adapter(); + ~TunWin32Adapter(); + + bool OpenAdapter(bool *exit_thread, DWORD open_flags); + bool InitAdapter(const TunInterface::TunConfig &&config, TunInterface::TunConfigOut *out); + void CloseAdapter(); + + HANDLE handle() { return handle_; } + +private: + bool RunPrePostCommand(const std::vector &vec); + + HANDLE handle_; + HANDLE current_dns_block_; + + std::vector routes_to_undo_; + uint8 mac_adress_[6]; + int mtu_; + char guid_[64]; + + std::vector pre_down_, post_down_; +}; + +// Implementation of TUN interface handling using IO Completion Ports +class TunWin32Iocp : public TunInterface { +public: + explicit TunWin32Iocp(); + ~TunWin32Iocp(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + +private: + void CloseTun(); + void ThreadMain(); + static DWORD WINAPI TunThread(void *x); + + ThreadedPacketQueue *packet_handler_; + HANDLE completion_port_handle_; + HANDLE thread_; + + CRITICAL_SECTION mutex_; + + bool exit_thread_; + + // All packets queued for writing + Packet *wqueue_, **wqueue_end_; + + TunWin32Adapter adapter_; +}; + +// Implementation of TUN interface handling using Overlapped IO +class TunWin32Overlapped : public TunInterface { +public: + explicit TunWin32Overlapped(); + ~TunWin32Overlapped(); + + void SetPacketHandler(ThreadedPacketQueue *packet_handler) { packet_handler_ = packet_handler; } + + void StartThread(); + void StopThread(); + + // -- from TunInterface + virtual bool Initialize(const TunConfig &&config, TunConfigOut *out) override; + virtual void WriteTunPacket(Packet *packet) override; + +private: + void CloseTun(); + void ThreadMain(); + static DWORD WINAPI TunThread(void *x); + + ThreadedPacketQueue *packet_handler_; + HANDLE thread_; + + CRITICAL_SECTION mutex_; + + HANDLE read_event_, write_event_, wake_event_; + + bool exit_thread_; + + Packet *wqueue_, **wqueue_end_; + + TunWin32Adapter adapter_; +}; diff --git a/network_win32_api.h b/network_win32_api.h new file mode 100644 index 0000000..dac9856 --- /dev/null +++ b/network_win32_api.h @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
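+//
+// Editor's usage sketch (assumption: this mirrors how tunsafe_win32.cpp drives
+// the backend; it is not part of the original header):
+//
+//   TunsafeBackendWin32 *backend = new TunsafeBackendWin32();
+//   backend->Start(&my_procdel, "TunSafe.conf");  // spawns the worker thread
+//   ProcessorStats stats = backend->GetStats();   // mutex-protected snapshot
+//   backend->Stop();                              // APC-wakes and joins the worker
+//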
+#pragma once
+
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "wireguard.h"
+
+struct NetworkStats {
+  bool reset_stats;
+  CRITICAL_SECTION mutex;
+  ProcessorStats packet_stats;
+};
+
+class TunsafeBackendWin32 {
+public:
+  TunsafeBackendWin32();
+  ~TunsafeBackendWin32();
+
+  void Start(ProcessorDelegate *procdel, const char *config_file);
+  void Stop();
+
+  ProcessorStats GetStats();
+  void ResetStats() { stats_.reset_stats = true; }
+
+  bool is_started() const { return worker_thread_ != NULL; }
+
+private:
+  static DWORD WINAPI WorkerThread(void *x);
+
+  NetworkStats stats_;
+  HANDLE worker_thread_;
+  bool exit_flag_;
+
+  ProcessorDelegate *procdel_;
+  char *config_file_;
+};
+
+InternetBlockState GetInternetBlockState(bool *is_activated);
+
+// Changes the internet block state. The caller decides whether a reconnect is
+// needed (see the WM_COMMAND handler in tunsafe_win32.cpp, which restarts the
+// service when blocking is tightened while connected).
+void SetInternetBlockState(InternetBlockState s);
+
+extern int tpq_last_qsize;
+extern int g_tun_reads, g_tun_writes;
diff --git a/network_win32_dnsblock.cpp b/network_win32_dnsblock.cpp
new file mode 100644
index 0000000..e17f09a
--- /dev/null
+++ b/network_win32_dnsblock.cpp
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "network_win32_dnsblock.h"
+#include
+#include
+
+#pragma comment (lib, "Fwpuclnt.lib")
+
+static const GUID TUNSAFE_DNS_SUBLAYER = {0x1ce6cce2, 0xcc8f, 0x4175, {0xac, 0x7b, 0x95, 0xfd, 0xe8, 0x95, 0x80, 0x92}};
+static const GUID TUNSAFE_GLOBAL_BLOCK_SUBLAYER = {0x1ce6cce2, 0xcc8f, 0x4175, {0xac, 0x7b, 0x95, 0xfd, 0xe8, 0x95, 0x80, 0x93}};
+
+static bool GetFwpmAppIdFromCurrentProcess(FWP_BYTE_BLOB **appid) {
+  wchar_t module_filename[MAX_PATH];
+  DWORD err = GetModuleFileNameW(NULL, module_filename, ARRAYSIZE(module_filename));
+  if (err == 0 || err == ARRAYSIZE(module_filename))
+    return false;
+  err = FwpmGetAppIdFromFileName0(module_filename, appid);
+  if (err != 0)
+    return false;
+  return true;
+}
+
+static uint8 internet_fw_blocking_state;
+
+static inline bool FwpmFilterAddCheckedAleConnect(HANDLE handle, FWPM_FILTER0 *filter, bool also_ipv6, int idx) {
+  DWORD err;
+  UINT64 dummy;
+
+  filter->layerKey = FWPM_LAYER_ALE_AUTH_CONNECT_V4;
+  err = FwpmFilterAdd0(handle, filter, NULL, &dummy);
+  if (err != 0) {
+    RERROR("FwpmFilterAdd0 #%d failed (%s): %d", idx, "ipv4", err);
+    return false;
+  }
+
+  if (also_ipv6) {
+    filter->layerKey = FWPM_LAYER_ALE_AUTH_CONNECT_V6;
+    err = FwpmFilterAdd0(handle, filter, NULL, &dummy);
+    if (err != 0) {
+      RERROR("FwpmFilterAdd0 #%d failed (%s): %d", idx, "ipv6", err);
+      return false;
+    }
+  }
+
+  return true;
+}
+
+HANDLE BlockDnsExceptOnAdapter(const NET_LUID &luid, bool also_ipv6) {
+  FWPM_SUBLAYER0 *sublayer = NULL;
+  FWP_BYTE_BLOB *fwp_appid = NULL;
+
+  FWPM_FILTER0 filter;
+  FWPM_FILTER_CONDITION0 filter_condition[2];
+  DWORD err;
+  HANDLE handle = NULL;
+
+  {
+    FWPM_SESSION0 session = {0};
+    session.flags = FWPM_SESSION_FLAG_DYNAMIC;
+    err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle);
+    if (err != 0) {
+      RERROR("FwpmEngineOpen0 failed: %d", err);
+      goto getout;
+    }
+  }
+
+  {
+    FWPM_SUBLAYER0 sublayer = {0};
+    sublayer.subLayerKey = TUNSAFE_DNS_SUBLAYER;
+    sublayer.displayData.name = L"TunSafe";
+    sublayer.weight = 0x100;
+    err = FwpmSubLayerAdd0(handle, &sublayer, NULL);
+    if (err != 0) {
+      RERROR("FwpmSubLayerAdd0 failed: %d", err);
+      goto getout;
+    }
+  }
+
+  if (!GetFwpmAppIdFromCurrentProcess(&fwp_appid)) {
+    RERROR("GetFwpmAppIdFromCurrentProcess failed");
+    goto getout;
+ } + + // Allow all queries to port 53 from our process + memset(&filter, 0, sizeof(filter)); + filter_condition[0].fieldKey = FWPM_CONDITION_IP_REMOTE_PORT; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter_condition[0].conditionValue.type = FWP_UINT16; + filter_condition[0].conditionValue.uint16 = 53; + filter_condition[1].fieldKey = FWPM_CONDITION_ALE_APP_ID; + filter_condition[1].matchType = FWP_MATCH_EQUAL; + filter_condition[1].conditionValue.type = FWP_BYTE_BLOB_TYPE; + filter_condition[1].conditionValue.byteBlob = fwp_appid; + filter.filterCondition = filter_condition; + filter.numFilterConditions = 2; + filter.subLayerKey = TUNSAFE_DNS_SUBLAYER; + filter.displayData.name = L"TunSafe"; + filter.weight.type = FWP_UINT8; + filter.weight.uint8 = 15; + filter.action.type = FWP_ACTION_PERMIT; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 1)) + goto getout; + + // Allow DNS queries from TAP + filter_condition[1].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[1].conditionValue.type = FWP_UINT64; + filter_condition[1].conditionValue.uint64 = (uint64*)&luid.Value; + filter.weight.uint8 = 14; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + + // Block all IPv4 and IPv6 + filter.numFilterConditions = 1; + filter.weight.type = FWP_EMPTY; + filter.action.type = FWP_ACTION_BLOCK; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 3)) + goto getout; + + goto success; +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } +success: + if (fwp_appid) + FwpmFreeMemory0((void **)&fwp_appid); + return handle; +} + +void RestoreDnsExceptOnAdapter(HANDLE h) { + if (h) + FwpmEngineClose0(h); +} + + +static bool RemovePersistentInternetBlockingInner(HANDLE handle) { + FWPM_FILTER_ENUM_TEMPLATE0 enum_template = {0}; + HANDLE enum_handle = NULL; + DWORD err; + UINT32 num_returned; + FWPM_FILTER0 **filter = NULL; + + for (int iptype = 0; iptype < 2; iptype++) { + enum_template.layerKey = iptype == 0 ? 
FWPM_LAYER_ALE_AUTH_CONNECT_V4 : FWPM_LAYER_ALE_AUTH_CONNECT_V6; + enum_template.actionMask = 0xffffffff; + + err = FwpmFilterCreateEnumHandle0(handle, &enum_template, &enum_handle); + if (err != 0) { + RERROR("FwpmFilterCreateEnumHandle0 failed: %d", err); + goto getout; + } + + do { + err = FwpmFilterEnum0(handle, enum_handle, 256, &filter, &num_returned); + if (err != 0) { + RERROR("FwpmFilterEnum0 failed: %d", err); + goto getout; + } + for (UINT32 i = 0; i < num_returned; i++) { + FWPM_FILTER0 *cur_filter = filter[i]; + if (memcmp(&cur_filter->subLayerKey, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, sizeof(GUID)) == 0) { + err = FwpmFilterDeleteById0(handle, cur_filter->filterId); + if (err != 0) + RERROR("FwpmFilterDeleteById0 failed: %d", err); + } + } + FwpmFreeMemory0((void**)&filter); + } while (num_returned == 256); + + FwpmFilterDestroyEnumHandle0(handle, enum_handle); + enum_handle = NULL; + } + + err = FwpmSubLayerDeleteByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER); + if (err != 0 && err != FWP_E_SUBLAYER_NOT_FOUND) { + RERROR("FwpmSubLayerDeleteByKey0 failed: %d", err); + goto getout; + } + + internet_fw_blocking_state = IBS_INACTIVE; + +getout: + if (enum_handle != NULL) { + FwpmFilterDestroyEnumHandle0(handle, enum_handle); + } + return false; +} + +bool AddPersistentInternetBlocking(const NET_LUID *default_interface, const NET_LUID &luid_to_allow, bool also_ipv6) { + FWPM_SUBLAYER0 *sublayer_p = NULL; + FWP_BYTE_BLOB *fwp_appid = NULL; + FWPM_FILTER0 filter; + FWPM_FILTER_CONDITION0 filter_condition[3]; + DWORD err; + HANDLE handle = NULL; + bool success = false; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer already exists + FwpmFreeMemory0((void **)&sublayer_p); + } else { + // Add new sublayer + FWPM_SUBLAYER0 sublayer = {0}; + sublayer.subLayerKey = TUNSAFE_GLOBAL_BLOCK_SUBLAYER; + sublayer.displayData.name = L"TunSafe Global Block"; + sublayer.weight = 0x101; + err = FwpmSubLayerAdd0(handle, &sublayer, NULL); + if (err != 0) { + RERROR("FwpmSubLayerAdd0 failed: %d", err); + goto getout; + } + } + + if (!GetFwpmAppIdFromCurrentProcess(&fwp_appid)) { + RERROR("GetFwpmAppIdFromCurrentProcess failed"); + goto getout; + } + + // Allow all outgoing queries from our process + memset(&filter, 0, sizeof(filter)); + filter_condition[0].fieldKey = FWPM_CONDITION_ALE_APP_ID; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter_condition[0].conditionValue.type = FWP_BYTE_BLOB_TYPE; + filter_condition[0].conditionValue.byteBlob = fwp_appid; + filter.numFilterConditions = 1; + filter.filterCondition = filter_condition; + filter.subLayerKey = TUNSAFE_GLOBAL_BLOCK_SUBLAYER; + filter.displayData.name = L"TunSafe Global Block"; + filter.weight.type = FWP_UINT8; + filter.weight.uint8 = 15; + filter.action.type = FWP_ACTION_PERMIT; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 1)) + goto getout; + + // Permit all queries going out on TUN + filter_condition[0].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[0].conditionValue.type = FWP_UINT64; + filter_condition[0].conditionValue.uint64 = (uint64*)&luid_to_allow.Value; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 14; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + // Permit 
everything that's loopback + filter_condition[0].fieldKey = FWPM_CONDITION_INTERFACE_TYPE; + filter_condition[0].conditionValue.type = FWP_UINT32; + filter_condition[0].conditionValue.uint32 = 24; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 13; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + + // Permit all queries on the DHCP port (It uses 68 on the local side and 67 on the remote side) + if (default_interface) { + filter_condition[2].fieldKey = FWPM_CONDITION_IP_LOCAL_PORT; + filter_condition[2].matchType = FWP_MATCH_EQUAL; + filter_condition[2].conditionValue.type = FWP_UINT16; + filter_condition[2].conditionValue.uint16 = 68; + filter_condition[1].fieldKey = FWPM_CONDITION_IP_REMOTE_PORT; + filter_condition[1].matchType = FWP_MATCH_EQUAL; + filter_condition[1].conditionValue.type = FWP_UINT16; + filter_condition[1].conditionValue.uint16 = 67; + filter.numFilterConditions = 3; + filter_condition[0].fieldKey = FWPM_CONDITION_IP_LOCAL_INTERFACE; + filter_condition[0].conditionValue.type = FWP_UINT64; + filter_condition[0].conditionValue.uint64 = (uint64*)&default_interface->Value; + filter_condition[0].matchType = FWP_MATCH_EQUAL; + filter.weight.uint8 = 12; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 2)) + goto getout; + } + + // Block the rest + filter.numFilterConditions = 0; + filter.weight.type = FWP_EMPTY; + filter.action.type = FWP_ACTION_BLOCK; + if (!FwpmFilterAddCheckedAleConnect(handle, &filter, also_ipv6, 3)) + goto getout; + + success = true; + internet_fw_blocking_state = IBS_ACTIVE; + +getout: + if (handle != NULL) { + // delete the layer on failure + if (!success) + RemovePersistentInternetBlockingInner(handle); + FwpmEngineClose0(handle); + handle = NULL; + } + if (fwp_appid) + FwpmFreeMemory0((void **)&fwp_appid); + return success; +} + +static bool RemovePersistentInternetBlocking() { + DWORD err; + HANDLE handle = NULL; + FWPM_SUBLAYER0 *sublayer_p = NULL; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer exists + FwpmFreeMemory0((void **)&sublayer_p); + } else { + // Sublayer does not exist + internet_fw_blocking_state = IBS_INACTIVE; + goto getout; + } + + RemovePersistentInternetBlockingInner(handle); + +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } + return false; +} + +uint8 GetInternetFwBlockingState() { + if (internet_fw_blocking_state != 0) + return internet_fw_blocking_state; + + DWORD err; + HANDLE handle = NULL; + FWPM_SUBLAYER0 *sublayer_p = NULL; + bool result; + + { + FWPM_SESSION0 session = {0}; + err = FwpmEngineOpen0(NULL, RPC_C_AUTHN_WINNT, NULL, &session, &handle); + if (err != 0) { + RERROR("FwpmEngineOpen0 failed: %d", err); + goto getout; + } + } + + if (FwpmSubLayerGetByKey0(handle, &TUNSAFE_GLOBAL_BLOCK_SUBLAYER, &sublayer_p) == 0) { + // The sublayer already exists + FwpmFreeMemory0((void **)&sublayer_p); + result = true; + } else { + result = false; + } + +getout: + if (handle != NULL) { + FwpmEngineClose0(handle); + handle = NULL; + } + + return internet_fw_blocking_state = result + IBS_INACTIVE; +} + +void SetInternetFwBlockingState(bool want) { + uint8 old_state = GetInternetFwBlockingState(); + if ((old_state >= IBS_ACTIVE) != want) { + if (!want) { + 
RemovePersistentInternetBlocking(); + } else { + internet_fw_blocking_state = IBS_PENDING; + } + } +} + diff --git a/network_win32_dnsblock.h b/network_win32_dnsblock.h new file mode 100644 index 0000000..1da7e64 --- /dev/null +++ b/network_win32_dnsblock.h @@ -0,0 +1,20 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once + +HANDLE BlockDnsExceptOnAdapter(const NET_LUID &luid, bool also_ipv6 ); +void RestoreDnsExceptOnAdapter(HANDLE h); + +bool AddPersistentInternetBlocking(const NET_LUID *default_interface, const NET_LUID &luid_to_allow, bool also_ipv6); + + + +enum { + IBS_UNKOWN, + IBS_INACTIVE, + IBS_ACTIVE, + IBS_PENDING, +}; +void SetInternetFwBlockingState(bool want); +uint8 GetInternetFwBlockingState(); + diff --git a/readme_osx.txt b/readme_osx.txt new file mode 100644 index 0000000..e10d754 --- /dev/null +++ b/readme_osx.txt @@ -0,0 +1,19 @@ +WARNING: ALPHA SOFTWARE - USE AT YOUR OWN RISK + +License: https://tunsafe.com/downloads/LICENSE.TXT + +This is the experimental OSX version of TunSafe. + +It is single threaded, has no UI, does not support IPv6, +and does not support switching DNS. + +Still - it's roughly 2x as fast as OpenVPN. 260mbit vs 140mbit. + +It uses the built-in utun network adapter so you need a +reasonably new OSX version. + +Usage (from a Terminal): +sudo ./tunsafe Config.conf + +Press Ctrl-C to exit. + diff --git a/resource.h b/resource.h new file mode 100644 index 0000000..3c10a98 Binary files /dev/null and b/resource.h differ diff --git a/stdafx.cpp b/stdafx.cpp new file mode 100644 index 0000000..fd4f341 --- /dev/null +++ b/stdafx.cpp @@ -0,0 +1 @@ +#include "stdafx.h" diff --git a/stdafx.h b/stdafx.h new file mode 100644 index 0000000..bd6427f --- /dev/null +++ b/stdafx.h @@ -0,0 +1,33 @@ +// stdafx.h : include file for standard system include files, +// or project specific include files that are used frequently, but +// are changed infrequently +// + +#pragma once + +#define WINVER 0x0A00 +#define _WIN32_WINNT _WIN32_WINNT_VISTA +#define NTDDI_VERSION NTDDI_VISTA + +#include "build_config.h" + +#if defined(OS_WIN) +#define _WINSOCK_DEPRECATED_NO_WARNINGS 1 +//#include +#include + +#include +//#include +#include +#include +#include + + +#include +#else +#define override +#endif + +#include +#include + diff --git a/tunsafe_config.h b/tunsafe_config.h new file mode 100644 index 0000000..2f29472 --- /dev/null +++ b/tunsafe_config.h @@ -0,0 +1,9 @@ +#pragma once + +#define TUNSAFE_VERSION_STRING "TunSafe 1.3-rc3" + +#define WITH_HANDSHAKE_EXT 0 +#define WITH_SHORT_HEADERS 0 +#define WITH_HEADER_OBFUSCATION 0 +#define WITH_AVX512_OPTIMIZATIONS 0 +#define WITH_BENCHMARK 0 diff --git a/tunsafe_cpu.cpp b/tunsafe_cpu.cpp new file mode 100644 index 0000000..b1ee8cc --- /dev/null +++ b/tunsafe_cpu.cpp @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
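+//
+// Editor's note: runtime x86 feature detection via cpuid. A minimal sketch of
+// the intended call pattern (the AVX2 branch is hypothetical; WinMain in
+// tunsafe_win32.cpp only calls InitCpuFeatures() and, optionally,
+// PrintCpuFeatures()):
+//
+//   InitCpuFeatures();               // fills x86_pcap[] from cpuid leaves 1 and 7
+//   if (X86_PCAP_AVX2) {
+//     // select an AVX2 code path
+//   }
+//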
+#include "stdafx.h"
+#include "tunsafe_cpu.h"
+#include "tunsafe_types.h"
+
+#if defined(COMPILER_MSVC)
+#include <intrin.h>
+#endif
+
+#include <string.h>
+
+uint32 x86_pcap[3];
+
+#if !defined(COMPILER_MSVC)
+static inline void __cpuid(int info[4], int func) {
+  __asm__ __volatile__(
+    "cpuid"
+    : "=a"(info[0]), "=b"(info[1]), "=c"(info[2]), "=d"(info[3])
+    : "a"(func), "c"(0)
+  );
+}
+#endif
+
+void InitCpuFeatures() {
+  unsigned nIds, nExIds;
+
+  {
+    int info[4];
+    __cpuid(info, 0);
+    nIds = info[0];
+    __cpuid(info, 0x80000000);
+    nExIds = info[0];
+  }
+  if (nIds >= 0x00000001) {
+    int info[4];
+    __cpuid(info, 0x00000001);
+    x86_pcap[0] = info[3];
+    x86_pcap[1] = info[2];
+  }
+  if (nIds >= 0x00000007) {
+    int info[4];
+    __cpuid(info, 0x00000007);
+    x86_pcap[2] = info[1];
+  }
+}
+
+static char *strcpy_e(char *dst, char *end, const char *copy) {
+  size_t len = strlen(copy);
+  if (len >= (size_t)(end - dst)) return end;
+  memcpy(dst, copy, len + 1);
+  return dst + len;
+}
+
+void PrintCpuFeatures() {
+  char capbuf[2048], *end = capbuf + 2048, *s = capbuf;
+
+  if (X86_PCAP_AVX) s = strcpy_e(s, end, " avx");
+  if (X86_PCAP_SSSE3) s = strcpy_e(s, end, " ssse3");
+  if (X86_PCAP_AVX2) s = strcpy_e(s, end, " avx2");
+  if (X86_PCAP_MOVBE) s = strcpy_e(s, end, " movbe");
+  if (X86_PCAP_AES) s = strcpy_e(s, end, " aes");
+  if (X86_PCAP_PCLMULQDQ) s = strcpy_e(s, end, " pclmulqdq");
+  if (X86_PCAP_AVX512F) s = strcpy_e(s, end, " avx512f");
+  if (X86_PCAP_AVX512VL) s = strcpy_e(s, end, " avx512vl");
+
+  RINFO("Using:%s", capbuf);
+}
diff --git a/tunsafe_cpu.h b/tunsafe_cpu.h
new file mode 100644
index 0000000..de97b6c
--- /dev/null
+++ b/tunsafe_cpu.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
+#ifndef TUNSAFE_CPU_H_
+#define TUNSAFE_CPU_H_
+
+#include "tunsafe_types.h"
+
+extern uint32 x86_pcap[3];
+
+// cpuid 1, edx
+#define X86_PCAP_SSE (x86_pcap[0] & (1 << 25))
+#define X86_PCAP_SSE2 (x86_pcap[0] & (1 << 26))
+// cpuid 1, ecx
+#define X86_PCAP_SSE3 (x86_pcap[1] & (1 << 0))
+#define X86_PCAP_PCLMULQDQ (x86_pcap[1] & (1 << 1))
+#define X86_PCAP_SSSE3 (x86_pcap[1] & (1 << 9))
+#define X86_PCAP_MOVBE (x86_pcap[1] & (1 << 22))
+#define X86_PCAP_AES (x86_pcap[1] & (1 << 25))
+#define X86_PCAP_AVX (x86_pcap[1] & (1 << 28))
+// cpuid 7, ebx
+#define X86_PCAP_AVX2 (x86_pcap[2] & (1 << 5))
+#define X86_PCAP_AVX512F (x86_pcap[2] & (1 << 16))
+#define X86_PCAP_AVX512VL (x86_pcap[2] & (1 << 31))
+
+void InitCpuFeatures();
+void PrintCpuFeatures();
+
+#endif  // TUNSAFE_CPU_H_
\ No newline at end of file
diff --git a/tunsafe_endian.h b/tunsafe_endian.h
new file mode 100644
index 0000000..32bce5e
--- /dev/null
+++ b/tunsafe_endian.h
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
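+//
+// Editor's note: ToBE*/ToLE* convert scalar values; ReadBE*/WriteBE* load and
+// store through pointers. A small sketch, assuming a hypothetical 4-byte
+// big-endian length field at the start of |buf| (note the macros cast to
+// uint32*, so the pointer should be suitably aligned on strict platforms):
+//
+//   uint8 buf[4];
+//   WriteBE32(buf, 0x11223344);   // stores bytes 0x11 0x22 0x33 0x44
+//   uint32 v = ReadBE32(buf);     // v == 0x11223344 on any host endianness
+//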
+#ifndef TINYVPN_ENDIAN_H_ +#define TINYVPN_ENDIAN_H_ + +#include "build_config.h" +#include "tunsafe_types.h" +#if defined(OS_WIN) && defined(COMPILER_MSVC) +#include +#endif +#include + +#define ByteSwap32Fallback(x) ( \ + (((uint32)(x) & (uint32)0x000000fful) << 24) | \ + (((uint32)(x) & (uint32)0x0000ff00ul) << 8) | \ + (((uint32)(x) & (uint32)0x00ff0000ul) >> 8) | \ + (((uint32)(x) & (uint32)0xff000000ul) >> 24)) + +#define ByteSwap16Fallback(x) ((uint16)( \ + (((uint16)(x) & (uint16)0x00ffu) << 8) | \ + (((uint16)(x) & (uint16)0xff00u) >> 8))) + +#define ByteSwap64Fallback(x) ((uint64)ByteSwap32Fallback(x)<<32 | ByteSwap32Fallback(x>>32)) + +#define ReadBE32AlignedFallback(pt) (((uint32)((pt)[0] & 0xFF) << 24) ^ \ + ((uint32)((pt)[1] & 0xFF) << 16) ^ \ + ((uint32)((pt)[2] & 0xFF) << 8) ^ \ + ((uint32)((pt)[3] & 0xFF))) +#define WriteBE32AlignedFallback(ct, st) { \ + (ct)[0] = (char)((st) >> 24); \ + (ct)[1] = (char)((st) >> 16); \ + (ct)[2] = (char)((st) >> 8); \ + (ct)[3] = (char)(st); } + + + + +#if defined(OS_WIN) && defined(COMPILER_MSVC) +#define ByteSwap16(x) _byteswap_ushort((uint16)x) +#define ByteSwap32(x) _byteswap_ulong((uint32)x) +#define ByteSwap64(x) _byteswap_uint64((uint64)x) +#elif defined(COMPILER_GCC) +#define ByteSwap16(x) __builtin_bswap16((uint16)x) +#define ByteSwap32(x) __builtin_bswap32((uint32)x) +#define ByteSwap64(x) __builtin_bswap64((uint64)x) +#else +#define ByteSwap16 ByteSwap16Fallback +#define ByteSwap32 ByteSwap32Fallback +#define ByteSwap64 ByteSwap64Fallback +#endif + +#if defined(ARCH_CPU_LITTLE_ENDIAN) +#define ToBE64(x) ByteSwap64(x) +#define ToBE32(x) ByteSwap32(x) +#define ToBE16(x) ByteSwap16(x) +#define ToLE64(x) (x) +#define ToLE32(x) (x) +#define ToLE16(x) (x) +#else +#define ToBE64(x) (x) +#define ToBE32(x) (x) +#define ToBE16(x) (x) +#define ToLE64(x) ByteSwap64(x) +#define ToLE32(x) ByteSwap32(x) +#define ToLE16(x) ByteSwap16(x) +#endif + +#define ReadBE16Aligned(pt) ToBE16(*(uint16*)(pt)) +#define WriteBE16Aligned(ct, st) (*(uint16*)(ct) = ToBE16(st)) +#define ReadBE32Aligned(pt) ToBE32(*(uint32*)(pt)) +#define WriteBE32Aligned(ct, st) (*(uint32*)(ct) = ToBE32(st)) + +#define ReadBE16(pt) ToBE16(*(uint16*)(pt)) +#define WriteBE16(ct, st) (*(uint16*)(ct) = ToBE16(st)) +#define ReadBE32(pt) ToBE32(*(uint32*)(pt)) +#define WriteBE32(ct, st) (*(uint32*)(ct) = ToBE32(st)) +#define ReadBE64(pt) ToBE64(*(uint64*)(pt)) +#define WriteBE64(ct, st) (*(uint64*)(ct) = ToBE64(st)) + +#define ReadLE16(pt) ToLE16(*(uint16*)(pt)) +#define WriteLE16(ct, st) (*(uint16*)(ct) = ToLE16(st)) +#define ReadLE32(pt) ToLE32(*(uint32*)(pt)) +#define WriteLE32(ct, st) (*(uint32*)(ct) = ToLE32(st)) +#define ReadLE64(pt) ToLE64(*(uint64*)(pt)) +#define WriteLE64(ct, st) (*(uint64*)(ct) = ToLE64(st)) + +#define Read16(pt) (*(uint16*)(pt)) +#define Write16(ct, st) (*(uint16*)(ct) = (st)) +#define Read32(pt) (*(uint32*)(pt)) +#define Write32(ct, st) (*(uint32*)(ct) = (st)) +#define Read64(pt) (*(uint64*)(pt)) +#define Write64(ct, st) (*(uint64*)(ct) = (st)) + + +#endif // TINYVPN_ENDIAN_H_ diff --git a/tunsafe_types.h b/tunsafe_types.h new file mode 100644 index 0000000..9ddabab --- /dev/null +++ b/tunsafe_types.h @@ -0,0 +1,73 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
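+//
+// Editor's note: STATIC_ASSERT below is the classic pre-C++11 trick: it
+// declares a struct containing a bitfield of width !!(cond), which becomes an
+// ill-formed zero-width named bitfield exactly when |cond| is false. Usage
+// sketch (hypothetical assertion):
+//
+//   STATIC_ASSERT(sizeof(uint64) == 8, uint64_must_be_8_bytes);
+//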
+#ifndef TINYVPN_TYPES_H_ +#define TINYVPN_TYPES_H_ +#include + +#include "build_config.h" +#include "tunsafe_config.h" + + +typedef uint8_t byte; +typedef uint8_t uint8; +typedef uint16_t uint16; +typedef uint32_t uint32; +typedef uint64_t uint64; +typedef int64_t int64; +typedef int8_t int8; +typedef int16_t int16; +typedef int32_t int32; + +typedef unsigned int in_addr_t; + +#define CTASTR2(pre,post) pre ## post +#define CTASTR(pre,post) CTASTR2(pre,post) +#define STATIC_ASSERT(cond,msg) \ + typedef struct { int CTASTR(static_assertion_failed_,msg) : !!(cond); } \ + CTASTR(static_assertion_failed_x_,msg) + +#ifndef ARRAY_SIZE +#define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0])) +#endif + +void printhex(const char *name, const void *a, size_t l); + +#if defined(COMPILER_MSVC) +#define FORCEINLINE __forceinline +#define NOINLINE __declspec(noinline) +#define SAFEBUFFERS __declspec(safebuffers) +#define __aligned(x) __declspec(align(x)) +#define rol32 _rotl +#define rol64 _rotl64 +#elif defined(COMPILER_GCC) +#define FORCEINLINE inline __attribute__((always_inline)) +#define NOINLINE +#define SAFEBUFFERS +#define _stricmp strcasecmp +#define _strdup strdup +#define _cdecl +#define __aligned(x) __attribute__((__aligned__(x))) +#else +#define FORCEINLINE inline +#define NOINLINE +#define SAFEBUFFERS +#define __aligned(x) +#endif + +#define likely(x) (x) +#define unlikely(x) (x) + +#if !defined(COMPILER_MSVC) +static inline uint64 rol64(uint64 x, int8_t r) { + return (x << r) | (x >> (64 - r)); +} +static inline uint32 rol32(uint32 x, int8_t r) { + return (x << r) | (x >> (32 - r)); +} +#endif // !defined(COMPILER_MSVC) + +void RERROR(const char *msg, ...); +void RINFO(const char *msg, ...); + + +#endif // TINYVPN_TYPES_H_ diff --git a/tunsafe_win32.cpp b/tunsafe_win32.cpp new file mode 100644 index 0000000..846ce28 --- /dev/null +++ b/tunsafe_win32.cpp @@ -0,0 +1,1143 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
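+//
+// Editor's note: UI state is persisted under HKEY_CURRENT_USER\Software\TunSafe
+// through the small RegReadInt/RegWriteInt/RegReadStr/RegWriteStr helpers
+// defined below. Sketch of the pattern used throughout this file:
+//
+//   g_allow_pre_post = RegReadInt("AllowPrePost", 0) != 0;  // read with default
+//   RegWriteInt("IsConnected", 1);                          // persist a flag
+//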
+#include "stdafx.h" +#include "wireguard_config.h" +#include "network_win32_api.h" +#include "network_win32_dnsblock.h" +#include +#include +#include +#include +#include +#include "resource.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include "tunsafe_endian.h" +#include "util.h" +#include +#include +#include "crypto/curve25519-donna.h" + +#undef min +#pragma comment(lib, "iphlpapi.lib") +#pragma comment(lib, "rpcrt4.lib") +#pragma comment(lib,"comctl32.lib") +#pragma comment(linker,"/manifestdependency:\"type='win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'\"") + +void InitCpuFeatures(); +void PrintCpuFeatures(); +void Benchmark(); +static const char *GetCurrentConfigTitle(char *buf, size_t max_size); + +#pragma warning(disable: 4200) + +static void MyPostMessage(int msg, WPARAM wparam, LPARAM lparam); + +static HWND g_ui_window; +static in_addr_t g_ui_ip; +static HICON g_icons[2]; +static bool g_minimize_on_connect; + +static bool g_ui_visible; +static char *g_current_filename; +static HKEY g_reg_key; +static HINSTANCE g_hinstance; +static TunsafeBackendWin32 *g_backend; +static bool g_last_popup_is_tray; + +int RegReadInt(const char *key, int def) { + DWORD value = def, n = sizeof(value); + RegQueryValueEx(g_reg_key, key, NULL, NULL, (BYTE*)&value, &n); + return value; +} + +void RegWriteInt(const char *key, int value) { + RegSetValueEx(g_reg_key, key, NULL, REG_DWORD, (BYTE*)&value, sizeof(value)); +} + +char *RegReadStr(const char *key, const char *def) { + char buf[1024]; + DWORD n = sizeof(buf) - 1; + DWORD type = 0; + if (RegQueryValueEx(g_reg_key, key, NULL, &type, (BYTE*)buf, &n) != ERROR_SUCCESS || type != REG_SZ) + return def ? 
_strdup(def) : NULL; + if (n && buf[n - 1] == 0) + n--; + buf[n] = 0; + return _strdup(buf); +} + +void RegWriteStr(const char *key, const char *v) { + RegSetValueEx(g_reg_key, key, NULL, REG_SZ, (BYTE*)v, (DWORD)strlen(v) + 1); +} + +void str_set(char **x, const char *s) { + free(*x); + *x = _strdup(s); +} + +char *str_cat_alloc(const char *a, const char *b) { + size_t al = strlen(a); + size_t bl = strlen(b); + char *r = (char *)malloc(al + bl + 1); + memcpy(r, a, al); + r[al + bl] = 0; + memcpy(r + al, b, bl); + return r; +} + +static const char *FindLastFolderSep(const char *s) { + size_t len = strlen(s); + for (;;) { + if (len == 0) + return NULL; + len--; + if (s[len] == '\\' || s[len] == '/') + break; + } + return s + len; +} + + +static bool GetConfigFullName(const char *basename, char *fullname, size_t fullname_size) { + size_t len = strlen(basename); + + if (FindLastFolderSep(basename)) { + if (len >= fullname_size) + return false; + memcpy(fullname, basename, len + 1); + return true; + } + if (!GetModuleFileName(NULL, fullname, (DWORD)fullname_size)) + return false; + char *last = (char *)FindLastFolderSep(fullname); + if (!last || last + len + 8 >= fullname + fullname_size) + return false; + memcpy(last + 1, "Config\\", 7 * sizeof(last[0])); + memcpy(last + 8, basename, (len + 1) * sizeof(last[0])); + return true; +} + + +enum UpdateIconWhy { + UIW_NONE = 0, + UIW_STOPPED_WORKING_FAIL = 1, + UIW_STOPPED_WORKING_RETRY = 2, + UIW_EXITING = 3, +}; +static void UpdateIcon(UpdateIconWhy error); +static void UpdateButtons(); + + +void StopService(UpdateIconWhy error) { + if (g_backend->is_started()) { + g_backend->Stop(); + + g_ui_ip = 0; + + if (error != UIW_EXITING) { + UpdateIcon(error); + RINFO("Disconnecting"); + UpdateButtons(); + RegWriteInt("IsConnected", 0); + } + } +} + +const char *print_ip(char buf[kSizeOfAddress], in_addr_t ip) { + snprintf(buf, kSizeOfAddress, "%d.%d.%d.%d", (ip >> 24) & 0xff, (ip >> 16) & 0xff, (ip >> 8) & 0xff, (ip >> 0) & 0xff); + return buf; +} + +class MyProcessorDelegate : public ProcessorDelegate { +public: + virtual void OnConnected(in_addr_t my_ip) { + if (my_ip != g_ui_ip) { + + if (my_ip) { + char buf[kSizeOfAddress]; + print_ip(buf, my_ip); + RINFO("Connection established. 
IP %s", buf); + } + g_ui_ip = my_ip; + MyPostMessage(WM_USER + 2, 0, 0); + } + } + virtual void OnDisconnected() { + MyProcessorDelegate::OnConnected(0); + } +}; + +static MyProcessorDelegate my_procdel; + +void StartService(bool skip_clear = false) { + char buf[1024]; + if (!GetConfigFullName(g_current_filename, buf, ARRAYSIZE(buf))) + return; + + if (!g_backend->is_started()) { + if (!skip_clear) + PostMessage(g_ui_window, WM_USER + 6, NULL, NULL); + + g_backend->Start(&my_procdel, buf); + + UpdateButtons(); + RegWriteInt("IsConnected", 1); + } +} + +static bool g_has_icon; + +static char *PrintMB(char *buf, int64 bytes) { + char *bo = buf; + if (bytes < 0) { + *buf++ = '-'; + bytes = -bytes; + } + int64 big = bytes / (1024*1024); + int little = bytes % (1024*1024); + if (bytes < 10*1024*1024) { + // X.XXX + snprintf(buf, 64, "%lld.%.3d MB", big, 1000 * little / (1024*1024)); + } else if (bytes < 100*1024*1024) { + // XX.XX + snprintf(buf, 64, "%lld.%.2d MB", big, 100 * little / (1024*1024)); + } else { + // XX.X + snprintf(buf, 64, "%lld.%.1d MB", big, 10 * little / (1024*1024)); + } + return bo; +} + +static void UpdateStats() { + ProcessorStats stats = g_backend->GetStats(); + + char tmp[64], tmp2[64]; + char buf[512]; + snprintf(buf, 512, "%s received (%lld packets), %s sent (%lld packets)", + PrintMB(tmp, stats.udp_bytes_in), stats.udp_packets_in, + PrintMB(tmp2, stats.udp_bytes_out), stats.udp_packets_out/*, udp_qsize2 - udp_qsize1, g_tun_reads*/); + SetDlgItemText(g_ui_window, IDTXT_UDP, buf); + + snprintf(buf, 512, "%s received (%lld packets), %s sent (%lld packets)", + PrintMB(tmp, stats.tun_bytes_in), stats.tun_packets_in, + PrintMB(tmp2, stats.tun_bytes_out), stats.tun_packets_out/*, + tpq_last_qsize, g_tun_writes*/); + SetDlgItemText(g_ui_window, IDTXT_TUN, buf); + + char *d = buf; + if (stats.last_complete_handskake_timestamp) { + uint32 ago = (uint32)((OsGetMilliseconds() - stats.last_complete_handskake_timestamp) / 1000); + uint32 hours = ago / 3600; + uint32 minutes = (ago - hours * 3600) / 60; + uint32 seconds = (ago - hours * 3600 - minutes * 60); + + if (hours) + d += snprintf(d, 32, hours == 1 ? "%d hour, " : "%d hours, ", hours); + if (minutes) + d += snprintf(d, 32, minutes == 1 ? "%d minute, " : "%d minutes, ", minutes); + if (d == buf || seconds) + d += snprintf(d, 32, seconds == 1 ? "%d second, " : "%d seconds, ", seconds); + memcpy(d - 2, " ago", 5); + } else { + memcpy(buf, "(never)", 8); + } + SetDlgItemText(g_ui_window, IDTXT_HANDSHAKE, buf); +} + +void UpdatePublicKey(char *s) { + SetDlgItemText(g_ui_window, IDC_PUBLIC_KEY, s); + free(s); +} + +static void UpdateButtons() { + bool running = g_backend->is_started(); + SetDlgItemText(g_ui_window, ID_START, running ? "Re&connect" : "&Connect"); + EnableWindow(GetDlgItem(g_ui_window, ID_STOP), running); +} + +static void UpdateIcon(UpdateIconWhy why) { + in_addr_t ip = g_ui_ip; + NOTIFYICONDATA nid; + memset(&nid, 0, sizeof(nid)); + nid.cbSize = sizeof(nid); + nid.hWnd = g_ui_window; + nid.uID = 1; + nid.uVersion = NOTIFYICON_VERSION; + nid.uCallbackMessage = WM_USER + 1; + nid.uFlags = NIF_MESSAGE | NIF_TIP | NIF_ICON; + nid.hIcon = g_icons[ip ? 
0 : 1]; + + char buf[kSizeOfAddress]; + char namebuf[64]; + if (ip != 0) { + snprintf(nid.szTip, sizeof(nid.szTip), "TunSafe [%s - %s]", GetCurrentConfigTitle(namebuf, sizeof(namebuf)), print_ip(buf, ip)); + nid.uFlags |= NIF_INFO; + snprintf(nid.szInfoTitle, sizeof(nid.szInfoTitle), "Connected to: %s", namebuf); + snprintf(nid.szInfo, sizeof(nid.szInfo), "IP: %s", buf); + nid.uTimeout = 5000; + nid.dwInfoFlags = NIIF_INFO; + } else { + snprintf(nid.szTip, sizeof(nid.szTip), "TunSafe [%s]", "Disconnected"); + + if (why == UIW_STOPPED_WORKING_FAIL) { + nid.uFlags |= NIF_INFO; + strcpy(nid.szInfoTitle, "Disconnected!"); + strcpy(nid.szInfo, "There was a problem with the connection. You are now disconnected."); + nid.uTimeout = 5000; + nid.dwInfoFlags = NIIF_ERROR; + } + } + Shell_NotifyIcon(g_has_icon ? NIM_MODIFY : NIM_ADD, &nid); + + SendMessage(g_ui_window, WM_SETICON, ICON_SMALL, (LPARAM)g_icons[ip ? 0 : 1]); + + g_has_icon = true; +} + +static void RemoveIcon() { + if (g_has_icon) { + NOTIFYICONDATA nid; + memset(&nid, 0, sizeof(nid)); + nid.cbSize = sizeof(nid); + nid.hWnd = g_ui_window; + nid.uID = 1; + Shell_NotifyIcon(NIM_DELETE, &nid); + } +} + +#define MAX_CONFIG_FILES 100 +#define ID_POPUP_CONFIG_FILE 10000 +char *config_filenames[MAX_CONFIG_FILES]; + +static void RestartService(UpdateIconWhy why, bool only_if_active) { + if (!only_if_active || g_backend->is_started()) { + StopService(why); + StartService(why != UIW_NONE); + } +} + +static char *StripConfExtension(const char *src, char *target, size_t size) { + size_t len = strlen(src); + if (len >= 5 && memcmp(src + len - 5, ".conf", 5) == 0) + len -= 5; + + len = std::min(len, size - 1); + target[len] = 0; + memcpy(target, src, len); + return target; +} + +static const char *GetCurrentConfigTitle(char *target, size_t size) { + const char *ll = FindLastFolderSep(g_current_filename); + return StripConfExtension(ll ? ll + 1 : g_current_filename, target, size); +} + +static void LoadConfigFile(const char *filename, bool save, bool force_start) { + str_set(&g_current_filename, filename); + char namebuf[64]; + char *f = str_cat_alloc("TunSafe VPN Client - ", GetCurrentConfigTitle(namebuf, sizeof(namebuf))); + SetWindowText(g_ui_window, f); + free(f); + RestartService(UIW_NONE, !force_start); + if (save) + RegWriteStr("ConfigFile", filename); +} + +static void AddToAvailableFilesPopup(HMENU menu, int max_num_items, bool is_settings) { + char buf[1024]; + int nfiles = 0; + if (!GetConfigFullName("*.*", buf, ARRAYSIZE(buf))) + return; + + int selected_item = -1; + WIN32_FIND_DATA wfd; + HANDLE handle = FindFirstFile(buf, &wfd); + if (handle != INVALID_HANDLE_VALUE) { + do { + if (wfd.cFileName[0] == '.') + continue; + + if (strcmp(g_current_filename, wfd.cFileName) == 0) + selected_item = nfiles; + + str_set(&config_filenames[nfiles], wfd.cFileName); + + nfiles++; + if (nfiles == MAX_CONFIG_FILES) + break; + } while (FindNextFile(handle, &wfd)); + FindClose(handle); + } + + HMENU where; + + bool is_connected = g_backend->is_started(); + + where = menu; + for (int i = 0; i < nfiles; i++) { + if (i == max_num_items) { + where = CreatePopupMenu(); + AppendMenu(menu, MF_POPUP, (UINT_PTR)where, "&More"); + } + + AppendMenu(where, (i == selected_item && is_connected) ? 
MF_CHECKED : 0, ID_POPUP_CONFIG_FILE + i, StripConfExtension(config_filenames[i], buf, sizeof(buf))); + + if (i == selected_item) + SetMenuDefaultItem(where, ID_POPUP_CONFIG_FILE + i, MF_BYCOMMAND); + } + if (nfiles) + AppendMenu(menu, MF_SEPARATOR, 0, 0); +} + +static void ShowSettingsMenu(HWND wnd) { + HMENU menu = CreatePopupMenu(); + + AddToAvailableFilesPopup(menu, 10, true); + + AppendMenu(menu, 0, IDSETT_OPEN_FILE, "&Import File..."); + AppendMenu(menu, 0, IDSETT_BROWSE_FILES, "&Browse in Explorer"); + + AppendMenu(menu, MF_SEPARATOR, 0, 0); + AppendMenu(menu, 0, IDSETT_KEYPAIR, "Generate &Key Pair..."); + AppendMenu(menu, MF_SEPARATOR, 0, 0); + + HMENU blockinternet = CreatePopupMenu(); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_OFF, "Off"); + AppendMenu(blockinternet, MF_SEPARATOR, 0, 0); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_ROUTE, "Yes, with Routing Rules"); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_FIREWALL, "Yes, with Firewall Rules"); + AppendMenu(blockinternet, 0, IDSETT_BLOCKINTERNET_BOTH, "Yes, Both Methods"); + bool is_activated = false; + int value = GetInternetBlockState(&is_activated); + CheckMenuRadioItem(blockinternet, IDSETT_BLOCKINTERNET_OFF, IDSETT_BLOCKINTERNET_BOTH, IDSETT_BLOCKINTERNET_OFF + value, MF_BYCOMMAND); + AppendMenu(menu, MF_POPUP + is_activated * MF_CHECKED, (UINT_PTR)blockinternet, "Block &All Internet Traffic"); + + if (g_allow_pre_post || GetAsyncKeyState(VK_SHIFT) < 0) { + AppendMenu(menu, g_allow_pre_post ? MF_CHECKED : 0, IDSETT_PREPOST, "&Allow Pre/Post commands"); + } + + AppendMenu(menu, MF_SEPARATOR, 0, 0); + AppendMenu(menu, 0, IDSETT_WEB_PAGE, "Go to &Web Page"); + AppendMenu(menu, 0, IDSETT_OPENSOURCE, "See Open Source Licenses"); + AppendMenu(menu, 0, IDSETT_ABOUT, "&About TunSafe..."); + + POINT pt; + GetCursorPos(&pt); + + g_last_popup_is_tray = false; + int rv = TrackPopupMenu(menu, 0, pt.x, pt.y, 0, wnd, NULL); + DestroyMenu(menu); +} + +void FindDesktopFolderView(REFIID riid, void **ppv) { + CComPtr spShellWindows; + spShellWindows.CoCreateInstance(CLSID_ShellWindows); + + CComVariant vtLoc(CSIDL_DESKTOP); + CComVariant vtEmpty; + long lhwnd; + CComPtr spdisp; + spShellWindows->FindWindowSW( + &vtLoc, &vtEmpty, + SWC_DESKTOP, &lhwnd, SWFO_NEEDDISPATCH, &spdisp); + + CComPtr spBrowser; + CComQIPtr(spdisp)-> + QueryService(SID_STopLevelBrowser, + IID_PPV_ARGS(&spBrowser)); + + CComPtr spView; + spBrowser->QueryActiveShellView(&spView); + + spView->QueryInterface(riid, ppv); +} + +void GetDesktopAutomationObject(REFIID riid, void **ppv) { + CComPtr spsv; + FindDesktopFolderView(IID_PPV_ARGS(&spsv)); + CComPtr spdispView; + spsv->GetItemObject(SVGIO_BACKGROUND, IID_PPV_ARGS(&spdispView)); + spdispView->QueryInterface(riid, ppv); +} + +void ShellExecuteFromExplorer( + PCSTR pszFile, + PCSTR pszParameters = nullptr, + PCSTR pszDirectory = nullptr, + PCSTR pszOperation = nullptr, + int nShowCmd = SW_SHOWNORMAL) { + CComPtr spFolderView; + GetDesktopAutomationObject(IID_PPV_ARGS(&spFolderView)); + CComPtr spdispShell; + spFolderView->get_Application(&spdispShell); + + CComQIPtr(spdispShell) + ->ShellExecute(CComBSTR(pszFile), + CComVariant(pszParameters ? pszParameters : ""), + CComVariant(pszDirectory ? pszDirectory : ""), + CComVariant(pszOperation ? 
pszOperation : ""), + CComVariant(nShowCmd)); +} + +static void OpenEditor() { + char buf[MAX_PATH]; + if (GetConfigFullName(g_current_filename, buf, ARRAYSIZE(buf))) { + SHELLEXECUTEINFO shinfo = {0}; + shinfo.cbSize = sizeof(shinfo); + shinfo.fMask = SEE_MASK_CLASSNAME; + shinfo.lpFile = buf; + shinfo.lpParameters = ""; + shinfo.lpClass = ".txt"; + shinfo.nShow = SW_SHOWNORMAL; + ShellExecuteEx(&shinfo); + } +} + +static void BrowseFiles() { + char buf[MAX_PATH]; + if (GetConfigFullName("", buf, ARRAYSIZE(buf))) { + size_t l = strlen(buf); + buf[l - 1] = 0; + ShellExecuteFromExplorer(buf, NULL, NULL, "explore"); + } +} + +bool FileExists(const CHAR *fileName) { + DWORD fileAttr = GetFileAttributes(fileName); + return (0xFFFFFFFF != fileAttr); +} + +__int64 FileSize(const char* name) { + WIN32_FILE_ATTRIBUTE_DATA fad; + if (!GetFileAttributesEx(name, GetFileExInfoStandard, &fad)) + return -1; // error condition, could call GetLastError to find out more + LARGE_INTEGER size; + size.HighPart = fad.nFileSizeHigh; + size.LowPart = fad.nFileSizeLow; + return size.QuadPart; +} + +static bool is_space(uint8_t c) { + return c == ' ' || c == '\r' || c == '\n' || c == '\t'; +} + +static bool is_valid(uint8_t c) { + return c >= ' ' || c == '\r' || c == '\n' || c == '\t'; +} + +bool SanityCheckBuf(uint8 *buf, size_t n) { + for (size_t i = 0; i < n; i++) { + if (!is_space(buf[i])) { + if (buf[i] != '[' && buf[i] != '#') + return false; + for (; i < n; i++) + if (!is_valid(buf[i])) + return false; + return true; + } + } + return false; +} + +uint8* LoadFileSane(const char *name, size_t *size) { + FILE *f = fopen(name, "rb"); + uint8 *new_file = NULL, *file = NULL; + size_t j, i, n; + if (!f) return false; + fseek(f, 0, SEEK_END); + long x = ftell(f); + fseek(f, 0, SEEK_SET); + if (x < 0 || x >= 65536) goto error; + file = (uint8*)malloc(x + 1); + if (!file) goto error; + n = fread(file, 1, x + 1, f); + if (n != x || !SanityCheckBuf(file, n)) + goto error; + // Convert the file to DOS new lines + for (i = j = 0; i < n; i++) + j += (file[i] == '\n'); + new_file = (uint8*)malloc(n + 1 + j); + if (!new_file) goto error; + for (i = j = 0; i < n; i++) { + uint8 c = file[i]; + if (c == '\r') + continue; + if (c == '\n') + new_file[j++] = '\r'; + new_file[j++] = c; + } + new_file[j] = 0; + *size = j; + +error: + fclose(f); + free(file); + return new_file; +} + +bool WriteOutFile(const char *filename, uint8 *filedata, size_t filesize) { + FILE *f = fopen(filename, "wb"); + if (!f) return false; + if (fwrite(filedata, 1, filesize, f) != filesize) { + fclose(f); + return false; + } + fclose(f); + return true; +} + +void ImportFile(const char *s) { + char buf[1024]; + char mesg[1024]; + size_t filesize; + const char *last = FindLastFolderSep(s); + if (!last || !GetConfigFullName(last + 1, buf, ARRAYSIZE(buf)) || _stricmp(buf, s) == 0) + return; + + uint8 *filedata = LoadFileSane(s, &filesize); + if (!filedata) goto fail; + + if (FileExists(buf)) { + snprintf(mesg, ARRAYSIZE(mesg), "A file already exists with the name '%s' in the configuration folder. 
Do you want to overwrite it?", last + 1); + if (MessageBoxA(g_ui_window, mesg, "TunSafe", MB_OKCANCEL | MB_ICONEXCLAMATION) != IDOK) + goto out; + } else { + snprintf(mesg, ARRAYSIZE(mesg), "Do you want to import '%s' into TunSafe?", last + 1); + if (MessageBoxA(g_ui_window, mesg, "TunSafe", MB_OKCANCEL | MB_ICONQUESTION) != IDOK) + goto out; + } + + if (!WriteOutFile(buf, filedata, filesize)) { + DeleteFileA(buf); +fail: + MessageBoxA(g_ui_window, "There was a problem importing the file.", "TunSafe", MB_ICONEXCLAMATION); + } else { + LoadConfigFile(last + 1, true, false); + } + +out: + free(filedata); +} + +void ShowUI(HWND hWnd) { + g_ui_visible = true; + UpdateStats(); + ShowWindow(hWnd, SW_SHOW); + BringWindowToTop(hWnd); + SetForegroundWindow(hWnd); +} + +void HandleDroppedFiles(HWND wnd, HDROP hdrop) { + char buf[MAX_PATH]; + if (DragQueryFile(hdrop, -1, NULL, 0) == 1) { + if (DragQueryFile(hdrop, 0, buf, ARRAYSIZE(buf))) { + SetForegroundWindow(wnd); + ImportFile(buf); + } + } + DragFinish(hdrop); +} + +void BrowseFile(HWND wnd) { + char szFile[1024]; + + // open a file name + OPENFILENAME ofn = {0}; + ofn.lStructSize = sizeof(ofn); + ofn.hwndOwner = g_ui_window; + ofn.lpstrFile = szFile; + ofn.lpstrFile[0] = '\0'; + ofn.nMaxFile = sizeof(szFile); + ofn.lpstrFilter = "Config Files (*.conf)\0*.conf\0"; + ofn.nFilterIndex = 1; + ofn.lpstrFileTitle = NULL; + ofn.nMaxFileTitle = 0; + ofn.lpstrInitialDir = NULL; + ofn.Flags = OFN_PATHMUSTEXIST | OFN_FILEMUSTEXIST; + if (GetOpenFileName(&ofn)) + ImportFile(szFile); +} + +static const uint8 kCurve25519Basepoint[32] = {9}; + +static void SetKeyBox(HWND wnd, int ctr, uint8 buf[32]) { + uint8 *privs = base64_encode(buf, 32, NULL); + SetDlgItemText(wnd, ctr, (char*)privs); + free(privs); +} + +static INT_PTR WINAPI KeyPairDlgProc(HWND hWnd, UINT message, WPARAM wParam, + LPARAM lParam) { + switch (message) { + case WM_INITDIALOG: + return TRUE; + case WM_CLOSE: + EndDialog(hWnd, 0); + return TRUE; + case WM_COMMAND: + switch (wParam) { + case IDCANCEL: + EndDialog(hWnd, 0); + return TRUE; + case IDC_PRIVATE_KEY | (EN_CHANGE << 16) : { + char buf[128]; + uint8 pub[32]; + uint8 priv[32]; + buf[0] = 0; + size_t len = GetDlgItemText(hWnd, IDC_PRIVATE_KEY, buf, sizeof(buf)); + size_t olen = 32; + if (base64_decode((uint8*)buf, len, priv, &olen) && olen == 32) { + curve25519_donna(pub, priv, kCurve25519Basepoint); + SetKeyBox(hWnd, IDC_PUBLIC_KEY, pub); + } else { + SetDlgItemText(hWnd, IDC_PUBLIC_KEY, "(Invalid Private Key)"); + } + + return TRUE; + } + case IDRAND: { + uint8 priv[32]; + uint8 pub[32]; + OsGetRandomBytes(priv, 32); + curve25519_normalize(priv); + curve25519_donna(pub, priv, kCurve25519Basepoint); + SetKeyBox(hWnd, IDC_PRIVATE_KEY, priv); + SetKeyBox(hWnd, IDC_PUBLIC_KEY, pub); + return TRUE; + } + } + } + return FALSE; +} + +bool wm_dropfiles_recursive; +uint64 last_auto_service_restart; +static INT_PTR WINAPI DlgProc(HWND hWnd, UINT message, WPARAM wParam, + LPARAM lParam) { + switch(message) { + case WM_INITDIALOG: + return TRUE; + case WM_CLOSE: + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + return TRUE; + case WM_COMMAND: + if (wParam >= ID_POPUP_CONFIG_FILE && wParam < ID_POPUP_CONFIG_FILE + MAX_CONFIG_FILES) { + const char *new_conf = config_filenames[wParam - ID_POPUP_CONFIG_FILE]; + if (!new_conf) + return TRUE; + + if (g_last_popup_is_tray && strcmp(new_conf, g_current_filename) == 0 && g_backend->is_started()) { + StopService(UIW_NONE); + } else { + LoadConfigFile(new_conf, true, g_last_popup_is_tray); + } + + + 
return TRUE; + } + switch(wParam) { + case ID_START: + StopService(UIW_NONE); + StartService(); + break; + case ID_STOP: StopService(UIW_NONE); break; + case ID_EXIT: PostQuitMessage(0); break; + case ID_RESET: g_backend->ResetStats(); break; + case ID_MORE_BUTTON: ShowSettingsMenu(hWnd); break; + case IDSETT_WEB_PAGE: ShellExecute(NULL, NULL, "https://tunsafe.com/", NULL, NULL, 0); break; + case IDSETT_OPENSOURCE: ShellExecute(NULL, NULL, "https://tunsafe.com/open-source", NULL, NULL, 0); break; + case ID_EDITCONF: OpenEditor(); break; + case IDSETT_BROWSE_FILES:BrowseFiles(); break; + case IDSETT_OPEN_FILE: BrowseFile(hWnd); break; + case IDSETT_ABOUT: + MessageBoxA(g_ui_window, TUNSAFE_VERSION_STRING "\r\n\r\nCopyright © 2018, Ludvig Strigeus\r\n\r\nThanks for choosing TunSafe!\r\n\r\nThis version was built on " __DATE__ " " __TIME__, "About TunSafe", MB_ICONINFORMATION); + break; + case IDSETT_KEYPAIR: + DialogBox(g_hinstance, MAKEINTRESOURCE(IDD_DIALOG2), hWnd, &KeyPairDlgProc); + break; + case IDSETT_BLOCKINTERNET_OFF: + case IDSETT_BLOCKINTERNET_ROUTE: + case IDSETT_BLOCKINTERNET_FIREWALL: + case IDSETT_BLOCKINTERNET_BOTH: { + InternetBlockState old_state = GetInternetBlockState(NULL); + InternetBlockState new_state = (InternetBlockState)(wParam - IDSETT_BLOCKINTERNET_OFF); + + if (old_state == kBlockInternet_Off && new_state != kBlockInternet_Off) { + if (MessageBoxA(g_ui_window, "Warning! All Internet traffic will be blocked until you restart your computer. Only traffic through TunSafe will be allowed.\r\n\r\nThe blocking is activated the next time you connect to a VPN server.\r\n\r\nDo you want to continue?", "TunSafe", MB_ICONWARNING | MB_OKCANCEL) == IDCANCEL) + return TRUE; + } + + SetInternetBlockState(new_state); + + if ((~old_state & new_state) && g_backend->is_started()) { + StopService(UIW_NONE); + StartService(); + } + return TRUE; + } + case IDSETT_PREPOST: { + g_allow_pre_post = !g_allow_pre_post; + RegWriteInt("AllowPrePost", g_allow_pre_post); + return TRUE; + } + } + break; + case WM_DROPFILES: + if (!wm_dropfiles_recursive) { + wm_dropfiles_recursive = true; + HandleDroppedFiles(hWnd, (HDROP)wParam); + wm_dropfiles_recursive = false; + } + break; + case WM_USER + 1: + if (lParam == WM_RBUTTONUP) { + HMENU menu = CreatePopupMenu(); + AddToAvailableFilesPopup(menu, 10, false); + + bool active = g_backend->is_started(); + AppendMenu(menu, 0, ID_START, active ? "Re&connect" : "&Connect"); + AppendMenu(menu, active ? 
0 : MF_GRAYED, ID_STOP, "&Disconnect"); + AppendMenu(menu, MF_SEPARATOR, 0, NULL); + AppendMenu(menu, 0, ID_EXIT, "&Exit"); + POINT pt; + GetCursorPos(&pt); + + SetForegroundWindow(hWnd); + + g_last_popup_is_tray = true; + + int rv = TrackPopupMenu(menu, 0, pt.x, pt.y, 0, hWnd, NULL); + DestroyMenu(menu); + } else if (lParam == WM_LBUTTONDBLCLK) { + if (IsWindowVisible(hWnd)) { + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + } else { + ShowUI(hWnd); + } + } + return TRUE; + case WM_USER + 2: + if (g_ui_ip != 0 && g_minimize_on_connect) { + g_minimize_on_connect = false; + g_ui_visible = false; + ShowWindow(hWnd, SW_HIDE); + } + UpdateIcon(UIW_NONE); + return TRUE; + case WM_USER + 3: { + CHARRANGE cr; + cr.cpMin = -1; + cr.cpMax = -1; + // hwnd = rich edit hwnd + SendDlgItemMessage(hWnd, IDC_RICHEDIT21, EM_EXSETSEL, 0, (LPARAM)&cr); + SendDlgItemMessage(hWnd, IDC_RICHEDIT21, EM_REPLACESEL, 0, (LPARAM)lParam); + free( (void*) lParam); + return true; + } + case WM_USER + 6: + SetDlgItemText(hWnd, IDC_RICHEDIT21, ""); + return true; + case WM_USER + 5: + UpdatePublicKey((char*)lParam); + return true; + case WM_USER + 4: { + UpdateStats(); + return true; + } + case WM_USER + 10: + break; + + case WM_USER + 11: { + uint64 now = GetTickCount64(); + if (now < last_auto_service_restart + 5000) { + RERROR("Too many automatic restarts..."); + StopService(UIW_STOPPED_WORKING_FAIL); + } else { + last_auto_service_restart = now; + RestartService(UIW_STOPPED_WORKING_RETRY, true); + } + break; + } + } + return FALSE; +} + +struct PostMsg { + int msg; + WPARAM wparam; + LPARAM lparam; + PostMsg(int a, WPARAM b, LPARAM c) : msg(a), wparam(b), lparam(c) {} +}; + +static HANDLE msg_event; +static CRITICAL_SECTION msg_section; +static std::vector msgvect; + +static DWORD WINAPI MessageThread(void *x) { + std::vector proc; + for(;;) { + WaitForSingleObject(msg_event, INFINITE); + proc.clear(); + EnterCriticalSection(&msg_section); + std::swap(proc, msgvect); + LeaveCriticalSection(&msg_section); + for(size_t i = 0; i != proc.size(); i++) + PostMessage(g_ui_window, proc[i].msg, proc[i].wparam, proc[i].lparam); + } +} + +static void MyPostMessage(int msg, WPARAM wparam, LPARAM lparam) { + size_t count; + EnterCriticalSection(&msg_section); + count = msgvect.size(); + msgvect.emplace_back(msg, wparam, lparam); + LeaveCriticalSection(&msg_section); + if (count == 0) SetEvent(msg_event); +} + +static void InitMyPostMessage() { + msg_event = CreateEvent(NULL, FALSE, FALSE, NULL); + InitializeCriticalSection(&msg_section); + DWORD thread_id; + CloseHandle(CreateThread(NULL, 0, &MessageThread, NULL, 0, &thread_id)); +} + + +void OsGetRandomBytes(uint8 *data, size_t data_size) { +#if defined(OS_WIN) + static BOOLEAN(APIENTRY *pfn)(void*, ULONG); + static bool resolved; + if (!resolved) { + pfn = (BOOLEAN(APIENTRY *)(void*, ULONG))GetProcAddress(LoadLibrary("ADVAPI32.DLL"), "SystemFunction036"); + resolved = true; + } + if (pfn && pfn(data, (ULONG)data_size)) + return; + int r = 0; +#else + int fd = open("/dev/urandom", O_RDONLY); + int r = read(fd, data, data_size); + if (r < 0) r = 0; + close(fd); +#endif + for (; r < data_size; r++) + data[r] = rand() >> 6; +} + +void OsInterruptibleSleep(int millis) { + SleepEx(millis, TRUE); +} + + +uint64 OsGetMilliseconds() { + return GetTickCount64(); +} + +void OsGetTimestampTAI64N(uint8 dst[12]) { + SYSTEMTIME systime; + uint64 file_time_uint64 = 0; + GetSystemTime(&systime); + SystemTimeToFileTime(&systime, (FILETIME*)&file_time_uint64); + uint64 time_since_epoch_100ns 
= (file_time_uint64 - 116444736000000000); + uint64 secs_since_epoch = time_since_epoch_100ns / 10000000 + 0x400000000000000a; + uint32 nanos = (uint32)(time_since_epoch_100ns % 10000000) * 100; + WriteBE64(dst, secs_since_epoch); + WriteBE32(dst + 8, nanos); +} + + + +void PushLine(const char *s) { + size_t l = strlen(s); + char buf[64]; + SYSTEMTIME t; + + GetLocalTime(&t); + + snprintf(buf, sizeof(buf), "[%.2d:%.2d:%.2d] ", t.wHour, t.wMinute, t.wSecond); + size_t tl = strlen(buf); + + char *x = (char*)malloc(tl + l + 3); + if (!x) return; + memcpy(x, buf, tl); + memcpy(x + tl, s, l); + x[l + tl] = '\r'; + x[l + tl + 1] = '\n'; + x[l + tl + 2] = '\0'; + MyPostMessage(WM_USER + 3, 0, (LPARAM)x); +} + +void EnsureConfigDirCreated() { + char fullname[1024]; + if (GetConfigFullName("", fullname, sizeof(fullname))) + CreateDirectory(fullname, NULL); +} + +void EnableControl(int wnd, bool b) { + EnableWindow(GetDlgItem(g_ui_window, wnd), b); +} + + +LRESULT CALLBACK NotifyWndProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam) { + switch (uMsg) { + case WM_USER + 10: + if (wParam == 1) { + PostQuitMessage(0); + return 31337; + } else if (wParam == 0) { + ShowUI(g_ui_window); + return 31337; + } + break; + } + return DefWindowProc(hwnd, uMsg, wParam, lParam); +} + +void CreateNotificationWindow() { + WNDCLASSEX wce = {0}; + wce.cbSize = sizeof(wce); + wce.lpfnWndProc = &NotifyWndProc; + wce.hInstance = g_hinstance; + wce.lpszClassName = "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90"; + RegisterClassEx(&wce); + CreateWindow("TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", 0, 0, 0, 0, 0, 0, 0, g_hinstance, NULL); +} + + +void CallbackUpdateUI() { + if (g_ui_visible) + MyPostMessage(WM_USER + 4, NULL, NULL); +} + +void CallbackTriggerReconnect() { + PostMessage(g_ui_window, WM_USER + 11, 0, 0); +} + +void CallbackSetPublicKey(const uint8 public_key[32]) { + char *str = (char*)base64_encode(public_key, 32, NULL); + PostMessage(g_ui_window, WM_USER + 5, NULL, (LPARAM)str); +} + +int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd) { + g_hinstance = hInstance; + InitCpuFeatures(); + + // Check if the app is already running. + CreateMutexA(0, FALSE, "TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90"); + if (GetLastError() == ERROR_ALREADY_EXISTS) { + HWND window = FindWindow("TunSafe-f19e092db01cbe0fb6aee132f8231e5b71c98f90", NULL); + DWORD_PTR result; + if (!window || !SendMessageTimeout(window, WM_USER + 10, 0, 0, SMTO_BLOCK, 3000, &result) || result != 31337) { + MessageBoxA(NULL, "It looks like TunSafe is already running, but not responding. 
Please kill the old process first.", "TunSafe", MB_ICONWARNING); + } + return 1; + } + CreateNotificationWindow(); + + WSADATA wsaData = {0}; + if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) { + RERROR("WSAStartup failed"); + return 1; + } + + LoadLibrary(TEXT("Riched20.dll")); + + g_backend = new TunsafeBackendWin32(); + + InitMyPostMessage(); + InitCommonControls(); + + g_icons[0] = LoadIcon(GetModuleHandle(NULL), MAKEINTRESOURCE(IDI_ICON1)); + g_icons[1] = LoadIcon(GetModuleHandle(NULL), MAKEINTRESOURCE(IDI_ICON0)); + g_ui_window = CreateDialog(GetModuleHandle(NULL), MAKEINTRESOURCE(IDD_DIALOG1), NULL, &DlgProc); + + if (!g_ui_window) + return 1; + + RegCreateKeyEx(HKEY_CURRENT_USER, "Software\\TunSafe", NULL, NULL, 0, KEY_ALL_ACCESS, NULL, &g_reg_key, NULL); + DragAcceptFiles(g_ui_window, TRUE); + + ChangeWindowMessageFilter(WM_DROPFILES, MSGFLT_ADD); + ChangeWindowMessageFilter(WM_COPYDATA, MSGFLT_ADD); + ChangeWindowMessageFilter(0x0049, MSGFLT_ADD); + + static const int ctrls[] = {IDTXT_UDP, IDTXT_TUN, IDTXT_HANDSHAKE}; + for (int i = 0; i < 3; i++) { + HWND w = GetDlgItem(g_ui_window, ctrls[i]); + SetWindowLong(w, GWL_EXSTYLE, GetWindowLong(w, GWL_EXSTYLE) | WS_EX_COMPOSITED); + } + + g_allow_pre_post = RegReadInt("AllowPrePost", 0) != 0; + + bool minimize = false; + const char *filename = NULL; + + for (size_t i = 1; i < __argc; i++) { + const char *arg = __argv[i]; + + if (_stricmp(arg, "/minimize") == 0) { + minimize = true; + } else if (_stricmp(arg, "/minimize_on_connect") == 0) { + g_minimize_on_connect = true; + } else if (_stricmp(arg, "/allow_pre_post") == 0) { + g_allow_pre_post = true; + } else { + filename = arg; + break; + } + } + + if (!minimize) { + g_ui_visible = true; + ShowWindow(g_ui_window, SW_SHOW); + } + + UpdateIcon(UIW_NONE); + + g_logger = &PushLine; + + EnsureConfigDirCreated(); + + if (filename) { + LoadConfigFile(filename, false, false); + } else { + char *conf = RegReadStr("ConfigFile", "TunSafe.conf"); + LoadConfigFile(conf, false, false); + free(conf); + } + + // PrintCpuFeatures(); + +// Benchmark(); + + if (filename != NULL || RegReadInt("IsConnected", 0)) { + StartService(); + } else { + RINFO("Press Connect to initiate a connection to the WireGuard server."); + } + + MSG msg; + + while (GetMessage(&msg, NULL, 0, 0)) { + if (!IsDialogMessage(g_ui_window, &msg)) { + TranslateMessage(&msg); + DispatchMessage(&msg); + } + } + StopService(UIW_EXITING); + RemoveIcon(); + + return 0; +} + + + diff --git a/util.cpp b/util.cpp new file mode 100644 index 0000000..a601a0b --- /dev/null +++ b/util.cpp @@ -0,0 +1,267 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
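For reference, OsGetTimestampTAI64N above converts a Windows FILETIME into the 12-byte TAI64N format used for the WireGuard handshake timestamp: 116444736000000000 is the count of 100ns ticks between the Windows epoch (1601-01-01) and the Unix epoch (1970-01-01), and 0x400000000000000a is the TAI64 label 2^62 plus the 10-second TAI-UTC offset at the Unix epoch. A minimal standalone sketch of the same layout built from POSIX time, useful for cross-checking the constants (not part of this commit; PosixTimestampTAI64N is a hypothetical name):

  #include <stdint.h>
  #include <time.h>

  static void PosixTimestampTAI64N(uint8_t dst[12]) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t secs = (uint64_t)ts.tv_sec + 0x400000000000000aull;  // TAI64 label + 10s offset
    uint32_t nanos = (uint32_t)ts.tv_nsec;
    // TAI64N is 8 bytes of big-endian seconds followed by 4 bytes of big-endian nanoseconds.
    for (int i = 0; i < 8; i++) dst[i] = (uint8_t)(secs >> (56 - 8 * i));
    for (int i = 0; i < 4; i++) dst[8 + i] = (uint8_t)(nanos >> (24 - 8 * i));
  }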
+#include "stdafx.h" + +#include +#include +#include +#include +#include + +#if defined(OS_POSIX) +#include +#include +#include +#include +#include +#include +#endif + +#include "tunsafe_types.h" + +static char base64_alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; + +uint8 *base64_encode(const uint8 *input, size_t length, size_t *out_length) { + uint32 a; + size_t size; + uint8 *result, *r; + const uint8 *end; + + size = length * 4 / 3 + 4 + 1; + r = result = (byte*)malloc(size); + + end = input + length - 3; + + // Encode full blocks + while (input <= end) { + a = (input[0] << 16) + (input[1] << 8) + input[2]; + input += 3; + + r[0] = base64_alphabet[(a >> 18)/* & 0x3F*/]; + r[1] = base64_alphabet[(a >> 12) & 0x3F]; + r[2] = base64_alphabet[(a >> 6) & 0x3F]; + r[3] = base64_alphabet[(a) & 0x3F]; + r += 4; + } + + if (input == end + 2) { + a = input[0] << 4; + r[0] = base64_alphabet[(a >> 6) /*& 0x3F*/]; + r[1] = base64_alphabet[(a) & 0x3F]; + r[2] = '='; + r[3] = '='; + r += 4; + } else if (input == end + 1) { + a = (input[0] << 10) + (input[1] << 2); + r[0] = base64_alphabet[(a >> 12) /*& 0x3F*/]; + r[1] = base64_alphabet[(a >> 6) & 0x3F]; + r[2] = base64_alphabet[(a) & 0x3F]; + r[3] = '='; + r += 4; + } + if (out_length) + *out_length = r - result; + *r = 0; + return result; +} + +#define WHITESPACE 64 +#define EQUALS 65 +#define INVALID 66 + +static const unsigned char d[] = { + 66,66,66,66,66,66,66,66,66,66,64,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,62,66,66,66,63,52,53, + 54,55,56,57,58,59,60,61,66,66,66,65,66,66,66, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, + 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,66,66,66,66,66,66,26,27,28, + 29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66, + 66,66,66,66,66,66 +}; + +bool base64_decode(uint8 *in, size_t inLen, uint8 *out, size_t *outLen) { + uint8 *end = in + inLen; + uint8 iter = 0; + uint32_t buf = 0; + size_t len = 0; + + while (in < end) { + unsigned char c = d[*in++]; + + switch (c) { + case WHITESPACE: continue; /* skip whitespace */ + case INVALID: return false; /* invalid input, return error */ + case EQUALS: /* pad character, end of data */ + in = end; + continue; + default: + buf = buf << 6 | c; + iter++; + if (iter == 4) { + if ((len += 3) > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 16) & 255; + *(out++) = (buf >> 8) & 255; + *(out++) = buf & 255; + buf = 0; iter = 0; + + } + } + } + if (iter == 3) { + if ((len += 2) > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 10) & 255; + *(out++) = (buf >> 2) & 255; + } else if (iter == 2) { + if (++len > *outLen) return 0; /* buffer overflow */ + *(out++) = (buf >> 4) & 255; + } + *outLen = len; + return true; +} + + + +int RunCommand(const char *fmt, ...) 
{ + const char *fmt_org = fmt; + va_list va; + std::string tmp; + char buf[32], c; + char *args[33]; + char *envp[1] = {NULL}; + int nargs = 0; + va_start(va, fmt); + for (;;) { + c = *fmt++; + if (c == '%') { + c = *fmt++; + if (c == 0) goto ZERO; + if (c == 's') { + tmp += va_arg(va, char*); + } else if (c == 'd') { + snprintf(buf, 32, "%d", va_arg(va, int)); + tmp += buf; + } else if (c == 'u') { + snprintf(buf, 32, "%u", va_arg(va, int)); + tmp += buf; + } else if (c == '%') { + tmp += '%'; + } else if (c == 'A') { + struct in_addr in; + in.s_addr = htonl(va_arg(va, in_addr_t)); + tmp += inet_ntoa(in); + } + } else if (c == ' ' || c == 0) { +ZERO: + args[nargs++] = _strdup(tmp.c_str()); + tmp.clear(); + if (nargs == 32 || c == 0) break; + } else { + tmp += c; + } + } + args[nargs] = 0; + + fprintf(stderr, "Run:"); + for (int i = 0; args[i]; i++) + fprintf(stderr, " %s", args[i]); + fprintf(stderr, "\n"); + + int ret = -1; + + +#if defined(OS_POSIX) + pid_t pid = fork(); + if (pid == 0) { + execve(args[0], args, envp); + exit(127); + } + if (pid < 0) { + RERROR("Fork failed"); + } else if (waitpid(pid, &ret, 0) != pid) { + ret = -1; + } +#endif + + if (ret != 0) + RERROR("Command %s failed %d!", fmt_org, ret); + + return ret; +} + +bool IsOnlyZeros(const uint8 *data, size_t data_size) { + for (size_t i = 0; i != data_size; i++) + if (data[i]) + return false; + return true; +} + + +#ifdef _MSC_VER +void printhex(const char *name, const void *a, size_t l) { + char buf[256]; + snprintf(buf, 256, "%s (%d):", name, (int)l); OutputDebugString(buf); + for (size_t i = 0; i < l; i++) { + if (i % 4 == 0) printf(" "); + snprintf(buf, 256, "%.2X", *((uint8*)a + i)); OutputDebugString(buf); + } + OutputDebugString("\n"); +} + +#else +void printhex(const char *name, const void *a, size_t l) { + printf("%s (%d):", name, (int)l); + for (size_t i = 0; i < l; i++) { + if (i % 4 == 0) printf(" "); + printf("%.2X", *((uint8*)a + i)); + } + printf("\n"); +} +#endif + +typedef void Logger(const char *msg); +Logger *g_logger; + +#undef RERROR +#undef void + +void RERROR(const char *msg, ...); + +void RERROR(const char *msg, ...) { + va_list va; + char buf[512]; + va_start(va, msg); + vsnprintf(buf, sizeof(buf), msg, va); + va_end(va); + if (g_logger) { + g_logger(buf); + } else { + fputs(buf, stderr); + fputs("\n", stderr); + } +} + +void rinfo(const char *msg, ...) { + printf("muu"); +} + +void rinfo2(const char *msg) { + printf("muu2"); +} + +void RINFO(const char *msg, ...) { + va_list va; + char buf[512]; + va_start(va, msg); + vsnprintf(buf, sizeof(buf), msg, va); + va_end(va); + if (g_logger) { + g_logger(buf); + } else { + fputs(buf, stderr); + fputs("\n", stderr); + } +} diff --git a/util.h b/util.h new file mode 100644 index 0000000..48b8324 --- /dev/null +++ b/util.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#pragma once +#include "tunsafe_types.h" + +uint8 *base64_encode(const uint8 *input, size_t length, size_t *out_length); +bool base64_decode(uint8 *in, size_t inLen, uint8 *out, size_t *outLen); +bool IsOnlyZeros(const uint8 *data, size_t data_size); + +int RunCommand(const char *fmt, ...); +typedef void Logger(const char *msg); +extern Logger *g_logger; + + diff --git a/wireguard.cpp b/wireguard.cpp new file mode 100644 index 0000000..ab9b393 --- /dev/null +++ b/wireguard.cpp @@ -0,0 +1,998 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
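The base64 helpers in util.cpp above carry the 32-byte WireGuard keys through the config parser. A hypothetical round-trip check (not in this commit) showing how the two calls compose; uint8 is assumed to be the typedef from tunsafe_types.h:

  #include <stdlib.h>
  #include <string.h>
  #include "util.h"

  static bool Base64KeyRoundTrip(void) {
    uint8 key[32] = {1, 2, 3};               // arbitrary key material
    size_t encoded_len;
    uint8 *encoded = base64_encode(key, sizeof(key), &encoded_len);
    if (!encoded) return false;
    uint8 decoded[32];
    size_t decoded_len = sizeof(decoded);    // in: buffer capacity, out: bytes written
    bool ok = encoded_len == 44 &&           // 32 bytes encode to 44 chars, the familiar key length
              base64_decode(encoded, encoded_len, decoded, &decoded_len) &&
              decoded_len == sizeof(key) &&
              memcmp(decoded, key, sizeof(key)) == 0;
    free(encoded);                           // base64_encode returns a malloc'd, NUL-terminated buffer
    return ok;
  }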
+#include "stdafx.h" +#include "wireguard.h" +#include "netapi.h" +#include "wireguard_proto.h" +#include "crypto/chacha20poly1305.h" +#include "crypto/blake2s.h" +#include "crypto/siphash.h" +#include "tunsafe_endian.h" +#include +#include +#include +#include +#include "wireguard.h" + +uint64 OsGetMilliseconds(); + +enum { + IPV4_HEADER_SIZE = 20, + IPV6_HEADER_SIZE = 40, +}; + +WireguardProcessor::WireguardProcessor(UdpInterface *udp, TunInterface *tun, ProcessorDelegate *procdel) { + tun_addr_.size = 0; + tun6_addr_.size = 0; + udp_ = udp; + tun_ = tun; + procdel_ = procdel; + mtu_ = 1420; + memset(&stats_, 0, sizeof(stats_)); + listen_port_ = 0; + network_discovery_spoofing_ = false; + add_routes_mode_ = true; + dns_blocking_ = true; + internet_blocking_ = kBlockInternet_Default; + dns6_addr_.sin.sin_family = dns_addr_.sin.sin_family = 0; +} + +WireguardProcessor::~WireguardProcessor() { +} + +bool WireguardProcessor::AddDnsServer(const IpAddr &sin) { + IpAddr *target = (sin.sin.sin_family == AF_INET6) ? &dns6_addr_ : &dns_addr_; + if (target->sin.sin_family != 0) + return false; + *target = sin; + return true; +} + + +bool WireguardProcessor::SetTunAddress(const WgCidrAddr &addr) { + WgCidrAddr *target = (addr.size == 128) ? &tun6_addr_ : &tun_addr_; + if (target->size != 0) + return false; + *target = addr; + return true; +} + + +ProcessorStats WireguardProcessor::GetStats() { + stats_.last_complete_handskake_timestamp = dev_.last_complete_handskake_timestamp(); + return stats_; +} + +void WireguardProcessor::ResetStats() { + memset(&stats_, 0, sizeof(stats_)); +} + +void WireguardProcessor::SetupCompressionHeader(WgPacketCompressionVer01 *c) { + memset(c, 0, sizeof(WgPacketCompressionVer01)); + // Windows uses a ttl of 128 while other platforms use 64 +#if defined(OS_WIN) + c->ttl = 128; +#else // defined(OS_WIN) + c->ttl = 64; +#endif // defined(OS_WIN) + WriteLE16(&c->version, EXT_PACKET_COMPRESSION_VER); + memcpy(c->ipv4_addr, &tun_addr_.addr, 4); + if (tun6_addr_.size == 128) + memcpy(c->ipv6_addr, &tun6_addr_.addr, 16); + c->flags = ((tun_addr_.cidr >> 3) & 3); +} + +static inline bool CheckFirstNbitsEquals(const byte *a, const byte *b, size_t n) { + return memcmp(a, b, n >> 3) == 0 && ((n & 7) == 0 || !((a[n >> 3] ^ b[n >> 3]) & (0xff << (8 - (n & 7))))); +} + +static bool IsWgCidrAddrSubsetOf(const WgCidrAddr &inner, const WgCidrAddr &outer) { + return inner.size == outer.size && inner.cidr >= outer.cidr && + CheckFirstNbitsEquals(inner.addr, outer.addr, outer.cidr); +} + +bool WireguardProcessor::Start() { + if (!udp_->Initialize(listen_port_)) + return false; + + if (tun_addr_.size != 32) { + RERROR("No IPv4 address configured"); + return false; + } + + if (tun_addr_.cidr >= 31) { + RERROR("The TAP driver is not compatible with Address using CIDR /31 or /32. Changing to /24"); + tun_addr_.cidr = 24; + } + + TunInterface::TunConfig config = {0}; + config.ip = ReadBE32(tun_addr_.addr); + config.cidr = tun_addr_.cidr; + config.mtu = mtu_; + config.pre_post_commands = pre_post_; + + uint32 netmask = tun_addr_.cidr == 32 ? 0xffffffff : 0xffffffff << (32 - tun_addr_.cidr); + + uint32 ipv4_broadcast_addr = (netmask == 0xffffffff) ? 0xffffffff : config.ip | ~netmask; + + if (tun6_addr_.size == 128) { + if (tun6_addr_.cidr > 126) { + RERROR("IPv6 /127 or /128 not supported. 
Changing to 120"); + tun6_addr_.cidr = 120; + } + config.ipv6_cidr = tun6_addr_.cidr; + memcpy(&config.ipv6_address, tun6_addr_.addr, 16); + } + + if (add_routes_mode_) { + WgPeer *peer = (WgPeer *)dev_.ip_to_peer_map().LookupV4DefaultPeer(); + if (peer != NULL && peer->endpoint_.sin.sin_family != 0) { + config.default_route_endpoint_v4 = (peer->endpoint_.sin.sin_family == AF_INET) ? ReadBE32(&peer->endpoint_.sin.sin_addr) : 0; + // Set the default route to something + config.use_ipv4_default_route = true; + } + + // Also configure ipv6 gw? + if (config.ipv6_cidr != 0) { + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6DefaultPeer(); + if (peer != NULL && peer->endpoint_.sin.sin_family != 0) { + if (peer->endpoint_.sin.sin_family == AF_INET6) + memcpy(&config.default_route_endpoint_v6, &peer->endpoint_.sin6.sin6_addr, 16); + config.use_ipv6_default_route = true; + } + } + + // For each peer, add the extra routes to the extra routes table + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + for (auto it = peer->allowed_ips_.begin(); it != peer->allowed_ips_.end(); ++it) { + // Don't add an entry if it's identical to my address or it's a default route + if (IsWgCidrAddrSubsetOf(*it, tun_addr_) || IsWgCidrAddrSubsetOf(*it, tun6_addr_) || it->cidr == 0) + continue; + // Don't add an entry if we have no ipv6 address configured + if (config.ipv6_cidr == 0 && it->size != 32) + continue; + config.extra_routes.push_back(*it); + } + } + } + + uint8 dhcp_options[6]; + + config.block_dns_on_adapters = dns_blocking_; + config.internet_blocking = internet_blocking_; + + if (dns_addr_.sin.sin_family == AF_INET) { + dhcp_options[0] = 6; + dhcp_options[1] = 4; + memcpy(&dhcp_options[2], &dns_addr_.sin.sin_addr, 4); + config.dhcp_options = dhcp_options; + config.dhcp_options_size = sizeof(dhcp_options); + } + + if (dns6_addr_.sin6.sin6_family == AF_INET6) { + config.set_ipv6_dns = true; + memcpy(&config.dns_server_v6, &dns6_addr_.sin6.sin6_addr, 16); + } + + TunInterface::TunConfigOut config_out; + if (!tun_->Initialize(std::move(config), &config_out)) + return false; + + SetupCompressionHeader(dev_.compression_header()); + + network_discovery_spoofing_ = config_out.enable_neighbor_discovery_spoofing; + memcpy(network_discovery_mac_, config_out.neighbor_discovery_spoofing_mac, 6); + + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + peer->ipv4_broadcast_addr_ = ipv4_broadcast_addr; + if (peer->endpoint_.sin.sin_family != 0) { + RINFO("Sending handshake..."); + SendHandshakeInitiationAndResetRetries(peer); + } + } + + return true; +} + +static uint8 kIcmpv6NeighborMulticastPrefix[] = {0xff, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,0x00, 0x00, 0x00, 0x01, 0xff}; + +enum { + kIpProto_ICMPv6 = 0x3A, + kICMPv6_NeighborSolicitation = 135, +}; + +#pragma pack(push, 1) +struct ICMPv6NaPacket { + uint8 type; + uint8 code; + uint16 checksum; + uint8 rso; + uint8 reserved[3]; + uint8 target[16]; + uint8 opt_type; + uint8 opt_length; + uint8 target_mac[6]; +}; + +struct ICMPv6NaPacketWithoutTarget { + uint8 type; + uint8 code; + uint16 checksum; + uint8 rso; + uint8 reserved[3]; + uint8 target[16]; +}; + +#pragma pack (pop) + + +static uint16 ComputeIcmpv6Checksum(const uint8 *buf, int buf_size, const uint8 src_addr[16], const uint8 dst_addr[16]) { + uint32 sum = 0; + for (int i = 0; i < buf_size - 1; i += 2) + sum += ReadBE16(&buf[i]); + if (buf_size & 1) + sum += buf[buf_size - 1]; + for (int i = 0; i < 16; i += 2) + sum += ReadBE16(&src_addr[i]); + for (int i = 0; i < 
16; i += 2) + sum += ReadBE16(&dst_addr[i]); + sum += (uint16)IPPROTO_ICMPV6 + (uint16)buf_size; + while (sum >> 16) + sum = (sum & 0xFFFF) + (sum >> 16); + return ((uint16)~sum); +} + + +bool WireguardProcessor::HandleIcmpv6NeighborSolicitation(const byte *data, size_t data_size) { + if (data_size < 48 + 16) + return false; + + // Filter out neighbor solicitation + if (data[40] != kICMPv6_NeighborSolicitation || data[41] != 0) + return false; + + if (!network_discovery_spoofing_) + return false; + + bool is_broadcast = true; + + if (memcmp(data + 24, kIcmpv6NeighborMulticastPrefix, sizeof(kIcmpv6NeighborMulticastPrefix)) != 0) { + if (memcmp(data + 24, data + 48, 16) != 0) + return false; + is_broadcast = false; + } + + // Target address must match a peer's range. + WgPeer *peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 48); + if (peer == NULL) + return false; + + // Build response packet + Packet *out = AllocPacket(); + if (out == NULL) + return false; + + byte *odata = out->data; + + int packet_size = is_broadcast ? sizeof(ICMPv6NaPacket) : sizeof(ICMPv6NaPacketWithoutTarget); + + memcpy(odata, data, 4); + WriteBE16(odata + 4, packet_size); + odata[6] = 58; // next = icmp + odata[7] = 255; // HopLimit + memcpy(odata + 8, data + 48, 16); // Source Address + memcpy(odata + 24, data + 8, 16); // Dest addr + + ((ICMPv6NaPacket*)(odata + 40))->type = 136; // NA + ((ICMPv6NaPacket*)(odata + 40))->code = 0; + ((ICMPv6NaPacket*)(odata + 40))->checksum = 0; + ((ICMPv6NaPacket*)(odata + 40))->rso = 0x60; // solicited + memset(((ICMPv6NaPacket*)(odata + 40))->reserved, 0, 3); + memcpy(((ICMPv6NaPacket*)(odata + 40))->target, odata + 8, 16); + if (is_broadcast) { + ((ICMPv6NaPacket*)(odata + 40))->opt_type = 2; + ((ICMPv6NaPacket*)(odata + 40))->opt_length = 1; + + memcpy(((ICMPv6NaPacket*)(odata + 40))->target_mac, network_discovery_mac_, 6); + + // For some reason this is openvpn's 'related mac' + ((ICMPv6NaPacket*)(odata + 40))->target_mac[2] += 1; + } + uint16 checksum = ComputeIcmpv6Checksum(odata + 40, packet_size, odata + 8, odata + 24); + WriteBE16(&((ICMPv6NaPacket*)(odata + 40))->checksum, checksum); + + out->size = 40 + packet_size; + tun_->WriteTunPacket(out); + return true; +} + +static inline bool IsIpv6Multicast(const uint8 dst[16]) { + return dst[0] == 0xff; +} + +// On incoming packet to the tun interface. +void WireguardProcessor::HandleTunPacket(Packet *packet) { + uint8 *data = packet->data; + size_t data_size = packet->size; + unsigned ip_version, size_from_header; + WgPeer *peer; + + stats_.tun_bytes_in += data_size; + stats_.tun_packets_in++; + + // Sanity check that it looks like a valid ipv4 or ipv6 packet, + // and determine the destination peer from the ip header + if (data_size < IPV4_HEADER_SIZE) + goto getout; + + ip_version = *data >> 4; + if (ip_version == 4) { + uint32 ip = ReadBE32(data + 16); + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV4(ip); + if (peer == NULL) + goto getout; + if ((ip >= (224 << 24) || ip == peer->ipv4_broadcast_addr_) && !peer->allow_multicast_through_peer_) + goto getout; + + size_from_header = ReadBE16(data + 2); + if (size_from_header < IPV4_HEADER_SIZE) + goto getout; + } else if (ip_version == 6) { + if (data_size < IPV6_HEADER_SIZE) + goto getout; + + // Check if the packet is a Neighbor solicitation ICMP6 packet, in that case fake + // a reply. 
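+ // (No real IPv6 neighbor exists behind the tunnel device, so the advertisement is synthesized locally by HandleIcmpv6NeighborSolicitation above and the solicitation is never forwarded to the peer.)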
+ if (data[6] == kIpProto_ICMPv6 && HandleIcmpv6NeighborSolicitation(data, data_size)) + goto getout; + + peer = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 24); + if (peer == NULL) + goto getout; + + if (IsIpv6Multicast(data + 24) && !peer->allow_multicast_through_peer_) + goto getout; + + size_from_header = IPV6_HEADER_SIZE + ReadBE16(data + 4); + } else { + goto getout; + } + if (size_from_header > data_size) + goto getout; + if (peer->endpoint_.sin.sin_family == 0) + goto getout; + + WritePacketToUdp(peer, packet); + return; + +getout: + // send ICMP? + FreePacket(packet); +} + +void WireguardProcessor::WritePacketToUdp(WgPeer *peer, Packet *packet) { + byte *data = packet->data; + size_t size = packet->size; + bool want_handshake; + uint64 send_ctr; + WgKeypair *keypair = peer->curr_keypair_; + + if (keypair == NULL || + keypair->send_key_state == WgKeypair::KEY_INVALID || + keypair->send_ctr >= REJECT_AFTER_MESSAGES) + goto getout_handshake; + + want_handshake = (keypair->send_ctr >= REKEY_AFTER_MESSAGES || + keypair->send_key_state == WgKeypair::KEY_WANT_REFRESH); + + // Ensure packet will fit including the biggest padding + if (size > kPacketCapacity - 15 - CHACHA20POLY1305_AUTHTAGLEN) + goto getout_discard; + + if (size == 0) { + peer->OnKeepaliveSent(); + } else { + peer->OnDataSent(); + +#if WITH_HANDSHAKE_EXT + // Attempt to compress the packet headers using ipzip. + if (keypair->enabled_features[WG_FEATURE_ID_IPZIP]) { + uint32 rv = IpzipCompress(data, (uint32)size, &keypair->ipzip_state_, 0); + if (rv == (uint32)-1) + goto getout_discard; + if (rv == 0) + goto add_padding; + stats_.compression_hdr_saved_out += (int32)(size - rv); + data += (int32)(size - rv); + size = rv; + } else { +add_padding: +#else + { +#endif // WITH_HANDSHAKE_EXT + // Pad packet to a multiple of 16 bytes, but no more than the mtu bytes. + unsigned padding = std::min((0 - size) & 15, (unsigned)mtu_ - (unsigned)size); + memset(data + size, 0, padding); + size += padding; + } + } + send_ctr = keypair->send_ctr++; + +#if WITH_SHORT_HEADERS + if (keypair->enabled_features[WG_FEATURE_ID_SHORT_HEADER]) { + size_t header_size; + byte *write = data; + uint8 tag = WG_SHORT_HEADER_BIT, inner_tag; + // For every 16 incoming packets, send out an ack. + if (keypair->incoming_packet_count >= 16) { + keypair->incoming_packet_count = 0; + uint64 next_expected_packet = keypair->replay_detector.expected_seq_nr(); + if (next_expected_packet < 0x10000) { + WriteLE16(write -= 2, (uint16)next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_2; + } else if (next_expected_packet < 0x100000000ull) { + WriteLE32(write -= 4, (uint32)next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_4; + } else { + WriteLE64(write -= 8, next_expected_packet); + inner_tag = WG_ACK_HEADER_COUNTER_8; + } + if (keypair->broadcast_short_key != 0) { + inner_tag += keypair->addr_entry_slot; + keypair->broadcast_short_key = 2; + } + *--write = inner_tag; + tag += WG_SHORT_HEADER_ACK; + } else if (keypair->broadcast_short_key == 1) { + keypair->broadcast_short_key = 2; + *--write = keypair->addr_entry_slot; + tag += WG_SHORT_HEADER_ACK; + } + + // Determine the distance from the most recently acked packet, + // be conservative when picking a suitable packet length to send. 
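+ // (A truncated counter is only unambiguous while the gap to the last acked counter stays well inside the signed range of the truncated width, hence the conservative thresholds below and the fallback to the full 16-byte header via need_big_packet.)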
+ uint64 distance = send_ctr - keypair->send_ctr_acked; + if (distance < (1 << 6)) { + *(write -= 1) = (uint8)send_ctr; + tag += WG_SHORT_HEADER_CTR1; + } else if (distance < (1 << 14)) { + WriteLE16(write -= 2, (uint16)send_ctr); + tag += WG_SHORT_HEADER_CTR2; + } else if (distance < (1 << 30)) { + WriteLE32(write -= 4, (uint32)send_ctr); + tag += WG_SHORT_HEADER_CTR4; + } else { + // Too far ahead. Can't use short packets. + goto need_big_packet; + } + + tag += keypair->can_use_short_key_for_outgoing; + if (!keypair->can_use_short_key_for_outgoing) + WriteLE32(write -= 4, keypair->remote_key_id); + *--write = tag; + + + header_size = data - write; + + stats_.compression_wg_saved_out += (int64)16 - header_size; + + packet->data = data - header_size; + packet->size = (int)(size + header_size + keypair->auth_tag_length); + WgKeypairEncryptPayload(data, size, write, data - write, send_ctr, keypair); + } else { +need_big_packet: +#else + { +#endif // #if WITH_SHORT_HEADERS + ((MessageData*)data)[-1].type = ToLE32(MESSAGE_DATA); + ((MessageData*)data)[-1].receiver_id = keypair->remote_key_id; + ((MessageData*)data)[-1].counter = ToLE64(send_ctr); + packet->data = data - sizeof(MessageData); + packet->size = (int)(size + sizeof(MessageData) + keypair->auth_tag_length); + WgKeypairEncryptPayload(data, size, NULL, 0, send_ctr, keypair); + } + + packet->addr = peer->endpoint_; + DoWriteUdpPacket(packet); + if (want_handshake) + SendHandshakeInitiationAndResetRetries(peer); + return; + +getout_discard: + FreePacket(packet); + return; + +getout_handshake: + // Keep only the first MAX_QUEUED_PACKETS packets. + while (peer->num_queued_packets_ >= MAX_QUEUED_PACKETS_PER_PEER) { + Packet *packet = peer->first_queued_packet_; + peer->first_queued_packet_ = packet->next; + peer->num_queued_packets_--; + FreePacket(packet); + } + // Add the packet to the out queue that will get sent once handshake completes + *peer->last_queued_packet_ptr_ = packet; + peer->last_queued_packet_ptr_ = &packet->next; + packet->next = NULL; + peer->num_queued_packets_++; + + SendHandshakeInitiationAndResetRetries(peer); +} + +// This scrambles the initial 16 bytes of the packet with the +// trailing 8 bytes of the packet. +static void ScrambleUnscramblePacket(Packet *packet, ScramblerSiphashKeys *keys) { + uint8 *data = packet->data; + size_t data_size = packet->size; + + if (data_size < 8) + return; + + uint64 last_uint64 = ReadLE64(data_size >= 24 ? 
data + 16 : data + data_size - 8); + uint64 a = siphash_u64_u32(last_uint64, (uint32)data_size, (siphash_key_t*)&keys->keys[0]); + uint64 b = siphash_u64_u32(last_uint64, (uint32)data_size, (siphash_key_t*)&keys->keys[2]); + a = ToLE64(a); + b = ToLE64(b); + if (data_size >= 24) { + ((uint64*)data)[0] ^= a; + ((uint64*)data)[1] ^= b; + } else { + struct { uint64 a, b; } scramblers = {a, b}; + uint8 *s = (uint8*)&scramblers; + for (size_t i = 0; i < data_size - 8; i++) + data[i] ^= s[i]; + } +} + +static NOINLINE void ScrambleUnscrambleAndWrite(Packet *packet, ScramblerSiphashKeys *keys, UdpInterface *udp) { +#if WITH_HEADER_OBFUSCATION + ScrambleUnscramblePacket(packet, keys); + udp->WriteUdpPacket(packet); +#endif // WITH_HEADER_OBFUSCATION +} + +void WireguardProcessor::DoWriteUdpPacket(Packet *packet) { + stats_.udp_packets_out++; + stats_.udp_bytes_out += packet->size; + if (!dev_.header_obfuscation_) + udp_->WriteUdpPacket(packet); + else + ScrambleUnscrambleAndWrite(packet, &dev_.header_obfuscation_key_, udp_); +} + +void WireguardProcessor::SendHandshakeInitiationAndResetRetries(WgPeer *peer) { + peer->handshake_attempts_ = 0; + SendHandshakeInitiation(peer); +} + +void WireguardProcessor::SendHandshakeInitiation(WgPeer *peer) { + // Send out a handshake init packet to trigger the handshake procedure + if (!peer->CheckHandshakeRateLimit()) + return; + Packet *packet = AllocPacket(); + if (!packet) + return; + peer->CreateMessageHandshakeInitiation(packet); + + packet->addr = peer->endpoint_; + DoWriteUdpPacket(packet); + peer->OnHandshakeInitSent(); +} + +// Handles an incoming WireGuard packet from the UDP side, decrypt etc. +void WireguardProcessor::HandleUdpPacket(Packet *packet, bool overload) { + uint32 type; + + stats_.udp_bytes_in += packet->size; + stats_.udp_packets_in++; + + // Unscramble incoming packets +#if WITH_HEADER_OBFUSCATION + if (dev_.header_obfuscation_) + ScrambleUnscramblePacket(packet, &dev_.header_obfuscation_key_); +#endif // WITH_HEADER_OBFUSCATION + + if (packet->size < sizeof(uint32)) + goto invalid_size; + type = ReadLE32((uint32*)packet->data); + if (type == MESSAGE_DATA) { + if (packet->size < sizeof(MessageData)) + goto invalid_size; + HandleDataPacket(packet); +#if WITH_SHORT_HEADERS + } else if (type & WG_SHORT_HEADER_BIT) { + HandleShortHeaderFormatPacket(type, packet); +#endif // WITH_SHORT_HEADERS + } else if (type == MESSAGE_HANDSHAKE_COOKIE) { + if (packet->size != sizeof(MessageHandshakeCookie)) + goto invalid_size; + HandleHandshakeCookiePacket(packet); + } else if (type == MESSAGE_HANDSHAKE_INITIATION) { + if (WITH_HANDSHAKE_EXT ? (packet->size < sizeof(MessageHandshakeInitiation)) : (packet->size != sizeof(MessageHandshakeInitiation))) + goto invalid_size; + + if (!CheckIncomingHandshakeRateLimit(packet, overload)) + return; + HandleHandshakeInitiationPacket(packet); + } else if (type == MESSAGE_HANDSHAKE_RESPONSE) { + if (WITH_HANDSHAKE_EXT ? (packet->size < sizeof(MessageHandshakeResponse)) : (packet->size != sizeof(MessageHandshakeResponse))) + goto invalid_size; + if (!CheckIncomingHandshakeRateLimit(packet, overload)) + return; + HandleHandshakeResponsePacket(packet); + } else { + // unknown packet +invalid_size: + FreePacket(packet); + } +} + +// Returns nonzero if two endpoints are different. 
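+// (Differences are accumulated with XOR/OR across family, address and port, so a single nonzero bit anywhere means the endpoints differ, with no early-exit branches.)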
+static uint32 CompareEndpoint(const IpAddr *a, const IpAddr *b) { + uint32 rv = b->sin.sin_family ^ a->sin.sin_family; + if (b->sin.sin_family != AF_INET6) { + rv |= b->sin.sin_addr.s_addr ^ a->sin.sin_addr.s_addr; + rv |= b->sin.sin_port ^ a->sin.sin_port; + } else { + uint64 rx = ((uint64*)&b->sin6.sin6_addr)[0] ^ ((uint64*)&a->sin6.sin6_addr)[0]; + rx |= ((uint64*)&b->sin6.sin6_addr)[1] ^ ((uint64*)&a->sin6.sin6_addr)[1]; + rv |= rx | (rx >> 32); + rv |= b->sin6.sin6_port ^ a->sin6.sin6_port; + } + return rv; +} + +void WgPeer::CopyEndpointToPeer(WgKeypair *keypair, const IpAddr *addr) { + // Remember how to send packets to this peer + if (CompareEndpoint(&keypair->peer->endpoint_, addr)) { +#if WITH_SHORT_HEADERS + // When the endpoint changes, forget about using the short key. + keypair->broadcast_short_key = 0; + keypair->can_use_short_key_for_outgoing = false; +#endif // WITH_SHORT_HEADERS + keypair->peer->endpoint_ = *addr; + } +} + +#if WITH_SHORT_HEADERS +void WireguardProcessor::HandleShortHeaderFormatPacket(uint32 tag, Packet *packet) { + uint8 *data = packet->data + 1; + size_t bytes_left = packet->size - 1; + WgKeypair *keypair; + uint64 counter, acked_counter; + uint8 ack_tag; + + if ((tag & WG_SHORT_HEADER_KEY_ID_MASK) == 0x00) { + // The key_id is explicitly included in the packet. + if (bytes_left < 4) goto getout; + uint32 key_id = ReadLE32(data); + data += 4, bytes_left -= 4; + auto it = dev_.key_id_lookup().find(key_id); + if (it == dev_.key_id_lookup().end()) goto getout; + keypair = it->second.second; + } else { + // Lookup the packet source ip and port in the address mapping + uint64 addr_id = packet->addr.sin.sin_addr.s_addr | ((uint64)packet->addr.sin.sin_port << 32); + auto it = dev_.addr_entry_map().find(addr_id); + if (it == dev_.addr_entry_map().end()) + goto getout; + WgAddrEntry *addr_entry = it->second; + keypair = addr_entry->keys[((tag / WG_SHORT_HEADER_KEY_ID) & 3) - 1]; + } + + if (!keypair || keypair->recv_key_state == WgKeypair::KEY_INVALID || + !keypair->enabled_features[WG_FEATURE_ID_SHORT_HEADER]) + goto getout; + + // Pick the closest possible counter value with the same low bits. + counter = keypair->replay_detector.expected_seq_nr(); + switch (tag & WG_SHORT_HEADER_TYPE_MASK) { + case WG_SHORT_HEADER_CTR1: + if (bytes_left < 1) goto getout; + counter += (int8)(*data - counter); + data += 1, bytes_left -= 1; + break; + case WG_SHORT_HEADER_CTR2: + if (bytes_left < 2) goto getout; + counter += (int16)(ReadLE16(data) - counter); + data += 2, bytes_left -= 2; + break; + case WG_SHORT_HEADER_CTR4: + if (bytes_left < 4) goto getout; + counter += (int32)(ReadLE32(data) - counter); + data += 4, bytes_left -= 4; + break; + default: + goto getout; // invalid packet + } + + acked_counter = 0; + ack_tag = 0; + + // If the acknowledge header is present, then parse it so we may + // get an ack for the highest seen packet. 
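+ // (The ack tells the sender how far ahead it may run while still using truncated counters, and its key bit lets the sender omit the explicit key id in subsequent packets.)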
+ if (tag & WG_SHORT_HEADER_ACK) { + if (bytes_left == 0) goto getout; + ack_tag = *data; + data += 1, bytes_left -= 1; + + switch (ack_tag & WG_ACK_HEADER_COUNTER_MASK) { + case WG_ACK_HEADER_COUNTER_2: + if (bytes_left < 2) goto getout; + acked_counter = ReadLE16(data); + data += 2, bytes_left -= 2; + break; + case WG_ACK_HEADER_COUNTER_4: + if (bytes_left < 4) goto getout; + acked_counter = ReadLE32(data); + data += 4, bytes_left -= 4; + break; + case WG_ACK_HEADER_COUNTER_8: + if (bytes_left < 8) goto getout; + acked_counter = ReadLE64(data); + data += 8, bytes_left -= 8; + break; + default: + break; + } + } + if (counter >= REJECT_AFTER_MESSAGES) + goto getout; + // Authenticate the packet before we can apply the state changes. + if (!WgKeypairDecryptPayload(data, bytes_left, packet->data, data - packet->data, counter, keypair)) + goto getout; + + if (!keypair->replay_detector.CheckReplay(counter)) + goto getout; + + stats_.compression_wg_saved_in += 16 - (data - packet->data); + + keypair->send_ctr_acked = std::max(keypair->send_ctr_acked, acked_counter); + keypair->incoming_packet_count++; + + WgPeer::CopyEndpointToPeer(keypair, &packet->addr); + + // Periodically broadcast out the short key + if ((tag & WG_SHORT_HEADER_KEY_ID_MASK) == 0x00 && !keypair->did_attempt_remember_ip_port) { + keypair->did_attempt_remember_ip_port = true; + if (keypair->enabled_features[WG_FEATURE_ID_SKIP_KEYID_IN]) { + uint64 addr_id = packet->addr.sin.sin_addr.s_addr | ((uint64)packet->addr.sin.sin_port << 32); + dev_.UpdateKeypairAddrEntry(addr_id, keypair); + } + } + + // Ack header may also signal that we can omit the key id in packets from now on. + if (tag & WG_SHORT_HEADER_ACK) + keypair->can_use_short_key_for_outgoing = (ack_tag & WG_ACK_HEADER_KEY_MASK) * WG_SHORT_HEADER_KEY_ID; + + HandleAuthenticatedDataPacket(keypair, packet, data, bytes_left - keypair->auth_tag_length); + return; +getout: + FreePacket(packet); + return; +} +#endif // WITH_SHORT_HEADERS + +void WireguardProcessor::HandleAuthenticatedDataPacket(WgKeypair *keypair, Packet *packet, uint8 *data, size_t data_size) { + WgPeer *peer = keypair->peer; + + // Promote the next key to the current key when we receive a data packet, + // the handshake is now complete. + if (peer->CheckSwitchToNextKey(keypair)) { + if (procdel_) { + procdel_->OnConnected(ReadBE32(tun_addr_.addr)); + } + peer->OnHandshakeFullyComplete(); + SendQueuedPackets(peer); + } + + // Refresh when current key gets too old + if (peer->curr_keypair_ && peer->curr_keypair_->recv_key_state == WgKeypair::KEY_WANT_REFRESH) { + peer->curr_keypair_->recv_key_state = WgKeypair::KEY_DID_REFRESH; + SendHandshakeInitiationAndResetRetries(peer); + } + + if (data_size == 0) { + peer->OnKeepaliveReceived(); + goto getout; + } + peer->OnDataReceived(); + +#if WITH_HANDSHAKE_EXT + // Unpack the packet headers using ipzip + if (keypair->enabled_features[WG_FEATURE_ID_IPZIP]) { + uint32 rv = IpzipDecompress(data, (uint32)data_size, &keypair->ipzip_state_, IPZIP_RECV_BY_CLIENT); + if (rv == (uint32)-1) + goto getout; // ipzip failed decompress + stats_.compression_hdr_saved_in += (int64)rv - data_size; + data -= (int64)rv - data_size, data_size = rv; + } +#endif // WITH_HANDSHAKE_EXT + + // Verify that the packet is a valid ipv4 or ipv6 packet of proper length, + // with a source address that belongs to the peer. 
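+ // (This is cryptokey routing on the receive side: a decrypted packet whose source address falls outside the sending peer's allowed IPs is dropped.)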
+ WgPeer *peer_from_header; + unsigned int ip_version, size_from_header; + + ip_version = *data >> 4; + if (ip_version == 4) { + if (data_size < IPV4_HEADER_SIZE) { + // too small ipv4 header + goto getout; + } + peer_from_header = (WgPeer*)dev_.ip_to_peer_map().LookupV4(ReadBE32(data + 12)); + size_from_header = ReadBE16(data + 2); + if (size_from_header < IPV4_HEADER_SIZE) { + // too small packet? + goto getout; + } + } else if (ip_version == 6) { + if (data_size < IPV6_HEADER_SIZE) { + // too small ipv6 header + goto getout; + } + peer_from_header = (WgPeer*)dev_.ip_to_peer_map().LookupV6(data + 8); + size_from_header = IPV6_HEADER_SIZE + ReadBE16(data + 4); + } else { + // invalid ip version + goto getout; + } + if (size_from_header > data_size) { + // oversized packet? + goto getout; + } + if (peer_from_header != peer) { + // source address mismatch? + goto getout; + } + //RINFO("Outgoing TUN packet of size %d", (int)size_from_header); + packet->data = data; + packet->size = size_from_header; + + stats_.tun_bytes_out += packet->size; + stats_.tun_packets_out++; + + tun_->WriteTunPacket(packet); + return; + +getout: + FreePacket(packet); + return; +} + +void WireguardProcessor::HandleDataPacket(Packet *packet) { + uint8 *data = packet->data; + size_t data_size = packet->size; + uint32 key_id = ((MessageData*)data)->receiver_id; + uint64 counter = ToLE64((((MessageData*)data)->counter)); + WgKeypair *keypair; + + auto it = dev_.key_id_lookup().find(key_id); + if (it == dev_.key_id_lookup().end() || + (keypair = it->second.second) == NULL || + keypair->recv_key_state == WgKeypair::KEY_INVALID) { +getout: + FreePacket(packet); + return; + } + + if (counter >= REJECT_AFTER_MESSAGES) + goto getout; + + if (!WgKeypairDecryptPayload(data + sizeof(MessageData), data_size - sizeof(MessageData), + NULL, 0, counter, keypair)) { + goto getout; + } + if (!keypair->replay_detector.CheckReplay(counter)) + goto getout; + + WgPeer::CopyEndpointToPeer(keypair, &packet->addr); + HandleAuthenticatedDataPacket(keypair, packet, data + sizeof(MessageData), data_size - sizeof(MessageData) - keypair->auth_tag_length); +} + +static uint64 GetIpForRateLimit(Packet *packet) { + if (packet->addr.sin.sin_family == AF_INET) { + return ReadLE32(&packet->addr.sin.sin_addr); + } else { + return ReadLE64(&packet->addr.sin6.sin6_addr); + } +} + +bool WireguardProcessor::CheckIncomingHandshakeRateLimit(Packet *packet, bool overload) { + WgRateLimit::RateLimitResult rr = dev_.rate_limiter()->CheckRateLimit(GetIpForRateLimit(packet)); + if ((overload && rr.is_rate_limited()) || !dev_.CheckCookieMac1(packet)) { + FreePacket(packet); + return false; + } + if (overload && !rr.is_first_ip() && !dev_.CheckCookieMac2(packet)) { + dev_.rate_limiter()->CommitResult(rr); + dev_.CreateCookieMessage((MessageHandshakeCookie*)packet->data, packet, ((MessageHandshakeInitiation*)packet->data)->sender_key_id); + packet->size = sizeof(MessageHandshakeCookie); + DoWriteUdpPacket(packet); + return false; + } + dev_.rate_limiter()->CommitResult(rr); + return true; +} + +// server receives this when client wants to setup a session +void WireguardProcessor::HandleHandshakeInitiationPacket(Packet *packet) { + WgPeer *peer = WgPeer::ParseMessageHandshakeInitiation(&dev_, packet); + if (!peer) { + FreePacket(packet); + return; + } + peer->OnHandshakeAuthComplete(); + DoWriteUdpPacket(packet); +} + +// client receives this after session is established +void WireguardProcessor::HandleHandshakeResponsePacket(Packet *packet) { + WgPeer *peer = 
WgPeer::ParseMessageHandshakeResponse(&dev_, packet); + if (!peer) { + FreePacket(packet); + return; + } + peer->endpoint_ = packet->addr; + FreePacket(packet); + peer->OnHandshakeAuthComplete(); + peer->OnHandshakeFullyComplete(); + if (procdel_) + procdel_->OnConnected(ReadBE32(tun_addr_.addr)); + SendKeepalive(peer); +} + +void WireguardProcessor::SendKeepalive(WgPeer *peer) { + // can't send keepalive if no endpoint is configured + if (peer->endpoint_.sin.sin_family == 0) + return; + + // If nothing is queued, insert a keepalive packet + if (peer->first_queued_packet_ == NULL) { + Packet *packet = AllocPacket(); + if (!packet) + return; + packet->size = 0; + packet->next = NULL; + peer->first_queued_packet_ = packet; + } + SendQueuedPackets(peer); +} + +void WireguardProcessor::SendQueuedPackets(WgPeer *peer) { + // Steal the packets + Packet *packet = peer->first_queued_packet_; + peer->first_queued_packet_ = NULL; + peer->last_queued_packet_ptr_ = &peer->first_queued_packet_; + peer->num_queued_packets_ = 0; + while (packet) { + Packet *next = packet->next; + WritePacketToUdp(peer, packet); + packet = next; + } +} + +void WireguardProcessor::HandleHandshakeCookiePacket(Packet *packet) { + WgPeer::ParseMessageHandshakeCookie(&dev_, (MessageHandshakeCookie *)packet->data); +} + +void WireguardProcessor::SecondLoop() { + uint64 now = OsGetMilliseconds(); + for (WgPeer *peer = dev_.first_peer(); peer; peer = peer->next_peer_) { + + // Allow ip/port to be remembered again for this keypair + if (peer->curr_keypair_) + peer->curr_keypair_->did_attempt_remember_ip_port = false; + + uint32 mask = peer->CheckTimeouts(now); + if (mask == 0) + continue; + if (mask & WgPeer::ACTION_SEND_KEEPALIVE) + SendKeepalive(peer); + if (mask & WgPeer::ACTION_SEND_HANDSHAKE) + SendHandshakeInitiation(peer); + } + + dev_.SecondLoop(now); +} + diff --git a/wireguard.h b/wireguard.h new file mode 100644 index 0000000..ef050c5 --- /dev/null +++ b/wireguard.h @@ -0,0 +1,133 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. 
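The short-header receive path in wireguard.cpp above reconstructs the full 64-bit counter from a 1-, 2- or 4-byte truncation by snapping to the value nearest the receiver's expected sequence number. A standalone restatement of the 16-bit case (hypothetical helper name, not in this commit):

  #include <stdint.h>

  // Returns the 64-bit counter closest to `expected` whose low 16 bits equal
  // `wire`; the signed cast wraps the delta into [-32768, 32767].
  static uint64_t ExpandCounter16(uint64_t expected, uint16_t wire) {
    return expected + (int16_t)(wire - (uint16_t)expected);
  }

  // e.g. ExpandCounter16(0x10000, 0xFFFF) == 0xFFFF   (one behind)
  //      ExpandCounter16(0x1FFFF, 0x0002) == 0x20002  (three ahead)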
+#pragma once + +#include "tunsafe_types.h" +#include "wireguard_proto.h" + +struct ProcessorStats { + // Number of bytes sent/received over the physical UDP connections + int64 udp_bytes_in, udp_bytes_out; + int64 udp_packets_in, udp_packets_out; + // Number of bytes sent/received over the TUN interface + int64 tun_bytes_in, tun_bytes_out; + int64 tun_packets_in, tun_packets_out; + uint64 last_complete_handskake_timestamp; + + int64 compression_hdr_saved_in, compression_hdr_saved_out; + + int64 compression_wg_saved_in, compression_wg_saved_out; +}; + +class ProcessorDelegate { +public: + virtual void OnConnected(in_addr_t my_ip) = 0; + virtual void OnDisconnected() = 0; +}; + +enum InternetBlockState { + kBlockInternet_Off, + kBlockInternet_Route, + kBlockInternet_Firewall, + kBlockInternet_Both, + + // An unspecified value that uses either route or firewall + kBlockInternet_DefaultOn = 254, + + kBlockInternet_Default = 255, +}; + +class WireguardProcessor { +public: + WireguardProcessor(UdpInterface *udp, TunInterface *tun, ProcessorDelegate *procdel); + ~WireguardProcessor(); + + void SetListenPort(int listen_port) { + listen_port_ = listen_port; + } + + bool SetTunAddress(const WgCidrAddr &addr); + + bool AddDnsServer(const IpAddr &sin); + + void SetMtu(int mtu) { + if (mtu >= 576 && mtu <= 10000) + mtu_ = mtu; + } + + void SetAddRoutesMode(bool mode) { + add_routes_mode_ = mode; + } + + void SetDnsBlocking(bool dns_blocking) { + dns_blocking_ = dns_blocking; + } + + void SetInternetBlocking(InternetBlockState internet_blocking) { + internet_blocking_ = internet_blocking; + } + + void SetHeaderObfuscation(const char *key) { + dev_.SetHeaderObfuscation(key); + } + + void HandleTunPacket(Packet *packet); + void HandleUdpPacket(Packet *packet, bool overload); + void SecondLoop(); + + ProcessorStats GetStats(); + void ResetStats(); + + bool Start(); + + WgDevice &dev() { return dev_; } + + TunInterface::PrePostCommands &prepost() { return pre_post_; } + +private: + void DoWriteUdpPacket(Packet *packet); + void WritePacketToUdp(WgPeer *peer, Packet *packet); + void SendHandshakeInitiation(WgPeer *peer); + void SendHandshakeInitiationAndResetRetries(WgPeer *peer); + void SendKeepalive(WgPeer *peer); + void SendQueuedPackets(WgPeer *peer); + + void HandleHandshakeInitiationPacket(Packet *packet); + void HandleHandshakeResponsePacket(Packet *packet); + void HandleHandshakeCookiePacket(Packet *packet); + void HandleDataPacket(Packet *packet); + + void HandleAuthenticatedDataPacket(WgKeypair *keypair, Packet *packet, uint8 *data, size_t data_size); + + void HandleShortHeaderFormatPacket(uint32 tag, Packet *packet); + + bool CheckIncomingHandshakeRateLimit(Packet *packet, bool overload); + + bool HandleIcmpv6NeighborSolicitation(const byte *data, size_t data_size); + + void SetupCompressionHeader(WgPacketCompressionVer01 *c); + + int listen_port_; + + ProcessorDelegate *procdel_; + TunInterface *tun_; + UdpInterface *udp_; + int mtu_; + ProcessorStats stats_; + + bool dns_blocking_; + uint8 internet_blocking_; + bool add_routes_mode_; + bool network_discovery_spoofing_; + uint8 network_discovery_mac_[6]; + + WgDevice dev_; + + WgCidrAddr tun_addr_; + WgCidrAddr tun6_addr_; + + IpAddr dns_addr_, dns6_addr_; + + TunInterface::PrePostCommands pre_post_; +}; + diff --git a/wireguard_config.cpp b/wireguard_config.cpp new file mode 100644 index 0000000..3d51f62 --- /dev/null +++ b/wireguard_config.cpp @@ -0,0 +1,444 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . 
All Rights Reserved. +#include "stdafx.h" +#include "wireguard_config.h" +#include "netapi.h" +#include "tunsafe_endian.h" +#include "wireguard.h" +#include "util.h" +#include +#include +#include +#include +#include + +#if defined(OS_POSIX) +#include +#include +#include +#include +#include +#endif + +const char *print_ip_prefix(char buf[kSizeOfAddress], int family, const void *ip, int prefixlen) { + if (!inet_ntop(family, ip, buf, kSizeOfAddress - 8)) { + memcpy(buf, "unknown", 8); + } + if (prefixlen >= 0) + snprintf(buf + strlen(buf), 8, "/%d", prefixlen); + return buf; +} + +struct Addr { + byte addr[4]; + uint8 cidr; +}; + +static bool ParseCidrAddr(char *s, WgCidrAddr *out) { + char *slash = strchr(s, '/'); + if (!slash) + return false; + + *slash = 0; + int e = atoi(slash + 1); + if (e < 0) return false; + + if (inet_pton(AF_INET, s, out->addr) == 1) { + if (e > 32) return false; + out->cidr = e; + out->size = 32; + return true; + } + if (inet_pton(AF_INET6, s, out->addr) == 1) { + if (e > 128) return false; + out->cidr = e; + out->size = 128; + return true; + } + return false; +} + +struct hostent *gethostbyname_retry_on_failure(const char * name, bool *exit_flag) { + int attempt = 0; + static const uint8 retry_delays[] = {1, 2, 3, 5, 10, 20, 40, 60}; + + for (;;) { + hostent *he = gethostbyname(name); + if (he || exit_flag == NULL || *exit_flag) + return he; + + RINFO("Unable to resolve %s. Trying again in %d second(s)", name, retry_delays[attempt]); + OsInterruptibleSleep(retry_delays[attempt] * 1000); + if (*exit_flag) + return NULL; + + if (attempt != ARRAY_SIZE(retry_delays) - 1) + attempt++; + } +} + + +static bool ParseSockaddrInWithPort(char *s, IpAddr *sin, bool *exit_flag) { + memset(sin, 0, sizeof(IpAddr)); + if (*s == '[') { + char *end = strchr(s, ']'); + if (end == NULL) + return false; + *end = 0; + if (inet_pton(AF_INET6, s + 1, &sin->sin6.sin6_addr) != 1) + return false; + char *x = strchr(end + 1, ':'); + if (!x) + return false; + sin->sin.sin_family = AF_INET6; + sin->sin.sin_port = htons(atoi(x + 1)); + return true; + } + char *x = strchr(s, ':'); + if (!x) return false; + *x = 0; + hostent *he = gethostbyname_retry_on_failure(s, exit_flag); + if (!he) { + RERROR("Unable to resolve %s", s); + return false; + } + sin->sin.sin_family = AF_INET; + sin->sin.sin_port = htons(atoi(x + 1)); + memcpy(&sin->sin.sin_addr, he->h_addr_list[0], 4); + return true; +} + +static bool ParseSockaddrInWithoutPort(char *s, IpAddr *sin, bool *exit_flag) { + memset(sin, 0, sizeof(IpAddr)); + if (inet_pton(AF_INET6, s, &sin->sin6.sin6_addr) == 1) { + sin->sin.sin_family = AF_INET6; + return true; + } + hostent *he = gethostbyname_retry_on_failure(s, exit_flag); + if (!he) { + RERROR("Unable to resolve %s", s); + return false; + } + sin->sin.sin_family = AF_INET; + memcpy(&sin->sin.sin_addr, he->h_addr_list[0], 4); + return true; +} + +static bool ParseBase64Key(const char *s, uint8 key[32]) { + size_t size = 32; + return base64_decode((uint8*)s, strlen(s), key, &size) && size == 32; +} + +class WgFileParser { +public: + WgFileParser(WireguardProcessor *wg, bool *exit_flag) : wg_(wg), exit_flag_(exit_flag) {} + bool ParseFlag(const char *group, const char *key, char *value); + WireguardProcessor *wg_; + + void FinishGroup(); + struct Peer { + uint8 pub[32]; + uint8 psk[32]; + }; + Peer pi_; + WgPeer *peer_ = NULL; + bool *exit_flag_; + bool had_interface_ = false; +}; + +bool is_space(uint8_t c) { + return c == ' ' || c == '\r' || c == '\n' || c == '\t'; +} + + +void SplitString(char 
*s, int separator, std::vector<char*> *components) { + for (;;) { + while (is_space(*s)) s++; + char *d = strchr(s, separator); + if (d == NULL) { + if (*s) + components->push_back(s); + return; + } + *d = 0; + char *e = d; + while (e > s && is_space(e[-1])) + *--e = 0; + components->push_back(s); + s = d + 1; + } +} + +static bool ParseBoolean(const char *str, bool *value) { + if (_stricmp(str, "true") == 0 || + _stricmp(str, "yes") == 0 || + _stricmp(str, "1") == 0 || + _stricmp(str, "on") == 0) { + *value = true; + return true; + } + if (_stricmp(str, "false") == 0 || + _stricmp(str, "no") == 0 || + _stricmp(str, "0") == 0 || + _stricmp(str, "off") == 0) { + *value = false; + return true; + } + return false; +} + +static int ParseFeature(const char *str) { + size_t len = strlen(str); + int what = WG_BOOLEAN_FEATURE_WANTS; + if (len > 0) { + if (str[len - 1] == '?') + what = WG_BOOLEAN_FEATURE_SUPPORTS, len--; + else if (str[len - 1] == '!') + what = WG_BOOLEAN_FEATURE_ENFORCES, len--; + } + if (len == 5 && memcmp(str, "mac64", 5) == 0) + return what + WG_FEATURE_ID_SHORT_MAC * 16; + if (len == 12 && memcmp(str, "short_header", 12) == 0) + return what + WG_FEATURE_ID_SHORT_HEADER * 16; + if (len == 5 && memcmp(str, "ipzip", 5) == 0) + return what + WG_FEATURE_ID_IPZIP * 16; + if (len == 10 && memcmp(str, "skip_keyid", 10) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_IN * 16 + 1 * 4; + if (len == 13 && memcmp(str, "skip_keyid_in", 13) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_IN * 16; + if (len == 14 && memcmp(str, "skip_keyid_out", 14) == 0) + return what + WG_FEATURE_ID_SKIP_KEYID_OUT * 16; + return -1; +} + +static int ParseCipherSuite(const char *cipher) { + if (!strcmp(cipher, "chacha20-poly1305")) + return EXT_CIPHER_SUITE_CHACHA20POLY1305; + if (!strcmp(cipher, "aes128-gcm")) + return EXT_CIPHER_SUITE_AES128_GCM; + if (!strcmp(cipher, "aes256-gcm")) + return EXT_CIPHER_SUITE_AES256_GCM; + if (!strcmp(cipher, "none")) + return EXT_CIPHER_SUITE_NONE_POLY1305; + return -1; +} + +void WgFileParser::FinishGroup() { + if (peer_) { + peer_->Initialize(pi_.pub, pi_.psk); + peer_ = NULL; + } +} + +bool WgFileParser::ParseFlag(const char *group, const char *key, char *value) { + uint8 binkey[32]; + WgCidrAddr addr; + IpAddr sin; + std::vector<char*> ss; + bool ciphermode = false; + + if (strcmp(group, "[Interface]") == 0) { + if (key == NULL) return true; + if (strcmp(key, "PrivateKey") == 0) { + if (!ParseBase64Key(value, binkey)) + return false; + had_interface_ = true; + wg_->dev().Initialize(binkey); + } else if (strcmp(key, "ListenPort") == 0) { + wg_->SetListenPort(atoi(value)); + } else if (strcmp(key, "Address") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseCidrAddr(ss[i], &addr)) + return false; + if (!wg_->SetTunAddress(addr)) { + RERROR("Multiple Address not allowed"); + return false; + } + } + } else if (strcmp(key, "MTU") == 0) { + wg_->SetMtu(atoi(value)); + } else if (strcmp(key, "Table") == 0) { + bool mode; + if (!strcmp(value, "off")) { + mode = false; + } else if (!strcmp(value, "auto")) { + mode = true; + } else { + goto err; + } + wg_->SetAddRoutesMode(mode); + } else if (strcmp(key, "DNS") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseSockaddrInWithoutPort(ss[i], &sin, exit_flag_)) + return false; + if (!wg_->AddDnsServer(sin)) { + RERROR("Multiple DNS not allowed."); + return false; + } + } + } else if (strcmp(key, "BlockDNS") == 0) { + bool v; + if (!ParseBoolean(value, &v)) +
goto err; + wg_->SetDnsBlocking(v); + } else if (strcmp(key, "BlockInternet") == 0) { + uint8 v = kBlockInternet_Default; + + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (strcmp(ss[i], "route") == 0) { + if (v & 128) v = 0; + v |= kBlockInternet_Route; + } else if (strcmp(ss[i], "firewall") == 0) { + if (v & 128) v = 0; + v |= kBlockInternet_Firewall; + } else if (strcmp(ss[i], "off") == 0) + v = 0; + else if (strcmp(ss[i], "on") == 0) + v = kBlockInternet_DefaultOn; + else if (strcmp(ss[i], "default") == 0) + v = kBlockInternet_Default; + else + RERROR("Unknown mode in BlockInternet: %s", ss[i]); + } + + wg_->SetInternetBlocking((InternetBlockState)v); + } else if (strcmp(key, "HeaderObfuscation") == 0) { + wg_->SetHeaderObfuscation(value); + } else if (strcmp(key, "PostUp") == 0) { + wg_->prepost().post_up.emplace_back(value); + } else if (strcmp(key, "PostDown") == 0) { + wg_->prepost().post_down.emplace_back(value); + } else if (strcmp(key, "PreUp") == 0) { + wg_->prepost().pre_up.emplace_back(value); + } else if (strcmp(key, "PreDown") == 0) { + wg_->prepost().pre_down.emplace_back(value); + } else { + goto err; + } + } else if (strcmp(group, "[Peer]") == 0) { + if (key == NULL) { + if (!had_interface_) { + RERROR("Missing [Interface].PrivateKey."); + return false; + } + FinishGroup(); + peer_ = wg_->dev().AddPeer(); + memset(&pi_, 0, sizeof(pi_)); + return true; + } + if (strcmp(key, "PublicKey") == 0) { + if (!ParseBase64Key(value, pi_.pub)) + return false; + } else if (strcmp(key, "PresharedKey") == 0) { + if (!ParseBase64Key(value, pi_.psk)) + return false; + } else if (strcmp(key, "AllowedIPs") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + if (!ParseCidrAddr(ss[i], &addr)) + return false; + if (!peer_->AddIp(addr)) + return false; + } + } else if (strcmp(key, "Endpoint") == 0) { + if (!ParseSockaddrInWithPort(value, &sin, exit_flag_)) + return false; + peer_->SetEndpoint(sin); + } else if (strcmp(key, "PersistentKeepalive") == 0) { + peer_->SetPersistentKeepalive(atoi(value)); + } else if (strcmp(key, "AllowMulticast") == 0) { + bool b; + if (!ParseBoolean(value, &b)) + return false; + peer_->SetAllowMulticast(b); + } else if (strcmp(key, "Features") == 0) { + SplitString(value, ',', &ss); + for (size_t i = 0; i < ss.size(); i++) { + int v = ParseFeature(ss[i]); + if (v < 0) + return false; + for (;; v += 12) { + peer_->SetFeature(v >> 4, v & 3); + if (!(v & 12)) + break; + } + } + } else if (strcmp(key, "Ciphers") == 0 || (ciphermode = true, strcmp(key, "Ciphers!") == 0)) { + SplitString(value, ',', &ss); + peer_->SetCipherPrio(ciphermode); + for (size_t i = 0; i < ss.size(); i++) { + int v = ParseCipherSuite(ss[i]); + if (v < 0 || !peer_->AddCipher(v)) + return false; + } + } else { + goto err; + } + } else { +err: + return false; + } + return true; +} + +bool ParseWireGuardConfigFile(WireguardProcessor *wg, const char *filename, bool *exit_flag) { + char buf[1024]; + char group[32] = {0}; + + WgFileParser file_parser(wg, exit_flag); + + RINFO("Loading file: %s", filename); + + FILE *f = fopen(filename, "r"); + if (!f) { + RERROR("Unable to open: %s", filename); + return false; + } + + while (fgets(buf, sizeof(buf), f)) { + size_t l = strlen(buf); + while (l && is_space(buf[l - 1])) + buf[--l] = 0; + if (buf[0] == '#' || buf[0] == '\0') + continue; + + if (buf[0] == '[') { + size_t len = strlen(buf); + if (len < sizeof(group)) { + memcpy(group, buf, len + 1); + if (!file_parser.ParseFlag(group, NULL, 
NULL)) { + RERROR("Error parsing %s", group); + fclose(f); + return false; + } + } + continue; + } + char *sep = strchr(buf, '='); + if (!sep) { + RERROR("Missing = on line: %s", buf); + continue; + } + char *sepe = sep; + while (sepe > buf && is_space(sepe[-1])) + sepe--; + *sepe = 0; + + // trim space after = + sep++; + while (is_space(*sep)) + sep++; + + if (!file_parser.ParseFlag(group, buf, sep)) { + RERROR("Error parsing %s.%s = %s", group, buf, sep); + fclose(f); + return false; + } + } + file_parser.FinishGroup(); + fclose(f); + return true; +} diff --git a/wireguard_config.h b/wireguard_config.h new file mode 100644 index 0000000..03d7899 --- /dev/null +++ b/wireguard_config.h @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#ifndef TINYVPN_TINYVPN_H_ +#define TINYVPN_TINYVPN_H_ + +class WireguardProcessor; + +bool ParseWireGuardConfigFile(WireguardProcessor *wg, const char *filename, bool *exit_flag); + +#define kSizeOfAddress 64 +const char *print_ip_prefix(char buf[kSizeOfAddress], int family, const void *ip, int prefixlen); + + + +#endif // TINYVPN_TINYVPN_H_ diff --git a/wireguard_proto.cpp b/wireguard_proto.cpp new file mode 100644 index 0000000..ad20a53 --- /dev/null +++ b/wireguard_proto.cpp @@ -0,0 +1,1307 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved. +#include "stdafx.h" +#include "wireguard_proto.h" +#include "crypto/chacha20poly1305.h" +#include "crypto/blake2s.h" +#include "crypto/curve25519-donna.h" +#include "crypto/aesgcm/aes.h" +#include "crypto/siphash.h" +#include "tunsafe_endian.h" +#include "util.h" +#include "crypto_ops.h" +#include "bit_ops.h" +#include "tunsafe_cpu.h" +#include <algorithm> +#include <assert.h> +#include <stdlib.h> +#include <string.h> + +static const uint8 kLabelCookie[] = {'c', 'o', 'o', 'k', 'i', 'e', '-', '-'}; +static const uint8 kLabelMac1[] = {'m', 'a', 'c', '1', '-', '-', '-', '-'}; +static const uint8 kWgInitHash[WG_HASH_LEN] = {0x22,0x11,0xb3,0x61,0x08,0x1a,0xc5,0x66,0x69,0x12,0x43,0xdb,0x45,0x8a,0xd5,0x32,0x2d,0x9c,0x6c,0x66,0x22,0x93,0xe8,0xb7,0x0e,0xe1,0x9c,0x65,0xba,0x07,0x9e,0xf3}; +static const uint8 kWgInitChainingKey[WG_HASH_LEN] = {0x60,0xe2,0x6d,0xae,0xf3,0x27,0xef,0xc0,0x2e,0xc3,0x35,0xe2,0xa0,0x25,0xd2,0xd0,0x16,0xeb,0x42,0x06,0xf8,0x72,0x77,0xf5,0x2d,0x38,0xd1,0x98,0x8b,0x78,0xcd,0x36}; +static const uint8 kCurve25519Basepoint[32] = {9}; + +IpToPeerMap::IpToPeerMap() { + +} + +IpToPeerMap::~IpToPeerMap() { +} + +bool IpToPeerMap::InsertV4(const void *addr, int cidr, void *peer) { + uint32 mask = cidr == 32 ?
0xffffffff : ~(0xffffffff >> cidr); + Entry4 e = {ReadBE32(addr) & mask, mask, peer}; + ipv4_.push_back(e); + return true; +} + +bool IpToPeerMap::InsertV6(const void *addr, int cidr, void *peer) { + Entry6 e; + e.cidr_len = cidr; + e.peer = peer; + memcpy(e.ip, addr, 16); + ipv6_.push_back(e); + return true; +} + +void *IpToPeerMap::LookupV4(uint32 ip) { + uint32 best_mask = 0; + void *best_peer = NULL; + for (auto it = ipv4_.begin(); it != ipv4_.end(); ++it) { + if (it->ip == (ip & it->mask) && it->mask >= best_mask) { + best_mask = it->mask; + best_peer = it->peer; + } + } + return best_peer; +} + +void *IpToPeerMap::LookupV4DefaultPeer() { + for (auto it = ipv4_.begin(); it != ipv4_.end(); ++it) { + if (it->mask == 0) + return it->peer; + } + return NULL; +} + +void *IpToPeerMap::LookupV6DefaultPeer() { + for (auto it = ipv6_.begin(); it != ipv6_.end(); ++it) { + if (it->cidr_len == 0) + return it->peer; + } + return NULL; +} + +static int CalculateIPv6CommonPrefix(const uint8 *a, const uint8 *b) { + uint64 x = ToBE64(*(uint64*)&a[0] ^ *(uint64*)&b[0]); + uint64 y = ToBE64(*(uint64*)&a[8] ^ *(uint64*)&b[8]); + return x ? 64 - FindHighestSetBit64(x) : 128 - FindHighestSetBit64(y); +} + +void *IpToPeerMap::LookupV6(const void *addr) { + int best_len = 0; + void *best_peer = NULL; + for (auto it = ipv6_.begin(); it != ipv6_.end(); ++it) { + int len = CalculateIPv6CommonPrefix((const uint8*)addr, it->ip); + // An entry matches when the common prefix covers its whole CIDR; prefer + // the entry with the longest CIDR, i.e. the most specific match. + if (len >= it->cidr_len && it->cidr_len >= best_len) { + best_len = it->cidr_len; + best_peer = it->peer; + } + } + return best_peer; +} + +void IpToPeerMap::RemovePeer(void *peer) { + { + size_t n = ipv4_.size(); + Entry4 *r = &ipv4_[0], *w = r; + for (size_t i = 0; i != n; i++, r++) { + if (r->peer != peer) + *w++ = *r; + } + ipv4_.resize(w - &ipv4_[0]); + } + { + size_t n = ipv6_.size(); + Entry6 *r = &ipv6_[0], *w = r; + for (size_t i = 0; i != n; i++, r++) { + if (r->peer != peer) + *w++ = *r; + } + ipv6_.resize(w - &ipv6_[0]); + } +} + +ReplayDetector::ReplayDetector() { + expected_seq_nr_ = 0; + memset(bitmap_, 0, sizeof(bitmap_)); +} + +ReplayDetector::~ReplayDetector() { +} + +bool ReplayDetector::CheckReplay(uint64 seq_nr) { + uint64 slot = seq_nr / BITS_PER_ENTRY; + if (seq_nr >= expected_seq_nr_) { + uint64 prev_slot = (expected_seq_nr_ + BITS_PER_ENTRY - 1) / BITS_PER_ENTRY - 1, n; + if ((n = slot - prev_slot) != 0) { + size_t nn = (size_t)std::min<uint64>(n, BITMAP_SIZE); + do { + bitmap_[(prev_slot + nn) & BITMAP_MASK] = 0; + } while (--nn); + } + expected_seq_nr_ = seq_nr + 1; + } else if (seq_nr + WINDOW_SIZE <= expected_seq_nr_) { + return false; + } + uint32 mask = 1 << (seq_nr & (BITS_PER_ENTRY - 1)), prev; + prev = bitmap_[slot & BITMAP_MASK]; + bitmap_[slot & BITMAP_MASK] = prev | mask; + return (prev & mask) == 0; +} + +WgDevice::WgDevice() { + peers_ = NULL; + header_obfuscation_ = false; + next_rng_slot_ = 0; + last_complete_handskake_timestamp_ = 0; + memset(&compression_header_, 0, sizeof(compression_header_)); + + low_resolution_timestamp_ = cookie_secret_timestamp_ = OsGetMilliseconds(); + OsGetRandomBytes(cookie_secret_, sizeof(cookie_secret_)); + OsGetRandomBytes((uint8*)random_number_input_, sizeof(random_number_input_)); + +} + +WgDevice::~WgDevice() { +} + +void WgDevice::SecondLoop(uint64 now) { + low_resolution_timestamp_ = now; + + if (rate_limiter_.is_used()) { + uint32 k[5]; + for (size_t i = 0; i < ARRAY_SIZE(k); i++) + k[i] = GetRandomNumber(); + rate_limiter_.Periodic(k); + } +} + +uint32 WgDevice::InsertInKeyIdLookup(WgPeer *peer, WgKeypair *kp) { + assert(peer); + for
(;;) { + uint32 v = GetRandomNumber(); + if (v == 0) + continue; + std::pair<WgPeer*, WgKeypair*> &peer_and_keypair = key_id_lookup_[v]; + if (peer_and_keypair.first == NULL) { + peer_and_keypair = std::make_pair(peer, kp); + uint32 &x = (kp ? kp->local_key_id : peer->local_key_id_during_hs_); + uint32 old = x; + x = v; + if (old) + key_id_lookup_.erase(old); + return v; + } + } +} + +uint32 WgDevice::GetRandomNumber() { + size_t slot; + if ((slot = next_rng_slot_) == 0) { + blake2s(random_number_output_, sizeof(random_number_output_), random_number_input_, sizeof(random_number_input_), NULL, 0); + random_number_input_[0]++; + slot = BLAKE2S_OUTBYTES / 4; + } + next_rng_slot_ = (uint8) --slot; + return random_number_output_[slot]; +} + +static void BlakeX2(uint8 *dst, size_t dst_size, const uint8 *a, size_t a_size, const uint8 *b, size_t b_size) { + blake2s_state b2s; + blake2s_init(&b2s, dst_size); + blake2s_update(&b2s, a, a_size); + blake2s_update(&b2s, b, b_size); + blake2s_final(&b2s, dst, dst_size); +} + +static inline void BlakeMix(uint8 dst[WG_HASH_LEN], const uint8 *a, size_t a_size) { + BlakeX2(dst, WG_HASH_LEN, dst, WG_HASH_LEN, a, a_size); +} + +static inline void ComputeHKDF2DH(uint8 ci[WG_HASH_LEN], uint8 k[WG_SYMMETRIC_KEY_LEN], const uint8 priv[WG_PUBLIC_KEY_LEN], const uint8 pub[WG_PUBLIC_KEY_LEN]) { + uint8 dh[WG_PUBLIC_KEY_LEN]; + curve25519_donna(dh, priv, pub); + blake2s_hkdf(ci, WG_HASH_LEN, k, WG_SYMMETRIC_KEY_LEN, NULL, 32, dh, sizeof(dh), ci, WG_HASH_LEN); + memzero_crypto(dh, sizeof(dh)); +} + +void WgDevice::Initialize(const uint8 private_key[WG_PUBLIC_KEY_LEN]) { + // Derive the public key from the private key. + memcpy(s_priv_, private_key, sizeof(s_priv_)); + curve25519_donna(s_pub_, s_priv_, kCurve25519Basepoint); + + // Precompute: precomputed_cookie_key_ := HASH(LABEL-COOKIE || Spub_m) + // precomputed_mac1_key_ := HASH(LABEL-MAC1 || Spub_m) + BlakeX2(precomputed_cookie_key_, sizeof(precomputed_cookie_key_), + kLabelCookie, sizeof(kLabelCookie), s_pub_, sizeof(s_pub_)); + BlakeX2(precomputed_mac1_key_, sizeof(precomputed_mac1_key_), + kLabelMac1, sizeof(kLabelMac1), s_pub_, sizeof(s_pub_)); +} + +WgPeer *WgDevice::AddPeer() { + WgPeer *peer = new WgPeer(this); + WgPeer **pp = &peers_; + while (*pp) + pp = &(*pp)->next_peer_; + *pp = peer; + return peer; +} + +WgPeer *WgDevice::GetPeerFromPublicKey(uint8 public_key[WG_PUBLIC_KEY_LEN]) { + for (WgPeer *peer = peers_; peer; peer = peer->next_peer_) { + if (memcmp(peer->s_remote_, public_key, WG_PUBLIC_KEY_LEN) == 0) + return peer; + } + return NULL; +} + +bool WgDevice::CheckCookieMac1(Packet *packet) { + uint8 mac[WG_COOKIE_LEN]; + const uint8 *data = packet->data; + size_t data_size = packet->size; + + blake2s(mac, sizeof(mac), data, data_size - WG_COOKIE_LEN * 2, precomputed_mac1_key_, sizeof(precomputed_mac1_key_)); + return !memcmp_crypto(mac, data + data_size - WG_COOKIE_LEN * 2, WG_COOKIE_LEN); +} + +void WgDevice::MakeCookie(uint8 cookie[WG_COOKIE_LEN], Packet *packet) { + blake2s_state b2s; + uint64 now = OsGetMilliseconds(); + if (now - cookie_secret_timestamp_ >= COOKIE_SECRET_MAX_AGE_MS) { + cookie_secret_timestamp_ = now; + OsGetRandomBytes(cookie_secret_, sizeof(cookie_secret_)); + } + blake2s_init_key(&b2s, WG_COOKIE_LEN, cookie_secret_, sizeof(cookie_secret_)); + if (packet->addr.sin.sin_family == AF_INET) + blake2s_update(&b2s, &packet->addr.sin.sin_addr, 4); + else if (packet->addr.sin.sin_family == AF_INET6) + blake2s_update(&b2s, &packet->addr.sin6.sin6_addr, sizeof(packet->addr.sin6.sin6_addr));
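+ // sockaddr_in and sockaddr_in6 store the port at the same offset, so + // hashing sin6_port below also covers the AF_INET case.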
+ blake2s_update(&b2s, &packet->addr.sin6.sin6_port, 2); + blake2s_final(&b2s, cookie, WG_COOKIE_LEN); +} + +bool WgDevice::CheckCookieMac2(Packet *packet) { + uint8 cookie[WG_COOKIE_LEN]; + uint8 mac[WG_COOKIE_LEN]; + MakeCookie(cookie, packet); + blake2s(mac, sizeof(mac), packet->data, packet->size - WG_COOKIE_LEN, cookie, sizeof(cookie)); + return !memcmp_crypto(mac, packet->data + packet->size - WG_COOKIE_LEN, WG_COOKIE_LEN); +} + +void WgDevice::CreateCookieMessage(MessageHandshakeCookie *dst, Packet *packet, uint32 remote_key_id) { + dst->type = MESSAGE_HANDSHAKE_COOKIE; + dst->receiver_key_id = remote_key_id; + MakeCookie(dst->cookie_enc, packet); + OsGetRandomBytes(dst->nonce, sizeof(dst->nonce)); + MessageMacs *mac = (MessageMacs *)(packet->data + packet->size - sizeof(MessageMacs)); + xchacha20poly1305_encrypt(dst->cookie_enc, dst->cookie_enc, WG_COOKIE_LEN, mac->mac1, WG_COOKIE_LEN, dst->nonce, precomputed_cookie_key_); +} + +void WgDevice::EraseKeypairAddrEntry(WgKeypair *kp) { + WgAddrEntry *ae = kp->addr_entry; + + assert(ae->ref_count >= 1); + assert(ae->ref_count == !!ae->keys[0] + !!ae->keys[1] + !!ae->keys[2]); + assert(ae->keys[kp->addr_entry_slot - 1] == kp); + + kp->addr_entry = NULL; + + ae->keys[kp->addr_entry_slot - 1] = NULL; + kp->addr_entry_slot = 0; + + if (ae->ref_count-- == 1) { + addr_entry_lookup_.erase(ae->addr_entry_id); + delete ae; + } +} + +void WgDevice::UpdateKeypairAddrEntry(uint64 addr_id, WgKeypair *keypair) { + if (keypair->addr_entry != NULL && keypair->addr_entry->addr_entry_id == addr_id) { + keypair->broadcast_short_key = 1; + return; + } + + if (keypair->addr_entry != NULL) + EraseKeypairAddrEntry(keypair); + + WgAddrEntry **aep = &addr_entry_lookup_[addr_id], *ae; + + if ((ae = *aep) == NULL) { + *aep = ae = new WgAddrEntry(addr_id); + } else { + // Ensure we don't insert new things in this addr entry too often. + if (ae->time_of_last_insertion + 1000 * 60 > low_resolution_timestamp_) + return; + } + + ae->time_of_last_insertion = low_resolution_timestamp_; + + // Update slot # + uint32 next_slot = ae->next_slot; + ae->next_slot = (next_slot == 2) ? 
0 : next_slot + 1; + + WgKeypair *old_keypair = ae->keys[next_slot]; + ae->keys[next_slot] = keypair; + keypair->addr_entry = ae; + keypair->addr_entry_slot = next_slot + 1; + if (old_keypair != NULL) { + old_keypair->addr_entry = NULL; + old_keypair->addr_entry_slot = 0; + } else { + ae->ref_count++; + } + assert(ae->ref_count == !!ae->keys[0] + !!ae->keys[1] + !!ae->keys[2]); + + keypair->broadcast_short_key = 1; +} + +//>>> hashlib.sha256('TunSafe Header Obfuscation Key').hexdigest() +//'2444423e33eb5bb875961224c6441f54c5dea95a3a4e1139509ffa6992bdb278' +static const uint8 kHeaderObfuscationKey[32] = {36, 68, 66, 62, 51, 235, 91, 184, 117, 150, 18, 36, 198, 68, 31, 84, 197, 222, 169, 90, 58, 78, 17, 57, 80, 159, 250, 105, 146, 189, 178, 120}; + +void WgDevice::SetHeaderObfuscation(const char *key) { +#if WITH_HEADER_OBFUSCATION + header_obfuscation_ = (key != NULL); + if (key) + blake2s_hmac((uint8*)&header_obfuscation_key_, sizeof(header_obfuscation_key_), (uint8*)key, strlen(key), kHeaderObfuscationKey, sizeof(kHeaderObfuscationKey)); +#endif // WITH_HEADER_OBFUSCATION +} + + +WgPeer::WgPeer(WgDevice *dev) { + dev_ = dev; + endpoint_.sin.sin_family = 0; + next_peer_ = NULL; + curr_keypair_ = next_keypair_ = prev_keypair_ = NULL; + expect_cookie_reply_ = false; + has_mac2_cookie_ = false; + allow_multicast_through_peer_ = false; + supports_handshake_extensions_ = true; + local_key_id_during_hs_ = 0; + last_handshake_init_timestamp_ = -1000000ll; + last_handshake_init_recv_timestamp_ = 0; + last_complete_handskake_timestamp_ = 0; + persistent_keepalive_ms_ = 0; + timers_ = 0; + first_queued_packet_ = NULL; + last_queued_packet_ptr_ = &first_queued_packet_; + num_queued_packets_ = 0; + handshake_attempts_ = 0; + num_ciphers_ = 0; + cipher_prio_ = 0; + memset(last_timestamp_, 0, sizeof(last_timestamp_)); + ipv4_broadcast_addr_ = 0xffffffff; + memset(features_, 0, sizeof(features_)); +} + +WgPeer::~WgPeer() { + ClearKeys(); + ClearHandshake(); + ClearPacketQueue(); +} + +void WgPeer::ClearPacketQueue() { + Packet *packet; + while ((packet = first_queued_packet_) != NULL) { + first_queued_packet_ = packet->next; + FreePacket(packet); + } + last_queued_packet_ptr_ = &first_queued_packet_; + num_queued_packets_ = 0; +} + +void WgPeer::Initialize(const uint8 spub[WG_PUBLIC_KEY_LEN], const uint8 preshared_key[WG_SYMMETRIC_KEY_LEN]) { + // Optionally use a preshared key; it defaults to all zeros.
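+ // (A NULL |preshared_key| yields the all-zero key that plain WireGuard + // assumes; a configured key is mixed in at the KDF3 step of the handshake.)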
+ if (preshared_key) + memcpy(preshared_key_, preshared_key, sizeof(preshared_key_)); + else + memset(preshared_key_, 0, sizeof(preshared_key_)); + // Precompute: s_priv_pub_ := DH(spriv_local, spub_remote) + memcpy(s_remote_, spub, sizeof(s_remote_)); + curve25519_donna(s_priv_pub_, dev_->s_priv_, s_remote_); + // Precompute: precomputed_cookie_key_ := HASH(LABEL-COOKIE || Spub_m) + // precomputed_mac1_key_ := HASH(LABEL-MAC1 || Spub_m) + BlakeX2(precomputed_cookie_key_, sizeof(precomputed_cookie_key_), + kLabelCookie, sizeof(kLabelCookie), spub, WG_PUBLIC_KEY_LEN); + BlakeX2(precomputed_mac1_key_, sizeof(precomputed_mac1_key_), + kLabelMac1, sizeof(kLabelMac1), spub, WG_PUBLIC_KEY_LEN); +} + +// Run on the client (the initiator). +void WgPeer::CreateMessageHandshakeInitiation(Packet *packet) { + uint8 k[WG_SYMMETRIC_KEY_LEN]; + MessageHandshakeInitiation *dst = (MessageHandshakeInitiation *)packet->data; + + // Ci := HASH(CONSTRUCTION) + memcpy(hs_.ci, kWgInitChainingKey, sizeof(hs_.ci)); + // Hi := HASH(Ci || IDENTIFIER) + memcpy(hs_.hi, kWgInitHash, sizeof(hs_.hi)); + // Hi := HASH(Hi || Spub_r) + BlakeMix(hs_.hi, s_remote_, sizeof(s_remote_)); + // (Epriv_i, Epub_i) := DH-GENERATE() + // msg.ephemeral = Epub_i + OsGetRandomBytes(hs_.e_priv, sizeof(hs_.e_priv)); + curve25519_normalize(hs_.e_priv); + curve25519_donna(dst->ephemeral, hs_.e_priv, kCurve25519Basepoint); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(hs_.ci, sizeof(hs_.ci), NULL, 32, NULL, 32, dst->ephemeral, sizeof(dst->ephemeral), hs_.ci, WG_HASH_LEN); + // Hi := HASH(Hi || msg.ephemeral) + BlakeMix(hs_.hi, dst->ephemeral, sizeof(dst->ephemeral)); + // (Ci, K) := KDF2(Ci, DH(epriv, spub_r)) + ComputeHKDF2DH(hs_.ci, k, hs_.e_priv, s_remote_); + // msg.static = AEAD(K, 0, Spub_i, Hi) + chacha20poly1305_encrypt(dst->static_enc, dev_->s_pub_, sizeof(dev_->s_pub_), hs_.hi, sizeof(hs_.hi), 0, k); + // Hi := HASH(Hi || msg.static) + BlakeMix(hs_.hi, dst->static_enc, sizeof(dst->static_enc)); + // (Ci, K) := KDF2(Ci, DH(spriv_i, spub_r)) + blake2s_hkdf(hs_.ci, sizeof(hs_.ci), k, sizeof(k), NULL, 32, s_priv_pub_, sizeof(s_priv_pub_), hs_.ci, WG_HASH_LEN); + // TAI64N + OsGetTimestampTAI64N(dst->timestamp_enc); + + size_t extfield_size = 0; +#if WITH_HANDSHAKE_EXT + if (supports_handshake_extensions_) + extfield_size = WriteHandshakeExtension(dst->timestamp_enc + WG_TIMESTAMP_LEN, NULL); +#endif // WITH_HANDSHAKE_EXT + // msg.timestamp := AEAD(K, 0, timestamp, Hi) + chacha20poly1305_encrypt(dst->timestamp_enc, dst->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN, hs_.hi, sizeof(hs_.hi), 0, k); + // Hi := HASH(Hi || msg.timestamp) + BlakeMix(hs_.hi, dst->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN); + + packet->size = (unsigned)(sizeof(MessageHandshakeInitiation) + extfield_size); + + // Insert a pointer to this object in the key id table. + dst->sender_key_id = dev_->InsertInKeyIdLookup(this, NULL); + dst->type = MESSAGE_HANDSHAKE_INITIATION; + memzero_crypto(k, sizeof(k)); + WriteMacToPacket((uint8*)dst, (MessageMacs*)((uint8*)&dst->mac + extfield_size)); +} + +// Parsed by the server (the responder). +WgPeer *WgPeer::ParseMessageHandshakeInitiation(WgDevice *dev, Packet *packet) { + // Copy values into handshake once we've validated it all.
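+ // Rough flow: rebuild (Ci, Hi) from the initiator's message, decrypt the + // static key to find the peer, reject replayed timestamps and handshake + // floods, then construct the response in place over |src| and derive the + // transport keypair.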
+ uint8 ci[WG_HASH_LEN]; + uint8 hi[WG_HASH_LEN]; + union { + uint8 k[WG_SYMMETRIC_KEY_LEN]; + uint8 e_priv[WG_PUBLIC_KEY_LEN]; + }; + union { + uint8 spubi[WG_PUBLIC_KEY_LEN]; + uint8 e_remote[WG_PUBLIC_KEY_LEN]; + uint8 hi2[WG_HASH_LEN]; + }; + uint8 t[WG_HASH_LEN]; + WgPeer *peer; + WgKeypair *keypair; + uint32 remote_key_id; + uint64 now; + uint8 extbuf[MAX_SIZE_OF_HANDSHAKE_EXTENSION + WG_TIMESTAMP_LEN]; + MessageHandshakeInitiation *src = (MessageHandshakeInitiation *)packet->data; + MessageHandshakeResponse *dst; + size_t extfield_size; + + // Ci := HASH(CONSTRUCTION) + memcpy(ci, kWgInitChainingKey, sizeof(ci)); + // Hi := HASH(Ci || IDENTIFIER) + memcpy(hi, kWgInitHash, sizeof(hi)); + // Hi := HASH(Hi || Spub_r) + BlakeMix(hi, dev->s_pub_, sizeof(dev->s_pub_)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(ci, sizeof(ci), NULL, 32, NULL, 32, src->ephemeral, sizeof(src->ephemeral), ci, WG_HASH_LEN); + // Hi := HASH(Hi || msg.ephemeral) + BlakeMix(hi, src->ephemeral, sizeof(src->ephemeral)); + // (Ci, K) := KDF2(Ci, DH(spriv, msg.ephemeral)) + ComputeHKDF2DH(ci, k, dev->s_priv_, src->ephemeral); + // Spub_i = AEAD_DEC(K, 0, msg.static, Hi) + if (!chacha20poly1305_decrypt(spubi, src->static_enc, sizeof(src->static_enc), hi, sizeof(hi), 0, k)) + goto getout; + // Hi := HASH(Hi || msg.static) + BlakeMix(hi, src->static_enc, sizeof(src->static_enc)); + // Lookup the peer with this ID + if (!(peer = dev->GetPeerFromPublicKey(spubi))) + goto getout; + // (Ci, K) := KDF2(Ci, DH(sprivr, spubi)) + blake2s_hkdf(ci, sizeof(ci), k, sizeof(k), NULL, 32, peer->s_priv_pub_, sizeof(peer->s_priv_pub_), ci, WG_HASH_LEN); + // Hi2 := Hi + memcpy(hi2, hi, sizeof(hi2)); + extfield_size = packet->size - sizeof(MessageHandshakeInitiation); + if (extfield_size > MAX_SIZE_OF_HANDSHAKE_EXTENSION || (extfield_size && !peer->supports_handshake_extensions_)) + goto getout; + // Hi := HASH(Hi || msg.timestamp) + BlakeMix(hi, src->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN); + // TIMESTAMP := AEAD_DEC(K, 0, msg.timestamp, hi2) + if (!chacha20poly1305_decrypt(extbuf, src->timestamp_enc, extfield_size + WG_TIMESTAMP_LEN + WG_MAC_LEN, hi2, sizeof(hi2), 0, k)) + goto getout; + // Replay attack? + if (memcmp(extbuf, peer->last_timestamp_, WG_TIMESTAMP_LEN) <= 0) + goto getout; + // Flood attack? 
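+ // (Accept at most one handshake initiation per MIN_HANDSHAKE_INTERVAL_MS + // from a given peer.)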
+ now = OsGetMilliseconds(); + if (now < peer->last_handshake_init_recv_timestamp_ + MIN_HANDSHAKE_INTERVAL_MS) + goto getout; + + // Remember all the information we need to produce a response because we cannot touch |src| again + peer->last_handshake_init_recv_timestamp_ = now; + memcpy(peer->last_timestamp_, extbuf, sizeof(peer->last_timestamp_)); + + memcpy(e_remote, src->ephemeral, sizeof(e_remote)); + remote_key_id = src->sender_key_id; + + dst = (MessageHandshakeResponse *)src; + + // (Epriv_r, Epub_r) := DH-GENERATE() + // msg.ephemeral = Epub_r + OsGetRandomBytes(e_priv, sizeof(e_priv)); + curve25519_normalize(e_priv); + curve25519_donna(dst->ephemeral, e_priv, kCurve25519Basepoint); + // Hr := HASH(Hr || msg.ephemeral) + BlakeMix(hi, dst->ephemeral, sizeof(dst->ephemeral)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(ci, sizeof(ci), NULL, 32, NULL, 32, dst->ephemeral, sizeof(dst->ephemeral), ci, WG_HASH_LEN); + // Ci := KDF2(Ci, DH(epriv, epub)) + ComputeHKDF2DH(ci, NULL, e_priv, e_remote); + // Ci := KDF2(Ci, DH(epriv, spub)) + ComputeHKDF2DH(ci, NULL, e_priv, peer->s_remote_); + // (Ci, T, K) := KDF3(Ci, Q) + blake2s_hkdf(ci, sizeof(ci), t, sizeof(t), k, sizeof(k), peer->preshared_key_, sizeof(preshared_key_), ci, WG_HASH_LEN); + // Hr := HASH(Hr || T) + BlakeMix(hi, t, sizeof(t)); + + dst->receiver_key_id = remote_key_id; + keypair = peer->CreateNewKeypair(false, ci, remote_key_id, extbuf + WG_TIMESTAMP_LEN, extfield_size); + if (keypair) { + peer->InsertKeypairInPeer(keypair); + dst->sender_key_id = dev->InsertInKeyIdLookup(peer, keypair); + + size_t extfield_out_size = 0; +#if WITH_HANDSHAKE_EXT + if (extfield_size) + extfield_out_size = peer->WriteHandshakeExtension(dst->empty_enc, keypair); +#endif // WITH_HANDSHAKE_EXT + packet->size = (unsigned)(sizeof(MessageHandshakeResponse) + extfield_out_size); + + // msg.empty := AEAD(K, 0, "", Hr) + chacha20poly1305_encrypt(dst->empty_enc, dst->empty_enc, extfield_out_size, hi, sizeof(hi), 0, k); + // Hr := HASH(Hr || "") + //BlakeMix(hi, dst->empty_enc, extfield_out_size); + + dst->type = MESSAGE_HANDSHAKE_RESPONSE; + peer->WriteMacToPacket((uint8*)dst, (MessageMacs*)((uint8*)&dst->mac + extfield_out_size)); + } else { +getout: + peer = NULL; + } + memzero_crypto(hi, sizeof(hi)); + memzero_crypto(ci, sizeof(ci)); + memzero_crypto(k, sizeof(k)); + memzero_crypto(t, sizeof(t)); + return peer; +} + +WgPeer *WgPeer::ParseMessageHandshakeResponse(WgDevice *dev, const Packet *packet) { + MessageHandshakeResponse *src = (MessageHandshakeResponse *)packet->data; + uint8 t[WG_HASH_LEN]; + uint8 k[WG_SYMMETRIC_KEY_LEN]; + WgKeypair *keypair; + auto it = dev->key_id_lookup().find(src->receiver_key_id); + if (it == dev->key_id_lookup().end() || it->second.second != NULL) + return NULL; + WgPeer *peer = it->second.first; + + assert(src->receiver_key_id == peer->local_key_id_during_hs_); + + HandshakeState hs = peer->hs_; + // Hr := HASH(Hr || msg.ephemeral) + BlakeMix(hs.hi, src->ephemeral, sizeof(src->ephemeral)); + // Ci := KDF_1(Ci, msg.ephemeral) + blake2s_hkdf(hs.ci, sizeof(hs.ci), NULL, 32, NULL, 32, src->ephemeral, sizeof(src->ephemeral), hs.ci, sizeof(hs.ci)); + // Ci := KDF2(Ci, DH(epriv, epub)) + ComputeHKDF2DH(hs.ci, NULL, hs.e_priv, src->ephemeral); + // Ci := KDF2(Ci, DH(spriv, epub)) + ComputeHKDF2DH(hs.ci, NULL, peer->dev_->s_priv_, src->ephemeral); + // (Ci, T, K) := KDF3(Ci, Q) + blake2s_hkdf(hs.ci, sizeof(hs.ci), t, sizeof(t), k, sizeof(k), peer->preshared_key_, sizeof(peer->preshared_key_), hs.ci, sizeof(hs.ci)); + //
Hr := HASH(Hr || T) + BlakeMix(hs.hi, t, sizeof(t)); + + size_t extfield_size = packet->size - sizeof(MessageHandshakeResponse); + if (extfield_size > MAX_SIZE_OF_HANDSHAKE_EXTENSION) + goto getout; + + // "" := AEAD_DEC(K, 0, msg.empty, Hr) + if (!chacha20poly1305_decrypt(src->empty_enc, src->empty_enc, extfield_size + sizeof(src->empty_enc), hs.hi, sizeof(hs.hi), 0, k)) + goto getout; + + keypair = peer->CreateNewKeypair(true, hs.ci, src->sender_key_id, src->empty_enc, extfield_size); + if (!keypair) + goto getout; + + peer->InsertKeypairInPeer(keypair); + + // Re-map the entry in the id table so it points at this keypair instead. + keypair->local_key_id = peer->local_key_id_during_hs_; + peer->local_key_id_during_hs_ = 0; + it->second.second = keypair; + + if (0) { +getout: + peer = NULL; + } + memzero_crypto(t, sizeof(t)); + memzero_crypto(k, sizeof(k)); + memzero_crypto(&hs, sizeof(hs)); + + return peer; +} + +// This is parsed by the initiator, when it needs to re-send the handshake message with a better mac. +void WgPeer::ParseMessageHandshakeCookie(WgDevice *dev, const MessageHandshakeCookie *src) { + uint8 cookie[WG_COOKIE_LEN]; + auto it = dev->key_id_lookup().find(src->receiver_key_id); + if (it == dev->key_id_lookup().end() || it->second.second != NULL) + return; + WgPeer *peer = it->second.first; + if (!peer->expect_cookie_reply_) + return; + if (!xchacha20poly1305_decrypt(cookie, src->cookie_enc, sizeof(src->cookie_enc), + peer->sent_mac1_, sizeof(peer->sent_mac1_), src->nonce, peer->precomputed_cookie_key_)) + return; + peer->expect_cookie_reply_ = false; + peer->has_mac2_cookie_ = true; + peer->mac2_cookie_timestamp_ = OsGetMilliseconds(); + memcpy(peer->mac2_cookie_, cookie, sizeof(peer->mac2_cookie_)); +} + +#if WITH_HANDSHAKE_EXT + +size_t WgPeer::WriteHandshakeExtension(uint8 *dst, WgKeypair *keypair) { + uint8 *dst_org = dst, value = 0; + // Include the supported features extension + if (!IsOnlyZeros(features_, sizeof(features_))) { + *dst++ = EXT_BOOLEAN_FEATURES; + *dst++ = (WG_FEATURES_COUNT + 3) >> 2; + for (size_t i = 0; i != WG_FEATURES_COUNT; i++) { + if ((i & 3) == 0) + value = 0; + dst[i >> 2] = (value += (features_[i] << ((i * 2) & 7))); + } + // swap WG_FEATURE_ID_SKIP_KEYID_IN and WG_FEATURE_ID_SKIP_KEYID_OUT + dst[1] = (dst[1] & 0xF0) + ((dst[1] >> 2) & 0x03) + ((dst[1] << 2) & 0x0C); + dst += (WG_FEATURES_COUNT + 3) >> 2; + } + // Ordered list of cipher suites + size_t ciphers = num_ciphers_; + if (ciphers) { + *dst++ = EXT_CIPHER_SUITES + cipher_prio_; + if (keypair) { + *dst++ = 1; + *dst++ = keypair->cipher_suite; + } else { + *dst++ = (uint8)ciphers; + memcpy(dst, ciphers_, ciphers); + dst += ciphers; + } + } + if (features_[WG_FEATURE_ID_IPZIP]) { + // Include the packet compression extension + *dst++ = EXT_PACKET_COMPRESSION; + *dst++ = sizeof(WgPacketCompressionVer01); + memcpy(dst, &dev_->compression_header_, sizeof(WgPacketCompressionVer01)); + dst += sizeof(WgPacketCompressionVer01); + } + return dst - dst_org; +} + +static bool ResolveBooleanFeatureValue(uint8 other, uint8 self, bool *result) { + // Truth tables indexed by (other * 4 + self), where each side is one of + // OFF/SUPPORTS/WANTS/ENFORCES: bit n of 0xfec0 gives the negotiated on/off + // result, bit n of 0xeff7 whether the combination is compatible + // (ENFORCES against OFF is rejected). + uint8 both = other * 4 + self; + *result = (0xfec0 >> both) & 1; + return (0xeff7 >> both) & 1; +} + +static const uint8 cipher_strengths[EXT_CIPHER_SUITE_COUNT] = {4,2,3,1}; + +static uint32 ResolveCipherSuite(int tie, const uint8 *a, size_t a_size, const uint8 *b, size_t b_size) { + uint32 abits[8] = {0}, bbits[8] = {0}, found_a = 0, found_b = 0; + for (size_t i = 0; i < a_size; i++) + abits[a[i] >> 5] |= 1 << (a[i] & 31); + for (size_t i = 0; i <
b_size; i++) + bbits[b[i] >> 5] |= 1 << (b[i] & 31); + for (size_t i = 0; i < a_size; i++) + if (bbits[a[i] >> 5] & (1 << (a[i] & 31))) { + found_a = a[i]; + break; + } + for (size_t i = 0; i < b_size; i++) + if (abits[b[i] >> 5] & (1 << (b[i] & 31))) { + found_b = b[i]; + break; + } + return (tie > 0 || + (tie == 0 && cipher_strengths[found_a] > cipher_strengths[found_b])) ? found_a : found_b; +} + +void WgKeypairSetupCompressionExtension(WgKeypair *keypair, const WgPacketCompressionVer01 *remotec) { + const WgPacketCompressionVer01 *localc = keypair->peer->dev_->compression_header(); + IpzipState *state = &keypair->ipzip_state_; + + // Use is_initiator as tie-breaker on who's going to be the client side. + int flags_xor = 0; + if ((localc->flags & ~3) + 2 * keypair->is_initiator - 1 <= (remotec->flags & ~3)) + std::swap(localc, remotec), flags_xor = 1; + state->flags_xor = flags_xor; + + memcpy(state->client_addr_v4, localc->ipv4_addr, 4); + memcpy(state->client_addr_v6, localc->ipv6_addr, 16); + state->guess_ttl[0] = localc->ttl; + state->client_addr_v4_subnet_bytes = (localc->flags & 3); + WriteLE32(&state->client_addr_v4_netmask, 0xffffffff >> ((localc->flags & 3) * 8)); + + memcpy(state->server_addr_v4, remotec->ipv4_addr, 4); + memcpy(state->server_addr_v6, remotec->ipv6_addr, 16); + state->guess_ttl[1] = remotec->ttl; + state->server_addr_v4_subnet_bytes = (remotec->flags & 3); + WriteLE32(&state->server_addr_v4_netmask, 0xffffffff >> ((remotec->flags & 3) * 8)); +} +bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size) { + bool did_setup_compression = false; + + while (data_size >= 2) { + uint8 type = data[0], size = data[1]; + data += 2, data_size -= 2; + if (size > data_size) + return false; + switch (type) { + case EXT_CIPHER_SUITES_PRIO: + case EXT_CIPHER_SUITES: + // Only this extension's |size| bytes form the cipher list, not the + // rest of the buffer. + keypair->cipher_suite = ResolveCipherSuite(keypair->peer->cipher_prio_ - (type - EXT_CIPHER_SUITES), + keypair->peer->ciphers_, keypair->peer->num_ciphers_, + data, size); + break; + case EXT_BOOLEAN_FEATURES: + for (size_t i = 0, j = std::max<size_t>(WG_FEATURES_COUNT, size * 4); i != j; i++) { + uint8 value = (i < size * 4) ? (data[i >> 2] >> ((i * 2) & 7)) & 3 : 0; + if (i >= WG_FEATURES_COUNT ? (value == WG_BOOLEAN_FEATURE_ENFORCES) : + !ResolveBooleanFeatureValue(value, keypair->peer->features_[i], &keypair->enabled_features[i])) + return false; + } + break; + case EXT_PACKET_COMPRESSION: + if (size == sizeof(WgPacketCompressionVer01)) { + WgPacketCompressionVer01 *c = (WgPacketCompressionVer01*)data; + if (ReadLE16(&c->version) == EXT_PACKET_COMPRESSION_VER) { + WgKeypairSetupCompressionExtension(keypair, c); + did_setup_compression = true; + } + } + break; + } + data += size, data_size -= size; + } + if (data_size != 0) + return false; + + keypair->enabled_features[WG_FEATURE_ID_IPZIP] &= did_setup_compression; + keypair->auth_tag_length = (keypair->enabled_features[WG_FEATURE_ID_SHORT_MAC] ?
8 : CHACHA20POLY1305_AUTHTAGLEN); + +// RINFO("Cipher Suite = %d", keypair->cipher_suite); + + return true; +} + +#endif // WITH_HANDSHAKE_EXT + +void WgPeer::ClearKeys() { + DeleteKeypair(&curr_keypair_); + DeleteKeypair(&next_keypair_); + DeleteKeypair(&prev_keypair_); +} + +void WgPeer::ClearHandshake() { + uint32 v = local_key_id_during_hs_; + if (v != 0) { + local_key_id_during_hs_ = 0; + dev_->key_id_lookup_.erase(v); + } +} + +void WgPeer::DeleteKeypair(WgKeypair **kp) { + WgKeypair *t = *kp; + *kp = NULL; + if (t) { + if (t->addr_entry) + dev_->EraseKeypairAddrEntry(t); + + if (t->local_key_id) + dev_->key_id_lookup_.erase(t->local_key_id); + + if (t->aes_gcm128_context_) + free(t->aes_gcm128_context_); + delete t; + } +} + +WgKeypair *WgPeer::CreateNewKeypair(bool is_initiator, const uint8 chaining_key[WG_HASH_LEN], uint32 remote_key_id, const uint8 *extfield, size_t extfield_size) { + WgKeypair *kp = new WgKeypair; + uint8 *first_key, *second_key; + if (!kp) + return NULL; + memset(kp, 0, offsetof(WgKeypair, replay_detector)); + kp->peer = this; + kp->is_initiator = is_initiator; + kp->remote_key_id = remote_key_id; + kp->auth_tag_length = CHACHA20POLY1305_AUTHTAGLEN; + +#if WITH_HANDSHAKE_EXT + if (!WgKeypairParseExtendedHandshake(kp, extfield, extfield_size)) + goto fail; +#endif // WITH_HANDSHAKE_EXT + + first_key = kp->send_key, second_key = kp->recv_key; + if (!is_initiator) + std::swap(first_key, second_key); + blake2s_hkdf(first_key, sizeof(kp->send_key), second_key, sizeof(kp->recv_key), + kp->auth_tag_length != CHACHA20POLY1305_AUTHTAGLEN ? (uint8*)kp->compress_mac_keys : NULL, 32, NULL, 0, chaining_key, WG_HASH_LEN); + + if (!is_initiator) { + std::swap(kp->compress_mac_keys[0][0], kp->compress_mac_keys[1][0]); + std::swap(kp->compress_mac_keys[0][1], kp->compress_mac_keys[1][1]); + } + +#if WITH_HANDSHAKE_EXT + if (kp->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && kp->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + kp->aes_gcm128_context_ = (AesGcm128StaticContext *)malloc(sizeof(*kp->aes_gcm128_context_) * 2); + if (!kp->aes_gcm128_context_) + goto fail; + int key_size = (kp->cipher_suite == EXT_CIPHER_SUITE_AES128_GCM) ? 128 : 256; + CRYPTO_gcm128_init(&kp->aes_gcm128_context_[0], kp->send_key, key_size); + CRYPTO_gcm128_init(&kp->aes_gcm128_context_[1], kp->recv_key, key_size); +#else + goto fail; +#endif + } +#endif // WITH_HANDSHAKE_EXT + + kp->send_key_state = kp->recv_key_state = WgKeypair::KEY_VALID; + time_of_next_key_event_ = 0; + kp->key_timestamp = OsGetMilliseconds(); + + return kp; + +fail: + delete kp; + return NULL; +} + +void WgPeer::InsertKeypairInPeer(WgKeypair *kp) { + assert(kp->peer == this); + DeleteKeypair(&prev_keypair_); + if (kp->is_initiator) { + // When we're the initiator we got the handshake response, so we can + // use the keypair right away. + if (next_keypair_) { + prev_keypair_ = next_keypair_; + next_keypair_ = NULL; + DeleteKeypair(&curr_keypair_); + } else { + prev_keypair_ = curr_keypair_; + } + curr_keypair_ = kp; + } else { + // The keypair will be moved to curr when we get the first data packet.
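+ // Example rotation as responder: the fresh keypair waits in |next_keypair_| + // until the initiator's first data packet arrives; CheckSwitchToNextKey() + // then promotes it to |curr_keypair_| and demotes the old current key to + // |prev_keypair_| so in-flight packets still decrypt.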
+ DeleteKeypair(&next_keypair_); + next_keypair_ = kp; + } +} + +bool WgPeer::CheckSwitchToNextKey(WgKeypair *keypair) { + if (keypair != next_keypair_) + return false; + DeleteKeypair(&prev_keypair_); + prev_keypair_ = curr_keypair_; + curr_keypair_ = next_keypair_; + next_keypair_ = NULL; + time_of_next_key_event_ = 0; + return true; +} + +bool WgPeer::CheckHandshakeRateLimit() { + uint64 now = OsGetMilliseconds(); + if (now - last_handshake_init_timestamp_ < REKEY_TIMEOUT_MS) + return false; + last_handshake_init_timestamp_ = now; + return true; +} + +void WgPeer::WriteMacToPacket(const uint8 *data, MessageMacs *dst) { + expect_cookie_reply_ = true; + blake2s(dst->mac1, sizeof(dst->mac1), data, (uint8*)dst->mac1 - data, precomputed_mac1_key_, sizeof(precomputed_mac1_key_)); + memcpy(sent_mac1_, dst->mac1, sizeof(sent_mac1_)); + if (has_mac2_cookie_ && OsGetMilliseconds() - mac2_cookie_timestamp_ < COOKIE_SECRET_MAX_AGE_MS - COOKIE_SECRET_LATENCY_MS) { + blake2s(dst->mac2, sizeof(dst->mac2), data, (uint8*)dst->mac2 - data, mac2_cookie_, sizeof(mac2_cookie_)); + } else { + has_mac2_cookie_ = false; + + if (dev_->header_obfuscation_) { + // When obfuscation is enabled, fill mac2 with random bytes instead of zeros. + for (size_t i = 0; i < 4; i++) + ((uint32*)dst->mac2)[i] = dev_->GetRandomNumber(); + } else { + memset(dst->mac2, 0, sizeof(dst->mac2)); + } + } +} + +enum { + // Timer for retransmitting the handshake if we don't hear back after REKEY_TIMEOUT_MS + TIMER_RETRANSMIT_HANDSHAKE = 0, + // Timer for sending a keepalive if we received a packet but don't send anything else for KEEPALIVE_TIMEOUT_MS + TIMER_SEND_KEEPALIVE = 1, + // Timer for initiating a new handshake if we have sent a packet but have not received one for KEEPALIVE_TIMEOUT_MS + REKEY_TIMEOUT_MS + TIMER_NEW_HANDSHAKE = 2, + // Timer for zeroing out all keys and handshake state after (REJECT_AFTER_TIME_MS * 3) if no new keys have been received + TIMER_ZERO_KEYS = 3, + // Timer for sending a keepalive packet every PERSISTENT_KEEPALIVE_MS + TIMER_PERSISTENT_KEEPALIVE = 4, +}; + +// Each timer has an 'armed' bit (bit x) and a 'newly set' bit (bit x + 5); +// 33 == (1 | 32) touches both at once. +#define WgClearTimer(x) (timers_ &= ~(33 << (x))) +#define WgIsTimerActive(x) (timers_ & (33 << (x))) +#define WgSetTimer(x) (timers_ |= (32 << (x))) + +void WgPeer::OnDataSent() { + WgClearTimer(TIMER_SEND_KEEPALIVE); + if (!WgIsTimerActive(TIMER_NEW_HANDSHAKE)) + WgSetTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnKeepaliveSent() { + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnDataReceived() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + if (!WgIsTimerActive(TIMER_SEND_KEEPALIVE)) + WgSetTimer(TIMER_SEND_KEEPALIVE); + else + pending_keepalive_ = true; + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnKeepaliveReceived() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +void WgPeer::OnHandshakeInitSent() { + WgClearTimer(TIMER_SEND_KEEPALIVE); + WgSetTimer(TIMER_RETRANSMIT_HANDSHAKE); +} + +void WgPeer::OnHandshakeAuthComplete() { + WgClearTimer(TIMER_NEW_HANDSHAKE); + WgSetTimer(TIMER_ZERO_KEYS); + WgSetTimer(TIMER_PERSISTENT_KEEPALIVE); +} + +static const char * const kCipherSuites[] = { + "chacha20-poly1305", + "aes128-gcm", + "aes256-gcm", + "none" +}; + +void WgPeer::OnHandshakeFullyComplete() { + WgClearTimer(TIMER_RETRANSMIT_HANDSHAKE); + handshake_attempts_ = 0; + + if (last_complete_handskake_timestamp_ == 0) { + bool any_feature = false; + for (size_t i = 0; i < WG_FEATURES_COUNT; i++) + any_feature |= curr_keypair_->enabled_features[i]; + if
(curr_keypair_->cipher_suite != 0 || any_feature) { + RINFO("Using %s, %s %s %s %s %s", kCipherSuites[curr_keypair_->cipher_suite], + curr_keypair_->enabled_features[0] ? "short_header" : "", + curr_keypair_->enabled_features[1] ? "mac64" : "", + curr_keypair_->enabled_features[2] ? "ipzip" : "", + curr_keypair_->enabled_features[4] ? "skip_keyid_in" : "", + curr_keypair_->enabled_features[5] ? "skip_keyid_out" : ""); + } + + + } + + last_complete_handskake_timestamp_ = OsGetMilliseconds(); + dev_->last_complete_handskake_timestamp_ = last_complete_handskake_timestamp_; +// RINFO("Connection established."); +} + +// Check if any of the timeouts have expired +uint32 WgPeer::CheckTimeouts(uint64 now) { + uint32 t, rv = 0; + + if (now >= time_of_next_key_event_) + CheckAndUpdateTimeOfNextKeyEvent(now); + + if ((t = timers_) == 0) + return 0; + uint32 now32 = (uint32)now; + // Got any new timers? + if (t & (0x1f << 5)) { + if (t & (1 << (5+0))) timer_value_[0] = now32; + if (t & (1 << (5+1))) timer_value_[1] = now32; + if (t & (1 << (5+2))) timer_value_[2] = now32; + if (t & (1 << (5+3))) timer_value_[3] = now32; + if (t & (1 << (5+4))) timer_value_[4] = now32; + t |= (t >> 5); + t &= 0x1F; + } + // Got any expired timers? + if (t & 0x1F) { + if ((t & (1 << TIMER_RETRANSMIT_HANDSHAKE)) && (now32 - timer_value_[TIMER_RETRANSMIT_HANDSHAKE]) >= REKEY_TIMEOUT_MS) { + t ^= (1 << TIMER_RETRANSMIT_HANDSHAKE); + if (handshake_attempts_ > MAX_HANDSHAKE_ATTEMPTS) { + RINFO("Too many handshake attempts. Stopping."); + t &= ~(1 << TIMER_SEND_KEEPALIVE); + ClearPacketQueue(); + } else { + RINFO("Retrying handshake, attempt %d...", handshake_attempts_ + 2); + handshake_attempts_++; + rv |= ACTION_SEND_HANDSHAKE; + } + } + if ((t & (1 << TIMER_SEND_KEEPALIVE)) && (now32 - timer_value_[TIMER_SEND_KEEPALIVE]) >= KEEPALIVE_TIMEOUT_MS) { + t &= ~(1 << TIMER_SEND_KEEPALIVE); + rv |= ACTION_SEND_KEEPALIVE; + if (pending_keepalive_) { + pending_keepalive_ = false; + timer_value_[TIMER_SEND_KEEPALIVE] = now32; + t |= (1 << TIMER_SEND_KEEPALIVE); + } + } + if ((t & (1 << TIMER_PERSISTENT_KEEPALIVE)) && (now32 - timer_value_[TIMER_PERSISTENT_KEEPALIVE]) >= (uint32)persistent_keepalive_ms_) { + t &= ~(1 << TIMER_PERSISTENT_KEEPALIVE); + if (persistent_keepalive_ms_) { + t &= ~(1 << TIMER_SEND_KEEPALIVE); + rv |= ACTION_SEND_KEEPALIVE; + } + } + if ((t & (1 << TIMER_NEW_HANDSHAKE)) && (now32 - timer_value_[TIMER_NEW_HANDSHAKE]) >= KEEPALIVE_TIMEOUT_MS + REKEY_TIMEOUT_MS) { + t &= ~(1 << TIMER_NEW_HANDSHAKE); + handshake_attempts_ = 0; + rv |= ACTION_SEND_HANDSHAKE; + RINFO("Retrying handshake with peer"); + } + if ((t & (1 << TIMER_ZERO_KEYS)) && (now32 - timer_value_[TIMER_ZERO_KEYS]) >= REJECT_AFTER_TIME_MS * 3) { + RINFO("Expiring all keys for peer"); + t &= ~(1 << TIMER_ZERO_KEYS); + ClearKeys(); + ClearHandshake(); + } + } + timers_ = t; + return rv; +} + +// Check all key stuff here to avoid calling possibly expensive timestamp routines in the packet handler +void WgPeer::CheckAndUpdateTimeOfNextKeyEvent(uint64 now) { + uint64 next_time = UINT64_MAX; + uint32 rv = 0; + + if (curr_keypair_ != NULL) { + if (now >= curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) { + DeleteKeypair(&curr_keypair_); + } else if (curr_keypair_->is_initiator) { + // if a peer is the initiator of a current secure session, WireGuard will send a handshake initiation + // message to begin a new secure session if, after transmitting a transport data message, the current secure session + // is REKEY_AFTER_TIME_MS old, or if after receiving 
a transport data message, the current secure session is + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS) old and it has not yet acted upon + this event. + if (now >= curr_keypair_->key_timestamp + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS)) { + next_time = curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS; + if (curr_keypair_->recv_key_state == WgKeypair::KEY_VALID) + curr_keypair_->recv_key_state = WgKeypair::KEY_WANT_REFRESH; + } else if (now >= curr_keypair_->key_timestamp + REKEY_AFTER_TIME_MS) { + next_time = curr_keypair_->key_timestamp + (REJECT_AFTER_TIME_MS - KEEPALIVE_TIMEOUT_MS - REKEY_TIMEOUT_MS); + if (curr_keypair_->send_key_state == WgKeypair::KEY_VALID) + curr_keypair_->send_key_state = WgKeypair::KEY_WANT_REFRESH; + } else { + next_time = curr_keypair_->key_timestamp + REKEY_AFTER_TIME_MS; + } + } else { + next_time = curr_keypair_->key_timestamp + REJECT_AFTER_TIME_MS; + } + } + if (prev_keypair_ != NULL) { + if (now >= prev_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) + DeleteKeypair(&prev_keypair_); + else + next_time = std::min(next_time, prev_keypair_->key_timestamp + REJECT_AFTER_TIME_MS); + } + if (next_keypair_ != NULL) { + if (now >= next_keypair_->key_timestamp + REJECT_AFTER_TIME_MS) + DeleteKeypair(&next_keypair_); + else + next_time = std::min(next_time, next_keypair_->key_timestamp + REJECT_AFTER_TIME_MS); + } + time_of_next_key_event_ = next_time; +} + +void WgPeer::SetEndpoint(const IpAddr &sin) { + endpoint_ = sin; +} + +void WgPeer::SetPersistentKeepalive(int persistent_keepalive_secs) { + if (persistent_keepalive_secs < 10 || persistent_keepalive_secs > 10000) + return; + persistent_keepalive_ms_ = persistent_keepalive_secs * 1000; +} + +bool WgPeer::AddIp(const WgCidrAddr &cidr_addr) { + if (cidr_addr.size == 32) { + if (cidr_addr.cidr > 32) + return false; + dev_->ip_to_peer_map_.InsertV4(cidr_addr.addr, cidr_addr.cidr, this); + allowed_ips_.push_back(cidr_addr); + return true; + } else if (cidr_addr.size == 128) { + if (cidr_addr.cidr > 128) + return false; + dev_->ip_to_peer_map_.InsertV6(cidr_addr.addr, cidr_addr.cidr, this); + allowed_ips_.push_back(cidr_addr); + return true; + } else { + return false; + } +} + +void WgPeer::SetAllowMulticast(bool allow) { + allow_multicast_through_peer_ = allow; +} + +void WgPeer::SetFeature(int feature, uint8 value) { + features_[feature] = value; +} + +bool WgPeer::AddCipher(int cipher) { + if (num_ciphers_ == MAX_CIPHERS) + return false; + + // AES-GCM is silently dropped when the build or CPU lacks support; + // negotiation then falls back to another suite. + if (cipher == EXT_CIPHER_SUITE_AES128_GCM || cipher == EXT_CIPHER_SUITE_AES256_GCM) { +#if !WITH_AESGCM + return true; +#endif // !WITH_AESGCM + if (!X86_PCAP_AES) + return true; + } + + ciphers_[num_ciphers_++] = cipher; + return true; +} + +WgRateLimit::WgRateLimit() { + key1_[0] = key1_[1] = 1; + key2_[0] = key2_[1] = 1; + bin1_ = bins_[0]; + bin2_ = bins_[1]; + rand_ = 0; + rand_xor_ = 0; + packets_per_sec_ = PACKETS_PER_SEC; + used_rate_limit_ = 0; + memset(bins_, 0, sizeof(bins_)); +} + +void WgRateLimit::Periodic(uint32 s[5]) { + unsigned int per_sec = PACKETS_PER_SEC; + if (used_rate_limit_ >= TOTAL_PACKETS_PER_SEC) { + per_sec = PACKETS_PER_SEC * TOTAL_PACKETS_PER_SEC / used_rate_limit_; + if (per_sec < 1) + per_sec = 1; + } + + if ((unsigned)per_sec > packets_per_sec_) + per_sec = (per_sec + packets_per_sec_ + 1) >> 1; + +// if (per_sec != packets_per_sec_) { +// RINFO("Setting pps: %d", per_sec); + packets_per_sec_ = per_sec; +// } + + used_rate_limit_ = 0; + rand_xor_ = s[4]; + key2_[0] = key1_[0]; +
key2_[1] = key1_[1]; + memcpy(key1_, s, sizeof(key1_)); + std::swap(bin1_, bin2_); + memset(bin1_, 0, BINSIZE); +} + +static inline size_t hashit(uint64 ip, const uint64 *key) { + uint64 x = ip * key[0] + rol64(ip, 32) * key[1]; + uint32 a = (uint32)(x + (x >> 32) * 0x85ebca6b); + a -= a >> 16; + a ^= a >> 4; + return a; +} + +WgRateLimit::RateLimitResult WgRateLimit::CheckRateLimit(uint64 ip) { + uint8 *a = &bin1_[hashit(ip, key1_) & (BINSIZE - 1)]; + uint8 *b = &bin2_[hashit(ip, key2_) & (BINSIZE - 1)]; + unsigned int old = std::max<int32>(*a, (int32)(*b - packets_per_sec_)), v = 0; + if (old < PACKET_ACCUM / 2) { + v = 1; + } else if (old < PACKET_ACCUM) { + v = old < ((uint64)rand_ * ((PACKET_ACCUM / 2) + 1) >> 32) + (PACKET_ACCUM / 2); + rand_ = (rand_ * 0x1b873593 + 5) + rand_xor_; + } + RateLimitResult rr = {a, (uint8)(old + v), (uint8)v}; + return rr; +} + +void WgKeypairEncryptPayload(uint8 *dst, const size_t src_len, + const uint8 *ad, const size_t ad_len, + const uint64 nonce, WgKeypair *keypair) { + if (keypair->cipher_suite == EXT_CIPHER_SUITE_CHACHA20POLY1305) { + chacha20poly1305_encrypt(dst, dst, src_len, ad, ad_len, nonce, keypair->send_key); + } else if (keypair->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && keypair->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + aesgcm_encrypt(dst, dst, src_len, ad, ad_len, nonce, &keypair->aes_gcm128_context_[0]); +#endif // WITH_AESGCM + } else { + poly1305_get_mac(dst, src_len, ad, ad_len, nonce, keypair->send_key, dst + src_len); + } + + // Convert MAC to 8 bytes if that's all we need. + if (keypair->auth_tag_length != WG_MAC_LEN) { + uint8 *mac = dst + src_len; + uint64 rv = siphash_2u64(ReadLE64(mac), ReadLE64(mac + 8), (siphash_key_t*)keypair->compress_mac_keys[0]); + WriteLE64(mac, rv); + } +} + +bool WgKeypairDecryptPayload(uint8 *dst, size_t src_len, + const uint8 *ad, size_t ad_len, + const uint64 nonce, WgKeypair *keypair) { + uint8 mac[16]; + + if (src_len < keypair->auth_tag_length) + return false; + + src_len -= keypair->auth_tag_length; + + if (keypair->cipher_suite == EXT_CIPHER_SUITE_CHACHA20POLY1305) { + chacha20poly1305_decrypt_get_mac(dst, dst, src_len, ad, ad_len, nonce, keypair->recv_key, mac); + } else if (keypair->cipher_suite >= EXT_CIPHER_SUITE_AES128_GCM && keypair->cipher_suite <= EXT_CIPHER_SUITE_AES256_GCM) { +#if WITH_AESGCM + aesgcm_decrypt_get_mac(dst, dst, src_len, ad, ad_len, nonce, &keypair->aes_gcm128_context_[1], mac); +#else // WITH_AESGCM + return false; +#endif // WITH_AESGCM + } else { + poly1305_get_mac(dst, src_len, ad, ad_len, nonce, keypair->recv_key, mac); + } + + if (keypair->auth_tag_length == WG_MAC_LEN) { + return memcmp_crypto(mac, dst + src_len, WG_MAC_LEN) == 0; + } else { + uint64 rv = siphash_2u64(ReadLE64(mac), ReadLE64(mac + 8), (siphash_key_t*)keypair->compress_mac_keys[1]); + WriteLE64(mac, rv); + return memcmp_crypto(mac, dst + src_len, keypair->auth_tag_length) == 0; + } +} diff --git a/wireguard_proto.h b/wireguard_proto.h new file mode 100644 index 0000000..cd66901 --- /dev/null +++ b/wireguard_proto.h @@ -0,0 +1,617 @@ +// SPDX-License-Identifier: AGPL-1.0-only +// Copyright (C) 2018 Ludvig Strigeus . All Rights Reserved.
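+// Protocol machinery for WireGuard: wire message layouts, protocol timers, +// and the device / peer / keypair state, plus TunSafe's handshake extensions.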
+#pragma once + +#include "tunsafe_types.h" +#include "netapi.h" +#include "tunsafe_config.h" +#include <unordered_map> +#include <vector> + +enum ProtocolTimeouts { + COOKIE_SECRET_MAX_AGE_MS = 120000, + COOKIE_SECRET_LATENCY_MS = 5000, + REKEY_TIMEOUT_MS = 5000, + KEEPALIVE_TIMEOUT_MS = 10000, + REKEY_AFTER_TIME_MS = 120000, + REJECT_AFTER_TIME_MS = 180000, + PERSISTENT_KEEPALIVE_MS = 25000, + MIN_HANDSHAKE_INTERVAL_MS = 20, +}; + +enum ProtocolLimits { + REJECT_AFTER_MESSAGES = UINT64_MAX - 2048, + REKEY_AFTER_MESSAGES = UINT64_MAX - 0xffff, + + MAX_HANDSHAKE_ATTEMPTS = 20, + MAX_QUEUED_PACKETS_PER_PEER = 128, + MESSAGE_MINIMUM_SIZE = 16, + MAX_SIZE_OF_HANDSHAKE_EXTENSION = 1024, +}; + +enum MessageType { + MESSAGE_HANDSHAKE_INITIATION = 1, + MESSAGE_HANDSHAKE_RESPONSE = 2, + MESSAGE_HANDSHAKE_COOKIE = 3, + MESSAGE_DATA = 4, +}; + +enum MessageFieldSizes { + WG_COOKIE_LEN = 16, + WG_COOKIE_NONCE_LEN = 24, + WG_PUBLIC_KEY_LEN = 32, + WG_HASH_LEN = 32, + WG_SYMMETRIC_KEY_LEN = 32, + WG_MAC_LEN = 16, + WG_TIMESTAMP_LEN = 12, + WG_SIPHASH_KEY_LEN = 16, +}; + +enum { + WG_SHORT_HEADER_BIT = 0x80, + WG_SHORT_HEADER_KEY_ID_MASK = 0x60, + WG_SHORT_HEADER_KEY_ID = 0x20, + WG_SHORT_HEADER_ACK = 0x10, + WG_SHORT_HEADER_TYPE_MASK = 0x0F, + WG_SHORT_HEADER_CTR1 = 0x00, + WG_SHORT_HEADER_CTR2 = 0x01, + WG_SHORT_HEADER_CTR4 = 0x02, + + WG_ACK_HEADER_COUNTER_MASK = 0x0C, + WG_ACK_HEADER_COUNTER_NONE = 0x00, + WG_ACK_HEADER_COUNTER_2 = 0x04, + WG_ACK_HEADER_COUNTER_4 = 0x08, + WG_ACK_HEADER_COUNTER_8 = 0x0C, + + WG_ACK_HEADER_KEY_MASK = 3, +}; + + +struct MessageMacs { + uint8 mac1[WG_COOKIE_LEN]; + uint8 mac2[WG_COOKIE_LEN]; +}; +STATIC_ASSERT(sizeof(MessageMacs) == 32, MessageMacs_wrong_size); + +struct MessageHandshakeInitiation { + uint32 type; + uint32 sender_key_id; + uint8 ephemeral[WG_PUBLIC_KEY_LEN]; + uint8 static_enc[WG_PUBLIC_KEY_LEN + WG_MAC_LEN]; + uint8 timestamp_enc[WG_TIMESTAMP_LEN + WG_MAC_LEN]; + MessageMacs mac; +}; +STATIC_ASSERT(sizeof(MessageHandshakeInitiation) == 148, MessageHandshakeInitiation_wrong_size); + +// Format of variable length payload.
+// 1 byte type +// 1 byte length +// <length> bytes of payload + + +struct MessageHandshakeResponse { + uint32 type; + uint32 sender_key_id; + uint32 receiver_key_id; + uint8 ephemeral[WG_PUBLIC_KEY_LEN]; + uint8 empty_enc[WG_MAC_LEN]; + MessageMacs mac; +}; +STATIC_ASSERT(sizeof(MessageHandshakeResponse) == 92, MessageHandshakeResponse_wrong_size); + +struct MessageHandshakeCookie { + uint32 type; + uint32 receiver_key_id; + uint8 nonce[WG_COOKIE_NONCE_LEN]; + uint8 cookie_enc[WG_COOKIE_LEN + WG_MAC_LEN]; +}; +STATIC_ASSERT(sizeof(MessageHandshakeCookie) == 64, MessageHandshakeCookie_wrong_size); + +struct MessageData { + uint32 type; + uint32 receiver_id; + uint64 counter; +}; +STATIC_ASSERT(sizeof(MessageData) == 16, MessageData_wrong_size); + +enum { + EXT_PACKET_COMPRESSION = 0x15, + EXT_PACKET_COMPRESSION_VER = 0x01, + + EXT_BOOLEAN_FEATURES = 0x16, + + EXT_CIPHER_SUITES = 0x18, + EXT_CIPHER_SUITES_PRIO = 0x19, + + // The standard WireGuard ChaCha20-Poly1305 + EXT_CIPHER_SUITE_CHACHA20POLY1305 = 0x00, + // AES GCM 128 bit + EXT_CIPHER_SUITE_AES128_GCM = 0x01, + // AES GCM 256 bit + EXT_CIPHER_SUITE_AES256_GCM = 0x02, + // Same as CHACHA20POLY1305 but without the encryption step + EXT_CIPHER_SUITE_NONE_POLY1305 = 0x03, + + EXT_CIPHER_SUITE_COUNT = 4, + +}; + +enum { + WG_FEATURES_COUNT = 6, + WG_FEATURE_ID_SHORT_HEADER = 0, // Supports short headers + WG_FEATURE_ID_SHORT_MAC = 1, // Supports 8-byte MAC + WG_FEATURE_ID_IPZIP = 2, // Using ipzip + WG_FEATURE_ID_SKIP_KEYID_IN = 4, // Skip keyid for incoming packets + WG_FEATURE_ID_SKIP_KEYID_OUT = 5, // Skip keyid for outgoing packets + // (feature id 3 is currently unassigned) +}; + +enum { + WG_BOOLEAN_FEATURE_OFF = 0x0, + WG_BOOLEAN_FEATURE_SUPPORTS = 0x1, + WG_BOOLEAN_FEATURE_WANTS = 0x2, + WG_BOOLEAN_FEATURE_ENFORCES = 0x3, +}; + +struct WgPacketCompressionVer01 { + uint16 version; // Packet compressor version + uint8 ttl; // Guessed TTL + uint8 flags; // Subnet length and packet direction + uint8 ipv4_addr[4]; // IPV4 address of endpoint + uint8 ipv6_addr[16]; // IPV6 address of endpoint +}; +STATIC_ASSERT(sizeof(WgPacketCompressionVer01) == 24, WgPacketCompressionVer01_wrong_size); + + +struct WgKeypair; +class WgPeer; + +// Maps CIDR addresses to a peer, always returning the longest match +class IpToPeerMap { +public: + IpToPeerMap(); + ~IpToPeerMap(); + + // Inserts an IP address of a given CIDR length into the lookup table, pointing to peer.
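+ // E.g. after InsertV4 of 10.0.0.0/8 -> A and 10.1.0.0/16 -> B, a LookupV4 + // of 10.1.2.3 returns B, the most specific match.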
+ bool InsertV4(const void *addr, int cidr, void *peer); + bool InsertV6(const void *addr, int cidr, void *peer); + + // Lookup the peer matching the IP Address + void *LookupV4(uint32 ip); + void *LookupV6(const void *addr); + + void *LookupV4DefaultPeer(); + void *LookupV6DefaultPeer(); + + // Remove a peer from the table + void RemovePeer(void *peer); +private: + struct Entry4 { + uint32 ip; + uint32 mask; + void *peer; + }; + struct Entry6 { + uint8 ip[16]; + uint8 cidr_len; + void *peer; + }; + std::vector<Entry4> ipv4_; + std::vector<Entry6> ipv6_; +}; + +class WgRateLimit { +public: + WgRateLimit(); + + struct RateLimitResult { + uint8 *value_ptr; + uint8 new_value; + uint8 is_ok; + + bool is_rate_limited() { return !is_ok; } + bool is_first_ip() { return new_value == 1; } + }; + + RateLimitResult CheckRateLimit(uint64 ip); + + void CommitResult(const RateLimitResult &rr) { *rr.value_ptr = rr.new_value; if (used_rate_limit_++ == TOTAL_PACKETS_PER_SEC) packets_per_sec_ = (packets_per_sec_ + 1) >> 1; } + + void Periodic(uint32 s[5]); + + bool is_used() { return used_rate_limit_ != 0 || packets_per_sec_ != PACKETS_PER_SEC; } +private: + uint8 *bin1_, *bin2_; + uint32 rand_, rand_xor_; + uint32 packets_per_sec_, used_rate_limit_; + uint64 key1_[2], key2_[2]; + enum { + BINSIZE = 4096, + PACKETS_PER_SEC = 25, + PACKET_ACCUM = 100, + TOTAL_PACKETS_PER_SEC = 25000, + }; + uint8 bins_[2][BINSIZE]; +}; + +struct WgAddrEntry { + // The id of the addr entry, so we can delete ourselves + uint64 addr_entry_id; + + // Ensure at least 1 minute passes between registrations of new keys + // in this table. This means that each key will have a lifetime of at + // least 3 minutes. + uint64 time_of_last_insertion; + + // This entry gets erased when there's no longer any key pointing at it. + uint8 ref_count; + + // Index of the next slot 0-2 where we'll insert the next key. + uint8 next_slot; + + // The three keys. + WgKeypair *keys[3]; + + WgAddrEntry(uint64 addr_entry_id) : addr_entry_id(addr_entry_id), ref_count(0), next_slot(0) { + keys[0] = keys[1] = keys[2] = NULL; + time_of_last_insertion = 0x123456789123456; + } +}; + +struct ScramblerSiphashKeys { + uint64 keys[4]; +}; + +// Implementation of most business logic of WireGuard +class WgDevice { + friend class WgPeer; + friend class WireguardProcessor; +public: + WgDevice(); + ~WgDevice(); + + // Initialize with the private key, precompute all internal keys etc.
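+ // (The public key is derived from |private_key| with Curve25519, and the + // mac1/cookie keys are BLAKE2s hashes of a fixed label plus that public key.)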
+ void Initialize(const uint8 private_key[WG_PUBLIC_KEY_LEN]); + + WgPeer *AddPeer(); + + // Setup header obfuscation + void SetHeaderObfuscation(const char *key); + + // Check whether Mac1 appears to be valid + bool CheckCookieMac1(Packet *packet); + + // Check whether Mac2 appears to be valid; this also uses + // the remote IP address + bool CheckCookieMac2(Packet *packet); + + void CreateCookieMessage(MessageHandshakeCookie *dst, Packet *packet, uint32 remote_key_id); + + void UpdateKeypairAddrEntry(uint64 addr_id, WgKeypair *keypair); + + IpToPeerMap &ip_to_peer_map() { return ip_to_peer_map_; } + + std::unordered_map<uint32, std::pair<WgPeer*, WgKeypair*> > &key_id_lookup() { return key_id_lookup_; } + + WgPeer *first_peer() { return peers_; } + + uint64 last_complete_handskake_timestamp() const { + return last_complete_handskake_timestamp_; + } + + const uint8 *public_key() const { return s_pub_; } + + void SecondLoop(uint64 now); + + WgRateLimit *rate_limiter() { return &rate_limiter_; } + + std::unordered_map<uint64, WgAddrEntry*> &addr_entry_map() { return addr_entry_lookup_; } + + + WgPacketCompressionVer01 *compression_header() { return &compression_header_; } +private: + // Return the peer matching the |public_key| or NULL + WgPeer *GetPeerFromPublicKey(uint8 public_key[WG_PUBLIC_KEY_LEN]); + // Create a cookie by inspecting the source address of the |packet| + void MakeCookie(uint8 cookie[WG_COOKIE_LEN], Packet *packet); + // Insert a new entry in |key_id_lookup_| + uint32 InsertInKeyIdLookup(WgPeer *peer, WgKeypair *kp); + // Get a random number + uint32 GetRandomNumber(); + + void EraseKeypairAddrEntry(WgKeypair *kp); + + // Maps IP addresses to peers + IpToPeerMap ip_to_peer_map_; + // For enumerating all peers + WgPeer *peers_; + // Mapping from key-id to either an active keypair (if keypair is non-NULL), + // or to a handshake. + std::unordered_map<uint32, std::pair<WgPeer*, WgKeypair*> > key_id_lookup_; + + // Mapping from IPV4 IP/PORT to WgPeer*, so we can find the peer when a key id is + // not explicitly included.
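+ // Each WgAddrEntry holds up to the three most recent keypairs negotiated + // from that address; UpdateKeypairAddrEntry() rotates them through |keys|.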
+ std::unordered_map<uint64, WgAddrEntry*> addr_entry_lookup_; + + // Next slot to consume from |random_number_output_|; 0 triggers a refill. + uint8 next_rng_slot_; + + // Whether packet obfuscation is enabled + bool header_obfuscation_; + + uint64 last_complete_handskake_timestamp_; + + uint64 low_resolution_timestamp_; + + uint64 cookie_secret_timestamp_; + uint8 cookie_secret_[WG_HASH_LEN]; + uint8 s_priv_[WG_PUBLIC_KEY_LEN]; + uint8 s_pub_[WG_PUBLIC_KEY_LEN]; + + // Siphash keys for packet scrambling + ScramblerSiphashKeys header_obfuscation_key_; + + uint8 precomputed_cookie_key_[WG_SYMMETRIC_KEY_LEN]; + uint8 precomputed_mac1_key_[WG_SYMMETRIC_KEY_LEN]; + + uint64 random_number_input_[WG_HASH_LEN / 8 + 1]; + uint32 random_number_output_[WG_HASH_LEN / 4]; + + WgRateLimit rate_limiter_; + + WgPacketCompressionVer01 compression_header_; +}; + +// Per-peer state: Noise handshake, negotiated keypairs, timers and allowed IPs. +class WgPeer { + friend class WgDevice; + friend class WireguardProcessor; + friend bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size); + friend void WgKeypairSetupCompressionExtension(WgKeypair *keypair, const WgPacketCompressionVer01 *remotec); +public: + explicit WgPeer(WgDevice *dev); + ~WgPeer(); + + void Initialize(const uint8 spub[WG_PUBLIC_KEY_LEN], const uint8 preshared_key[WG_SYMMETRIC_KEY_LEN]); + + void SetPersistentKeepalive(int persistent_keepalive_secs); + void SetEndpoint(const IpAddr &sin); + void SetAllowMulticast(bool allow); + + void SetFeature(int feature, uint8 value); + bool AddCipher(int cipher); + void SetCipherPrio(bool prio) { cipher_prio_ = prio; } + bool AddIp(const WgCidrAddr &cidr_addr); + + static WgPeer *ParseMessageHandshakeInitiation(WgDevice *dev, Packet *packet); + static WgPeer *ParseMessageHandshakeResponse(WgDevice *dev, const Packet *packet); + static void ParseMessageHandshakeCookie(WgDevice *dev, const MessageHandshakeCookie *src); + void CreateMessageHandshakeInitiation(Packet *packet); + bool CheckSwitchToNextKey(WgKeypair *keypair); + void ClearKeys(); + void ClearHandshake(); + void ClearPacketQueue(); + bool CheckHandshakeRateLimit(); + + // Timer notifications + void OnDataSent(); + void OnKeepaliveSent(); + void OnDataReceived(); + void OnKeepaliveReceived(); + void OnHandshakeInitSent(); + void OnHandshakeAuthComplete(); + void OnHandshakeFullyComplete(); + + enum { + ACTION_SEND_KEEPALIVE = 1, + ACTION_SEND_HANDSHAKE = 2, + }; + uint32 CheckTimeouts(uint64 now); + +private: + WgKeypair *CreateNewKeypair(bool is_initiator, const uint8 key[WG_HASH_LEN], uint32 send_key_id, const uint8 *extfield, size_t extfield_size); + void WriteMacToPacket(const uint8 *data, MessageMacs *mac); + void DeleteKeypair(WgKeypair **kp); + void CheckAndUpdateTimeOfNextKeyEvent(uint64 now); + static void CopyEndpointToPeer(WgKeypair *keypair, const IpAddr *addr); + size_t WriteHandshakeExtension(uint8 *dst, WgKeypair *keypair); + void InsertKeypairInPeer(WgKeypair *keypair); + + WgDevice *dev_; + WgPeer *next_peer_; + + // Keypairs: |curr_keypair_| is the active one; |prev_keypair_| and + // |next_keypair_| hold the previous and the pending one. + WgKeypair *curr_keypair_; + WgKeypair *prev_keypair_; + WgKeypair *next_keypair_; + + // Timestamp when the next key related event is going to occur.
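+ // (0 forces a re-evaluation on the next CheckTimeouts() call; UINT64_MAX + // means no key event is scheduled.)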
+
+  // Timestamp when the next key related event is going to occur.
+  uint64 time_of_next_key_event_;
+
+  // For timer management
+  uint32 timers_;
+  uint32 timer_value_[5];
+
+  // Holds our entry in the key id table during the handshake
+  uint32 local_key_id_during_hs_;
+  IpAddr endpoint_;
+
+  // The broadcast address of the IPv4 network, used to block broadcast traffic
+  // from being sent out over the VPN link.
+  uint32 ipv4_broadcast_addr_;
+
+  bool supports_handshake_extensions_;
+
+  bool pending_keepalive_;
+  bool expect_cookie_reply_;
+
+  // Whether we want to route incoming multicast/broadcast traffic to this peer.
+  bool allow_multicast_through_peer_;
+
+  // Whether |mac2_cookie_| currently holds a valid cookie.
+  bool has_mac2_cookie_;
+
+  // Number of handshake attempts made so far; when this gets too high we stop connecting.
+  uint8 handshake_attempts_;
+
+  // Which features are enabled for this peer?
+  uint8 features_[WG_FEATURES_COUNT];
+
+  // Queue of packets that will get sent once the handshake finishes
+  uint8 num_queued_packets_;
+  Packet *first_queued_packet_, **last_queued_packet_ptr_;
+
+  uint64 last_handshake_init_timestamp_;
+  uint64 last_complete_handshake_timestamp_;
+  uint64 last_handshake_init_recv_timestamp_;
+
+  enum { MAX_CIPHERS = 16 };
+  uint8 cipher_prio_;
+  uint8 num_ciphers_;
+  uint8 ciphers_[MAX_CIPHERS];
+
+  // Handshake state that gets set up in |CreateMessageHandshakeInitiation| and
+  // is used when processing the response.
+  struct HandshakeState {
+    // Hash
+    uint8 hi[WG_HASH_LEN];
+    // Chaining key
+    uint8 ci[WG_HASH_LEN];
+    // Private ephemeral
+    uint8 e_priv[WG_PUBLIC_KEY_LEN];
+  };
+  HandshakeState hs_;
+  // Remote's static public key - written only by Initialize
+  uint8 s_remote_[WG_PUBLIC_KEY_LEN];
+  // Remote's preshared key - written only by Initialize
+  uint8 preshared_key_[WG_SYMMETRIC_KEY_LEN];
+  // Precomputed DH(spriv_local, spub_remote).
+  uint8 s_priv_pub_[WG_PUBLIC_KEY_LEN];
+  // The most recently seen timestamp; only higher timestamps are accepted.
+  uint8 last_timestamp_[WG_TIMESTAMP_LEN];
+  // Precomputed key for decrypting cookies from the peer.
+  uint8 precomputed_cookie_key_[WG_SYMMETRIC_KEY_LEN];
+  // Precomputed key for sending MACs to the peer.
+  uint8 precomputed_mac1_key_[WG_SYMMETRIC_KEY_LEN];
+  // The last mac value sent; required to make cookies.
+  uint8 sent_mac1_[WG_COOKIE_LEN];
+  // The mac2 cookie that gets appended to outgoing packets
+  uint8 mac2_cookie_[WG_COOKIE_LEN];
+  // The timestamp of the mac2 cookie
+  uint64 mac2_cookie_timestamp_;
+  int persistent_keepalive_ms_;
+
+  // Allowed IPs
+  std::vector<WgCidrAddr> allowed_ips_;
+};
+
+// RFC6479 - IPsec Anti-Replay Algorithm without Bit Shifting
+class ReplayDetector {
+public:
+  ReplayDetector();
+  ~ReplayDetector();
+
+  bool CheckReplay(uint64 other);
+  enum {
+    BITS_PER_ENTRY = 32,
+    WINDOW_SIZE = 2048 - BITS_PER_ENTRY,
+    BITMAP_SIZE = WINDOW_SIZE / BITS_PER_ENTRY + 1,
+    BITMAP_MASK = BITMAP_SIZE - 1,
+  };
+
+  uint64 expected_seq_nr() const { return expected_seq_nr_; }
+
+private:
+  uint64 expected_seq_nr_;
+  uint32 bitmap_[BITMAP_SIZE];
+};
+
+struct AesGcm128StaticContext;
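+
+// Illustrative sketch (not part of the original header): RFC6479 tracks a
+// sliding window of recently seen counters in a circular bitmap, so duplicates
+// can be rejected without ever shifting bits. A receive path might consult the
+// detector roughly like this, assuming CheckReplay() returns true when
+// |counter| is fresh and marks it as seen:
+//
+//   ReplayDetector replay;
+//   if (!replay.CheckReplay(counter))
+//     return false;  // duplicate or outside the window: drop the packet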
+
+struct WgKeypair {
+  WgPeer *peer;
+
+  // If the key has an addr entry mapping,
+  // then this points at it.
+  WgAddrEntry *addr_entry;
+  // The slot in the addr entry where the key is registered.
+  uint8 addr_entry_slot;
+
+  enum {
+    KEY_INVALID = 0,
+    KEY_VALID = 1,
+    KEY_WANT_REFRESH = 2,
+    KEY_DID_REFRESH = 3,
+  };
+  // True if we are the initiator of the key exchange
+  bool is_initiator;
+
+  // True if we recently saved the peer's address in our table;
+  // avoids doing it too often.
+  bool did_attempt_remember_ip_port;
+
+  // Which features are enabled
+  bool enabled_features[WG_FEATURES_COUNT];
+
+  // True if we want to notify the peer that it can use a short key.
+  uint8 broadcast_short_key;
+
+  // Index of the short key that we can use for outgoing packets.
+  uint8 can_use_short_key_for_outgoing;
+
+  // Whether the key is valid or needs refresh for receives
+  uint8 recv_key_state;
+  // Whether the key is valid or needs refresh for sends
+  uint8 send_key_state;
+
+  // Length of the authentication tag
+  uint8 auth_tag_length;
+
+  // Cipher suite
+  uint8 cipher_suite;
+
+  // Used so we know when to send out ack packets.
+  uint32 incoming_packet_count;
+
+  // Id of the key in our map
+  uint32 local_key_id;
+  // Id of the key in the peer's map
+  uint32 remote_key_id;
+  // The timestamp of when the key was created, so it can be expired
+  uint64 key_timestamp;
+  // The highest acked send_ctr value
+  uint64 send_ctr_acked;
+  // Counter value for chacha20 for outgoing packets
+  uint64 send_ctr;
+  // The key used for chacha20 encryption
+  uint8 send_key[WG_SYMMETRIC_KEY_LEN];
+  // The key used for chacha20 decryption
+  uint8 recv_key[WG_SYMMETRIC_KEY_LEN];
+
+  // Used when a MAC shorter than 16 bytes is enabled, to compress the HMAC into 64 bits.
+  uint64 compress_mac_keys[2][2];
+
+  AesGcm128StaticContext *aes_gcm128_context_;
+
+  // -- everything up to this point is initialized to zero
+  // For replay detection of incoming packets
+  ReplayDetector replay_detector;
+
+#if WITH_HANDSHAKE_EXT
+  // State for the packet compressor
+  IpzipState ipzip_state_;
+#endif  // WITH_HANDSHAKE_EXT
+};
+
+void WgKeypairEncryptPayload(uint8 *dst, const size_t src_len,
+                             const uint8 *ad, const size_t ad_len,
+                             const uint64 nonce, WgKeypair *keypair);
+
+bool WgKeypairDecryptPayload(uint8 *dst, const size_t src_len,
+                             const uint8 *ad, const size_t ad_len,
+                             const uint64 nonce, WgKeypair *keypair);
+
+bool WgKeypairParseExtendedHandshake(WgKeypair *keypair, const uint8 *data, size_t data_size);
+
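+// Illustrative sketch (not part of the original header): judging by the
+// signatures above, encryption appears to operate in place over |dst|,
+// authenticated with |ad|, with the per-packet nonce drawn from the keypair's
+// send counter. A hypothetical send path might look like:
+//
+//   uint64 nonce = keypair->send_ctr++;
+//   WgKeypairEncryptPayload(payload, payload_len, NULL, 0, nonce, keypair);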