TunSafe open source (Same as 1.3-rc3 version)

2018-08-08 13:12:38 +02:00
commit 64bb3cd6b3
198 changed files with 92490 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,18 @@
+/Debug/
+/Release/
+/ipzip2/Debug/
+/Build
+/Win32/
+/TunSafe.aps
+/ipch
+/*.sdf
+/*vcxproj.user
+/*.opensdf
+/*.suo
+/.vs/
+/x64/
+/Azire.conf
+/*.psess
+/*.vspx
+/installer/*.zip
+/config/
--- a/LICENSE.AGPL.TXT
+++ b/LICENSE.AGPL.TXT
@@ -0,0 +1,76 @@
+AFFERO GENERAL PUBLIC LICENSE 
+Version 1, March 2002
+
+Copyright © 2002 Affero Inc. 
+510 Third Street - Suite 225, San Francisco, CA 94107, USA
+
+This license is a modified version of the GNU General Public License copyright (C) 1989, 1991 Free Software Foundation, Inc. made with their permission. Section 2(d) has been added to cover use of software over a computer network.
+
+Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+
+Preamble
+
+The licenses for most software are designed to take away your freedom to share and change it. By contrast, the Affero General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This Public License applies to most of Affero's software and to any other program whose authors commit to using it. (Some other Affero software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.
+
+When we speak of free software, we are referring to freedom, not price. This General Public License is designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.
+
+To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
+
+For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
+
+We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.
+
+Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.
+
+Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.
+
+The precise terms and conditions for copying, distribution and modification follow.
+
+TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this Affero General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".
+Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.
+
+1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.
+You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.
+
+2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
+a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.
+b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.
+c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)
+d) If the Program as you received it is intended to interact with users through a computer network and if, in the version you received, any user interacting with the Program was given the opportunity to request transmission to that user of the Program's complete source code, you must not remove that facility from your modified version of the Program or work based on the Program, and must offer an equivalent opportunity for all users interacting with your Program through a computer network to request immediate transmission by HTTP of the complete source code of your modified version or other derivative work.
+These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.
+
+3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:
+a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
+b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
+c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
+The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
+
+If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.
+
+4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
+5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.
+6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.
+7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.
+If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.
+
+This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.
+
+8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.
+9. Affero Inc. may publish revised and/or new versions of the Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.
+Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by Affero, Inc. If the Program does not specify a version number of this License, you may choose any version ever published by Affero, Inc.
+
+You may also choose to redistribute modified versions of this program under any version of the Free Software Foundation's GNU General Public License version 3 or higher, so long as that version of the GNU GPL includes terms and conditions substantially equivalent to those of this license.
+
+10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by Affero, Inc., write to us; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.
+NO WARRANTY
+
+11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,11 @@
+# TunSafe
+Source code of the TunSafe client.
+
+This open sourced TunSafe code is AGPL-1.0 licensed. Do note that the repository contains BSD and OpenSSL licensed files, so if you want to release a version based off of this repository you need to take that into account.
+
+To build on Windows, open TunSafe.sln and build, or run build.py.
+
+To build on Linux, run build_linux.sh
+
+To build on FreeBSD, run build_freebsd.sh
+
--- a/TunSafe.conf
+++ b/TunSafe.conf
@@ -0,0 +1,16 @@
+[Interface]
+PrivateKey = KMakx+0sYjWKnkY2pO8+CFZ0Sp+Gzzp/GfxwlR+WgXQ=
+ListenPort = 51820
+Address = 192.168.2.2/24
+MTU = 1420
+
+
+[Peer]
+PublicKey = 2m1BdGW9AwwF5dqaGm0NgMggdDZDUPFAL4JxCySdgBw=
+#AllowedIPs = 0.0.0.0/0, fc00::2/64
+AllowedIPs = 192.168.2.0/24
+Endpoint = 192.168.1.4:8040
+#Endpoint = [fe80::6825:68f4:7c6f:42d4]:8040
+PersistentKeepalive = 25
+
+
--- a/TunSafe.rc
+++ b/TunSafe.rc
--- a/TunSafe.sln
+++ b/TunSafe.sln
@@ -0,0 +1,46 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio 15
+VisualStudioVersion = 15.0.26403.7
+MinimumVisualStudioVersion = 10.0.40219.1
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "TunSafe", "TunSafe.vcxproj", "{626FBC16-64C6-407D-BC2B-6C087794E0D0}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|Win32 = Debug|Win32
+		Debug|x64 = Debug|x64
+		Release|Win32 = Release|Win32
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.ActiveCfg = Debug|Win32
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|Win32.Build.0 = Debug|Win32
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.ActiveCfg = Debug|x64
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Debug|x64.Build.0 = Debug|x64
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.ActiveCfg = Release|Win32
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|Win32.Build.0 = Release|Win32
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.ActiveCfg = Release|x64
+		{626FBC16-64C6-407D-BC2B-6C087794E0D0}.Release|x64.Build.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+	GlobalSection(Performance) = preSolution
+		HasPerformanceSessions = true
+	EndGlobalSection
+EndGlobal
--- a/TunSafe.vcxproj
+++ b/TunSafe.vcxproj
@@ -0,0 +1,268 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <ProjectGuid>{626FBC16-64C6-407D-BC2B-6C087794E0D0}</ProjectGuid>
+    <Keyword>Win32Proj</Keyword>
+    <RootNamespace>TunSafe</RootNamespace>
+    <WindowsTargetPlatformVersion>10.0.15063.0</WindowsTargetPlatformVersion>
+    <ProjectName>TunSafe</ProjectName>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <CharacterSet>MultiByte</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <CharacterSet>MultiByte</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <WholeProgramOptimization>true</WholeProgramOptimization>
+    <CharacterSet>MultiByte</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <WholeProgramOptimization>true</WholeProgramOptimization>
+    <CharacterSet>MultiByte</CharacterSet>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+    <Import Project="crypto\nasm.props" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <LinkIncremental>true</LinkIncremental>
+    <TargetName>TunSafe</TargetName>
+    <OutDir>$(SolutionDir)$(Platform)\$(Configuration)\</OutDir>
+    <IntDir>$(Platform)\$(Configuration)\</IntDir>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <LinkIncremental>true</LinkIncremental>
+    <ExecutablePath>$(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm</ExecutablePath>
+    <TargetName>TunSafe</TargetName>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <LinkIncremental>false</LinkIncremental>
+    <TargetName>TunSafe</TargetName>
+    <OutDir>$(SolutionDir)$(Platform)\$(Configuration)\</OutDir>
+    <IntDir>$(Platform)\$(Configuration)\</IntDir>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <LinkIncremental>false</LinkIncremental>
+    <ExecutablePath>$(VC_ExecutablePath_x64);$(WindowsSDK_ExecutablePath);$(VS_ExecutablePath);$(MSBuild_ExecutablePath);$(FxCopDir);$(PATH);C:\Bin\Dev\nasm</ExecutablePath>
+    <TargetName>TunSafe</TargetName>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <PrecompiledHeader>Use</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Disabled</Optimization>
+      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS</PreprocessorDefinitions>
+      <AdditionalIncludeDirectories>.</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Windows</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib</AdditionalDependencies>
+      <UACExecutionLevel>RequireAdministrator</UACExecutionLevel>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <PrecompiledHeader>Use</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Disabled</Optimization>
+      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS=1</PreprocessorDefinitions>
+      <ForcedIncludeFiles>
+      </ForcedIncludeFiles>
+      <AdditionalIncludeDirectories>.</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Windows</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib;Comctl32.lib</AdditionalDependencies>
+      <AdditionalManifestDependencies>
+      </AdditionalManifestDependencies>
+      <UACExecutionLevel>RequireAdministrator</UACExecutionLevel>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <WarningLevel>Level3</WarningLevel>
+      <PrecompiledHeader>Use</PrecompiledHeader>
+      <Optimization>MaxSpeed</Optimization>
+      <FunctionLevelLinking>true</FunctionLevelLinking>
+      <IntrinsicFunctions>true</IntrinsicFunctions>
+      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS</PreprocessorDefinitions>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <AdditionalIncludeDirectories>.</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Windows</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib</AdditionalDependencies>
+      <UACExecutionLevel>RequireAdministrator</UACExecutionLevel>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <WarningLevel>Level3</WarningLevel>
+      <PrecompiledHeader>Use</PrecompiledHeader>
+      <Optimization>MinSpace</Optimization>
+      <FunctionLevelLinking>true</FunctionLevelLinking>
+      <IntrinsicFunctions>true</IntrinsicFunctions>
+      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions);_CRT_SECURE_NO_WARNINGS=1</PreprocessorDefinitions>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <FavorSizeOrSpeed>Size</FavorSizeOrSpeed>
+      <ForcedIncludeFiles>
+      </ForcedIncludeFiles>
+      <InlineFunctionExpansion>AnySuitable</InlineFunctionExpansion>
+      <OmitFramePointers>true</OmitFramePointers>
+      <AdditionalIncludeDirectories>.</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Windows</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies);ws2_32.lib;Iphlpapi.lib</AdditionalDependencies>
+      <UACExecutionLevel>RequireAdministrator</UACExecutionLevel>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemGroup>
+    <ClInclude Include="bit_ops.h" />
+    <ClInclude Include="tunsafe_config.h" />
+    <ClInclude Include="tunsafe_cpu.h" />
+    <ClInclude Include="crypto\aesgcm\aes.h" />
+    <ClInclude Include="crypto\blake2s.h" />
+    <ClInclude Include="crypto\chacha20poly1305.h" />
+    <ClInclude Include="crypto\siphash.h" />
+    <ClInclude Include="tunsafe_endian.h" />
+    <ClInclude Include="netapi.h" />
+    <ClInclude Include="network_win32_api.h" />
+    <ClInclude Include="network_win32_dnsblock.h" />
+    <ClInclude Include="resource.h" />
+    <ClInclude Include="stdafx.h" />
+    <ClInclude Include="tunsafe_types.h" />
+    <ClInclude Include="wireguard_config.h" />
+    <ClInclude Include="util.h" />
+    <ClInclude Include="network_win32.h" />
+    <ClInclude Include="wireguard.h" />
+    <ClInclude Include="wireguard_proto.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="benchmark.cpp" />
+    <ClCompile Include="tunsafe_cpu.cpp" />
+    <ClCompile Include="crypto\aesgcm\aesgcm.cpp" />
+    <ClCompile Include="crypto\blake2s_sse.cpp" />
+    <ClCompile Include="crypto\siphash.cpp" />
+    <ClCompile Include="network_win32_dnsblock.cpp" />
+    <ClCompile Include="util.cpp" />
+    <ClCompile Include="network_win32.cpp" />
+    <ClCompile Include="wireguard.cpp" />
+    <ClCompile Include="crypto\blake2s.cpp">
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">NotUsing</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|x64'">NotUsing</PrecompiledHeader>
+    </ClCompile>
+    <ClCompile Include="crypto\chacha20poly1305.cpp">
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">NotUsing</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|x64'">NotUsing</PrecompiledHeader>
+    </ClCompile>
+    <ClCompile Include="crypto\curve25519-donna.cpp">
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">NotUsing</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|x64'">NotUsing</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">NotUsing</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">NotUsing</PrecompiledHeader>
+    </ClCompile>
+    <ClCompile Include="stdafx.cpp">
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">Create</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">Create</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">Create</PrecompiledHeader>
+      <PrecompiledHeader Condition="'$(Configuration)|$(Platform)'=='Release|x64'">Create</PrecompiledHeader>
+    </ClCompile>
+    <ClCompile Include="wireguard_config.cpp" />
+    <ClCompile Include="tunsafe_win32.cpp" />
+    <ClCompile Include="wireguard_proto.cpp" />
+  </ItemGroup>
+  <ItemGroup>
+    <ResourceCompile Include="TunSafe.rc" />
+  </ItemGroup>
+  <ItemGroup>
+    <Image Include="icons\green-bg-icon.ico" />
+    <Image Include="icons\green-icon.ico" />
+    <Image Include="icons\neutral-icon.ico" />
+    <Image Include="icons\red-icon.ico" />
+  </ItemGroup>
+  <ItemGroup>
+    <NASM Include="crypto\aesgcm\aesni_gcm_x64_nasm.asm">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+    <NASM Include="crypto\aesgcm\aesni_x64_nasm.asm">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+    <NASM Include="crypto\aesgcm\ghash_x64_nasm.asm">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+    <NASM Include="crypto\chacha20_x64.asm">
+      <FileType>Document</FileType>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+    <NASM Include="crypto\curve25519_x64_nasm.asm">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+    <NASM Include="crypto\poly1305_x64_nasm.asm">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">true</ExcludedFromBuild>
+    </NASM>
+  </ItemGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+    <Import Project="crypto\nasm.targets" />
+  </ImportGroup>
+</Project>
--- a/TunSafe.vcxproj.filters
+++ b/TunSafe.vcxproj.filters
@ -0,0 +1,154 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <Filter Include="Source Files">
+      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
+      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
+    </Filter>
+    <Filter Include="crypto">
+      <UniqueIdentifier>{cfa17b4c-1bee-434e-81b4-ba780c3f7e2d}</UniqueIdentifier>
+    </Filter>
+    <Filter Include="Source Files\Win32">
+      <UniqueIdentifier>{49ba9478-f871-449f-a410-b401e993893f}</UniqueIdentifier>
+    </Filter>
+    <Filter Include="crypto\aesgcm">
+      <UniqueIdentifier>{d31b1b9f-4a2e-42d4-a26c-7c3daa4ccbe3}</UniqueIdentifier>
+    </Filter>
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="stdafx.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tunsafe_endian.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="resource.h" />
+    <ClInclude Include="wireguard.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="wireguard_proto.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="util.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="network_win32_dnsblock.h">
+      <Filter>Source Files\Win32</Filter>
+    </ClInclude>
+    <ClInclude Include="network_win32.h">
+      <Filter>Source Files\Win32</Filter>
+    </ClInclude>
+    <ClInclude Include="network_win32_api.h">
+      <Filter>Source Files\Win32</Filter>
+    </ClInclude>
+    <ClInclude Include="crypto\chacha20poly1305.h">
+      <Filter>crypto</Filter>
+    </ClInclude>
+    <ClInclude Include="crypto\blake2s.h">
+      <Filter>crypto</Filter>
+    </ClInclude>
+    <ClInclude Include="wireguard_config.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="netapi.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="crypto\siphash.h">
+      <Filter>crypto</Filter>
+    </ClInclude>
+    <ClInclude Include="tunsafe_types.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="crypto\aesgcm\aes.h">
+      <Filter>crypto\aesgcm</Filter>
+    </ClInclude>
+    <ClInclude Include="tunsafe_cpu.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="bit_ops.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tunsafe_config.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="stdafx.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="wireguard.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="wireguard_proto.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="util.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="network_win32_dnsblock.cpp">
+      <Filter>Source Files\Win32</Filter>
+    </ClCompile>
+    <ClCompile Include="tunsafe_win32.cpp">
+      <Filter>Source Files\Win32</Filter>
+    </ClCompile>
+    <ClCompile Include="network_win32.cpp">
+      <Filter>Source Files\Win32</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\blake2s.cpp">
+      <Filter>crypto</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\blake2s_sse.cpp">
+      <Filter>crypto</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\chacha20poly1305.cpp">
+      <Filter>crypto</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\curve25519-donna.cpp">
+      <Filter>crypto</Filter>
+    </ClCompile>
+    <ClCompile Include="wireguard_config.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\siphash.cpp">
+      <Filter>crypto</Filter>
+    </ClCompile>
+    <ClCompile Include="crypto\aesgcm\aesgcm.cpp">
+      <Filter>crypto\aesgcm</Filter>
+    </ClCompile>
+    <ClCompile Include="benchmark.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tunsafe_cpu.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+  </ItemGroup>
+  <ItemGroup>
+    <ResourceCompile Include="TunSafe.rc" />
+  </ItemGroup>
+  <ItemGroup>
+    <Image Include="icons\neutral-icon.ico" />
+    <Image Include="icons\green-icon.ico" />
+    <Image Include="icons\red-icon.ico" />
+    <Image Include="icons\green-bg-icon.ico" />
+  </ItemGroup>
+  <ItemGroup>
+    <NASM Include="crypto\chacha20_x64.asm">
+      <Filter>crypto</Filter>
+    </NASM>
+    <NASM Include="crypto\curve25519_x64_nasm.asm">
+      <Filter>crypto</Filter>
+    </NASM>
+    <NASM Include="crypto\poly1305_x64_nasm.asm">
+      <Filter>crypto</Filter>
+    </NASM>
+    <NASM Include="crypto\aesgcm\aesni_gcm_x64_nasm.asm">
+      <Filter>crypto\aesgcm</Filter>
+    </NASM>
+    <NASM Include="crypto\aesgcm\aesni_x64_nasm.asm">
+      <Filter>crypto\aesgcm</Filter>
+    </NASM>
+    <NASM Include="crypto\aesgcm\ghash_x64_nasm.asm">
+      <Filter>crypto\aesgcm</Filter>
+    </NASM>
+  </ItemGroup>
+</Project>
--- a/benchmark.cpp
+++ b/benchmark.cpp
@ -0,0 +1,94 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus <info@tunsafe.com>. All Rights Reserved.
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "crypto/chacha20poly1305.h"
+#include "crypto/aesgcm/aes.h"
+#include "tunsafe_cpu.h"
+
+#include <functional>
+#include <string.h>
+
+#if defined(OS_FREEBSD) || defined(OS_LINUX)
+#include <time.h>
+#include <stdlib.h>
+typedef uint64 LARGE_INTEGER;
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  struct timespec ts;
+  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
+    fprintf(stderr, "clock_gettime failed\n");
+    exit(1);
+  }
+  *x = (uint64)ts.tv_sec * 1000000000 + ts.tv_nsec;
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  *x = 1000000000;
+}
+#elif defined(OS_MACOSX)
+#include <mach/mach.h>
+#include <mach/mach_time.h>
+typedef uint64 LARGE_INTEGER;
+
+void QueryPerformanceCounter(LARGE_INTEGER *x) {
+  *x = mach_absolute_time();
+}
+
+void QueryPerformanceFrequency(LARGE_INTEGER *x) {
+  mach_timebase_info_data_t timebase = { 0, 0 };
+  if (mach_timebase_info(&timebase) != 0)
+    abort();
+  printf("numer/denom: %u %u\n", timebase.numer, timebase.denom);
+  *x = (uint64)timebase.denom * 1000000000 / timebase.numer;
+}
+
+#endif
+
+int gcm_self_test();
+
+
+
+void *fake_glb;
+void Benchmark() {
+  int64 a, b, f, t1 = 0, t2 = 0;
+
+#if WITH_AESGCM
+  gcm_self_test();
+#endif  // WITH_AESGCM
+
+  PrintCpuFeatures();
+
+  QueryPerformanceFrequency((LARGE_INTEGER*)&f);
+
+  uint8 dst[1500 + 16];
+  uint8 key[32] = {0, 1, 2, 3, 4, 5, 6};
+  uint8 mac[16];
+
+  fake_glb = dst;
+
+  auto RunOneBenchmark = [&](const char *name, const std::function<uint64(size_t)> &ff) {
+    uint64 bytes = 0;
+    QueryPerformanceCounter((LARGE_INTEGER*)&b);
+    size_t i;
+    for (i = 0; bytes < 1000000000; i++)
+      bytes += ff(i);
+    QueryPerformanceCounter((LARGE_INTEGER*)&a);
+    RINFO("%s: %f MB/s", name, (double)bytes * 0.000001 / (a - b) * f);
+  };
+
+  memset(dst, 0, sizeof(dst));
+  RunOneBenchmark("chacha20-encrypt", [&](size_t i) -> uint64 { chacha20poly1305_encrypt(dst, dst, 1460, NULL, 0, i, key); return 1460; });
+  RunOneBenchmark("chacha20-decrypt", [&](size_t i) -> uint64 { chacha20poly1305_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+  RunOneBenchmark("poly1305-only", [&](size_t i) -> uint64 { poly1305_get_mac(dst, 1460, NULL, 0, i, key, mac); return 1460; });
+
+#if WITH_AESGCM
+  if (X86_PCAP_AES) {
+    AesGcm128StaticContext sctx;
+    CRYPTO_gcm128_init(&sctx, key, 128);
+
+    RunOneBenchmark("aes128-gcm-encrypt", [&](size_t i) -> uint64 { aesgcm_encrypt(dst, dst, 1460, NULL, 0, i, &sctx); return 1460; });
+    RunOneBenchmark("aes128-gcm-decrypt", [&](size_t i) -> uint64 { aesgcm_decrypt_get_mac(dst, dst, 1460, NULL, 0, i, &sctx, mac); return 1460; });
+  }
+#endif   //  WITH_AESGCM
+}
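
The MB/s figure printed by RunOneBenchmark is the byte count divided by elapsed seconds, where elapsed seconds is the tick delta divided by the counter frequency. The Python sketch below reproduces that arithmetic outside the C++ code, including the conversion from a Mach timebase (ticks * numer / denom = nanoseconds) to ticks per second. The function names are hypothetical and exist only for illustration; they are not part of TunSafe.

```python
def throughput_mb_per_s(byte_count, start_ticks, end_ticks, ticks_per_second):
    """Mirror of the expression passed to RINFO: bytes * 1e-6 / ticks * frequency."""
    elapsed_ticks = end_ticks - start_ticks
    if elapsed_ticks <= 0:
        raise ValueError("end_ticks must be greater than start_ticks")
    return byte_count * 0.000001 / elapsed_ticks * ticks_per_second


def mach_ticks_per_second(numer, denom):
    """mach_absolute_time() ticks convert to nanoseconds as ticks * numer / denom,
    so one second corresponds to 1e9 * denom / numer ticks."""
    if numer <= 0 or denom <= 0:
        raise ValueError("numer and denom must be positive")
    return 1000000000 * denom // numer


if __name__ == "__main__":
    # 1 GB processed in two seconds on a nanosecond-resolution clock.
    print(throughput_mb_per_s(1000000000, 0, 2000000000, 1000000000))
    # A timebase of 1/1 means the counter already ticks in nanoseconds.
    print(mach_ticks_per_second(1, 1))
```

When numer and denom are both 1 the frequency is simply 1e9, which is also what the Linux/FreeBSD branch hard-codes for CLOCK_MONOTONIC.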
--- a/bit_ops.h
+++ b/bit_ops.h
@ -0,0 +1,49 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus <info@tunsafe.com>. All Rights Reserved.
+#pragma once
+
+#include "tunsafe_types.h"
+#include "tunsafe_endian.h"
+
+#if !defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+static inline int _BitScanReverse64(unsigned long *index, uint64 x) {
+  if (_BitScanReverse(index, x >> 32)) {
+    (*index) += 32;
+    return true;
+  }
+  return _BitScanReverse(index, (uint32)x);
+}
+#endif
+
+#if !defined(COMPILER_MSVC)
+static inline int _BitScanReverse64(unsigned long *index, uint64 x) {
+  *index = x ? 63 - __builtin_clzll(x) : 0;  // __builtin_clzll(0) is undefined
+  return (x != 0);
+}
+
+static inline int _BitScanReverse(unsigned long *index, uint32 x) {
+  *index = x ? 31 - __builtin_clz(x) : 0;  // __builtin_clz(0) is undefined
+  return (x != 0);
+}
+
+#endif
+
+static inline int FindHighestSetBit32(uint32 x) {
+  unsigned long index;
+  return _BitScanReverse(&index, x) ? (int)(index + 1) : 0;
+}
+
+static inline int FindLastSetBit32(uint32 x) {
+  unsigned long index = 0;  // avoid returning an uninitialized value
+  _BitScanReverse(&index, x);
+  return (int)index;
+}
+
+static inline int FindHighestSetBit64(uint64 x) {
+  unsigned long index;
+  return _BitScanReverse64(&index, x) ? (int)(index + 1) : 0;
+}
+
+static inline int FindHighestSetBit128(uint64 hi, uint64 lo) {
+  return hi ? 64 + FindHighestSetBit64(hi) : FindHighestSetBit64(lo);
+}
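
FindHighestSetBit32 and FindHighestSetBit64 return the 1-based position of the most significant set bit and 0 for a zero input, which is the same rule as Python's `int.bit_length()`. The sketch below is a small reference model that could be used to cross-check the C++ helpers. The Python function names are hypothetical and are not part of the header.

```python
def find_highest_set_bit(x, width):
    """Return the 1-based position of the most significant set bit, or 0 if x == 0."""
    if x < 0 or x >= (1 << width):
        raise ValueError("x does not fit in %d bits" % width)
    return x.bit_length()


def find_highest_set_bit_128(hi, lo):
    """Same rule for a 128-bit value split into two 64-bit halves."""
    if hi:
        return 64 + find_highest_set_bit(hi, 64)
    return find_highest_set_bit(lo, 64)


if __name__ == "__main__":
    for value in (0, 1, 0x80, 0x80000000):
        print(hex(value), find_highest_set_bit(value, 32))
```

The zero-based variant, FindLastSetBit32, corresponds to `x.bit_length() - 1` and is only meaningful for non-zero inputs.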
--- a/build.py
+++ b/build.py
@ -0,0 +1,95 @@
+# SPDX-License-Identifier: AGPL-1.0-only
+# Copyright (C) 2018 Ludvig Strigeus <info@tunsafe.com>. All Rights Reserved.
+import os
+import shutil
+import win32crypt
+import base64
+import sys
+import zipfile
+import re
+
+MSBUILD_PATH = r"C:\Dev\VS2017\MSBuild\15.0\Bin\MSBuild.exe"
+NSIS_PATH = r'C:\Dev\NSIS\makeNSIS.EXE'
+
+SIGNTOOL_PATH = r'c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe'
+SIGNTOOL_KEY_PATH = '' # put key here
+SIGNTOOL_PASS = '' # put key pass here
+
+def RmTree(path):
+  try:
+    print ('Deleting %s' % path)
+    shutil.rmtree(path)
+  except FileNotFoundError:
+    pass
+  
+def Run(s):
+  print ('Running %s' % s)
+  x = os.system(s)
+  if x:
+    raise Exception('Command failed (%d) : %s' % (x, s))
+
+def CopyFile(src, dst):
+  shutil.copyfile(src, dst)
+
+def SignExe(src):
+  print ('Signing %s' % src)
+  cmd = r'""%s" sign /f "%s" /p %s /t http://timestamp.verisign.com/scripts/timstamp.dll "%s""' % (SIGNTOOL_PATH, SIGNTOOL_KEY_PATH, SIGNTOOL_PASS, src)
+  #cmd = r'""c:\Program Files (x86)\Windows Kits\10\bin\10.0.15063.0\x86\signtool.exe" sign %s ' % (SIGNTOOL_KEY_PATH, )
+  x = os.system(cmd)
+  if x:
+    raise Exception('Signing failed (%d) : %s' % (x, cmd))
+
+def GetVersion():
+  for line in open(BASE + '/tunsafe_config.h', 'r'):
+    m = re.match('^#define TUNSAFE_VERSION_STRING "TunSafe (.*)"$', line)
+    if m:
+      return m.group(1)
+  raise Exception('Version not found')
+
+#
+
+#os.system(r'""')
+
+command = sys.argv[1]
+
+BASE = r'D:\Code\TunSafe'
+
+
+if command == 'build_tap':
+  Run(r'%s /V4 installer\tap\tap-windows6.nsi'  % NSIS_PATH)
+  SignExe(r'installer\tap\TunSafe-TAP-9.21.2.exe')
+  sys.exit(0)
+
+if 1:
+  RmTree(BASE + r'\Win32\Release')
+  RmTree(BASE + r'\x64\Release')
+  Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=x64' % MSBUILD_PATH)
+  Run('%s TunSafe.sln /t:Clean;Rebuild /p:Configuration=Release /p:Platform=Win32' % MSBUILD_PATH)
+
+if 1:
+  CopyFile(BASE + r'\Win32\Release\TunSafe.exe',
+           BASE + r'\installer\x86\TunSafe.exe')
+
+  SignExe(BASE + r'\installer\x86\TunSafe.exe')
+  CopyFile(BASE + r'\x64\Release\TunSafe.exe',
+           BASE + r'\installer\x64\TunSafe.exe')
+  SignExe(BASE + r'\installer\x64\TunSafe.exe')
+
+VERSION = GetVersion()
+
+Run(r'%s /V4 -DPRODUCT_VERSION=%s installer\tunsafe.nsi ' % (NSIS_PATH, VERSION))
+SignExe(BASE + r'\installer\TunSafe-%s.exe' % VERSION)
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x86.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x86\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close()
+
+zipf = zipfile.ZipFile(BASE + r'\installer\TunSafe-%s-x64.zip' % VERSION, 'w', zipfile.ZIP_DEFLATED)
+zipf.write(BASE + r'\installer\x64\TunSafe.exe', 'TunSafe.exe')
+zipf.write(BASE + r'\installer\License.txt', 'License.txt')
+zipf.write(BASE + r'\installer\ChangeLog.txt', 'ChangeLog.txt')
+zipf.write(BASE + r'\installer\TunSafe.conf', 'Config\\TunSafe.conf')
+zipf.close()
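
GetVersion scans tunsafe_config.h line by line for the TUNSAFE_VERSION_STRING define and returns the text after "TunSafe ". The sketch below isolates that parsing step so it can be exercised without the build environment. It takes the header text as a string instead of opening a file, and `parse_version` and the sample header content are made up for illustration.

```python
import re

# Same pattern as in GetVersion, written as a raw string and compiled once.
VERSION_RE = re.compile(r'^#define TUNSAFE_VERSION_STRING "TunSafe (.*)"$')


def parse_version(header_text):
    """Return the version from the first matching #define line."""
    for line in header_text.splitlines():
        m = VERSION_RE.match(line)
        if m:
            return m.group(1)
    raise ValueError('Version not found')


if __name__ == '__main__':
    # Hypothetical header content; the real tunsafe_config.h is not shown here.
    sample = '#pragma once\n#define TUNSAFE_VERSION_STRING "TunSafe 1.3-rc3"\n'
    print(parse_version(sample))
```

Keeping the regular expression in one place makes it easier to see that the define must sit on a single line, exactly in the form `"TunSafe <version>"`, for the build script to find it.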
--- a/build_config.h
+++ b/build_config.h
@ -0,0 +1,116 @@
+// File is taken from Chromium
+#ifndef BUILD_BUILD_CONFIG_H_
+#define BUILD_BUILD_CONFIG_H_
+
+#if defined(__APPLE__)
+#include <TargetConditionals.h>
+#endif
+
+// A set of macros to use for platform detection.
+#if defined(__APPLE__)
+#define OS_MACOSX 1
+#if defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#define OS_IOS 1
+#endif  // defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE
+#elif defined(ANDROID)
+#define OS_ANDROID 1
+#elif defined(__native_client__)
+#define OS_NACL 1
+#elif defined(__FLASHPLAYER)
+#define OS_FLASHPLAYER 1
+#elif defined(__linux__)
+#define OS_LINUX 1
+#elif defined(_WIN32)
+#define OS_WIN 1
+#elif defined(__FreeBSD__)
+#define OS_FREEBSD 1
+#elif defined(__OpenBSD__)
+#define OS_OPENBSD 1
+#elif defined(__sun)
+#define OS_SOLARIS 1
+#elif defined(EMSCRIPTEN)
+#define OS_EMSCRIPTEN 1
+#else
+#error Please add support for your platform in build_config.h
+#endif
+
+// For access to standard BSD features, use OS_BSD instead of a
+// more specific macro.
+#if defined(OS_FREEBSD) || defined(OS_OPENBSD)
+#define OS_BSD 1
+#endif
+
+// For access to standard POSIXish features, use OS_POSIX instead of a
+// more specific macro.
+#if defined(OS_MACOSX) || defined(OS_LINUX) || defined(OS_FREEBSD) ||     \
+    defined(OS_OPENBSD) || defined(OS_SOLARIS) || defined(OS_ANDROID) ||  \
+    defined(OS_NACL)
+#define OS_POSIX 1
+#endif
+
+#if defined(OS_POSIX) && !defined(OS_MACOSX) && !defined(OS_ANDROID) && \
+    !defined(OS_NACL)
+#define USE_X11 1  // Use X for graphics.
+#endif
+
+// Compiler detection.
+#if defined(__GNUC__)
+#define COMPILER_GCC 1
+
+#if defined(__clang__)
+#define COMPILER_CLANG 1
+#endif
+#elif defined(_MSC_VER)
+#define COMPILER_MSVC 1
+#elif defined(__TINYC__)
+#define COMPILER_TCC 1
+#else
+#error Please add support for your compiler in build/build_config.h
+#endif
+
+// Processor architecture detection.  For more info on what's defined, see:
+//   http://msdn.microsoft.com/en-us/library/b0084kay.aspx
+//   http://www.agner.org/optimize/calling_conventions.pdf
+//   or with gcc, run: "echo | gcc -E -dM -"
+#if defined(_M_X64) || defined(__x86_64__)
+#define ARCH_CPU_X86_FAMILY 1
+#define ARCH_CPU_X86_64 1
+#define ARCH_CPU_64_BITS 1
+#define ARCH_CPU_LITTLE_ENDIAN 1
+#define ARCH_CPU_ALLOW_UNALIGNED 1
+#elif defined(_M_IX86) || defined(__i386__)
+#define ARCH_CPU_X86_FAMILY 1
+#define ARCH_CPU_X86 1
+#define ARCH_CPU_32_BITS 1
+#define ARCH_CPU_LITTLE_ENDIAN 1
+#define ARCH_CPU_ALLOW_UNALIGNED 1
+#define ARCH_CPU_NEED_64BIT_ALIGN 1
+#elif defined(__ARMEL__) || defined(__arm__) && defined(__ARMCC_VERSION)
+#define ARCH_CPU_ARM_FAMILY 1
+#define ARCH_CPU_ARMEL 1
+#define ARCH_CPU_32_BITS 1
+#define ARCH_CPU_LITTLE_ENDIAN 1
+#elif defined(__pnacl__)
+#define ARCH_CPU_32_BITS 1
+#elif defined(__MIPSEL__)
+#define ARCH_CPU_MIPS_FAMILY 1
+#define ARCH_CPU_MIPSEL 1
+#define ARCH_CPU_32_BITS 1
+#define ARCH_CPU_LITTLE_ENDIAN 1
+#elif defined(EMSCRIPTEN)
+#define ARCH_CPU_JS 1
+#define ARCH_CPU_32_BITS 1
+#define ARCH_CPU_LITTLE_ENDIAN 1
+#elif defined(__FLASHPLAYER)
+#define ARCH_CPU_FLASHPLAYER 1
+#define ARCH_CPU_32_BITS 1
+#else
+#error Please add support for your architecture in build_config.h
+#endif
+
+#if defined(ARCH_CPU_LITTLE_ENDIAN) && defined(ARCH_CPU_BIG_ENDIAN) || !defined(ARCH_CPU_LITTLE_ENDIAN) && !defined(ARCH_CPU_BIG_ENDIAN)
+#error Please add support for your endianness in build_config.h
+#endif
+
+
+#endif  // BUILD_BUILD_CONFIG_H_
--- a/build_freebsd.sh
+++ b/build_freebsd.sh
@ -0,0 +1,2 @@
+g++7 -I . -O2 -static -mssse3 -o tunsafe benchmark.cpp tunsafe_cpu.cpp wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp crypto/siphash.cpp crypto/chacha20_x64_gas.s crypto/poly1305_x64_gas.s ipzip2/ipzip2.cpp -lrt
+
--- a/build_linux.sh
+++ b/build_linux.sh
@ -0,0 +1,9 @@
+#!/bin/sh
+clang++-6.0 -c -march=skylake-avx512 crypto/poly1305_x64_gas.s crypto/chacha20_x64_gas.s 
+clang++-6.0 -I . -O3 -mssse3 -pthread -o tunsafe util.cpp wireguard_config.cpp wireguard.cpp \
+wireguard_proto.cpp network_bsd_mt.cpp tunsafe_cpu.cpp benchmark.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp \
+crypto/curve25519-donna.cpp crypto/siphash.cpp chacha20_x64_gas.o crypto/aesgcm/aesni_gcm_x64_gas.s \
+crypto/aesgcm/aesni_x64_gas.s crypto/aesgcm/aesgcm.cpp poly1305_x64_gas.o ipzip2/ipzip2.cpp \
+crypto/aesgcm/ghash_x64_gas.s -lrt
+
+
--- a/build_osx.sh
+++ b/build_osx.sh
@ -0,0 +1,17 @@
+set -e
+
+
+clang++ -c -mavx512f -mavx512vl crypto/poly1305_x64_gas_macosx.s crypto/chacha20_x64_gas_macosx.s 
+
+clang++ -g -O3 -I . -std=c++11 -DNDEBUG=1 -fno-exceptions -fno-rtti -ffunction-sections -o tunsafe \
+wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd_mt.cpp benchmark.cpp tunsafe_cpu.cpp \
+crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp \
+crypto/siphash.cpp crypto/aesgcm/aesgcm.cpp ipzip2/ipzip2.cpp \
+crypto/aesgcm/aesni_gcm_x64_gas_macosx.s crypto/aesgcm/aesni_x64_gas_macosx.s crypto/aesgcm/ghash_x64_gas_macosx.s \
+chacha20_x64_gas_macosx.o poly1305_x64_gas_macosx.o
+
+cp tunsafe tunsafe.unstripped
+strip tunsafe
+rm -f tunsafe_osx.zip
+zip tunsafe_osx.zip tunsafe readme_osx.txt
+
--- a/crypto/.gitignore
+++ b/crypto/.gitignore
@ -0,0 +1 @@
+/old/
--- a/crypto/aesgcm/aes.h
+++ b/crypto/aesgcm/aes.h
@ -0,0 +1,84 @@
+/**
+ * Downloaded from
+ *
+ *  http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-fst-3.0.zip
+ *
+ * rijndael-alg-fst.h
+ *
+ * @version 3.0 (December 2000)
+ *
+ * Optimised ANSI C code for the Rijndael cipher (now AES)
+ *
+ * @author Vincent Rijmen <vincent.rijmen@esat.kuleuven.ac.be>
+ * @author Antoon Bosselaers <antoon.bosselaers@esat.kuleuven.ac.be>
+ * @author Paulo Barreto <paulo.barreto@terra.com.br>
+ *
+ * This code is hereby placed in the public domain.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS
+ * OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+ * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+ * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef __RIJNDAEL_ALG_FST_H
+#define __RIJNDAEL_ALG_FST_H
+
+#include "tunsafe_types.h"
+
+#define AESGCM_MAXNR	14
+
+struct AesContext {
+  uint32 rk[(AESGCM_MAXNR + 1) * 4];
+  int rounds;
+};
+
+typedef struct { uint64 hi, lo; } aesgcm_u128;
+
+struct AesGcm128StaticContext {
+  void(*gmult)(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+  void(*ghash)(uint64 Xi[2], const aesgcm_u128 Htable[16], const uint8 *inp, size_t len);
+  bool use_aesni_gcm_crypt;
+
+  // Keep H and Htable adjacent and in this order; the asm code depends on it.
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16 / sizeof(size_t)]; } H;
+  aesgcm_u128 Htable[16];
+  AesContext aes;
+};
+
+struct AesGcm128TempContext {
+  AesGcm128StaticContext *sctx;
+  union { uint64 u[2]; uint32 d[4]; uint8 c[16]; size_t t[16/sizeof(size_t)]; } EKi,EK0,len, Yi, Xi;
+  unsigned int mres, ares;
+};
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size);
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv,size_t len);
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len);
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len);
+void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx, unsigned char *tag, size_t len);
+
+void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                    const uint8 *ad, const size_t ad_len,
+                    const uint64 nonce, AesGcm128StaticContext *sctx);
+
+void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,
+                            const uint8 *ad, const size_t ad_len,
+                            const uint64 nonce, AesGcm128StaticContext *sctx,
+                            uint8 mac[16]);
+
+#if defined(ARCH_CPU_X86_64)
+#define WITH_AESGCM 0
+#endif
+
+
+
+#endif /* __RIJNDAEL_ALG_FST_H */
--- a/crypto/aesgcm/aesgcm.cpp
+++ b/crypto/aesgcm/aesgcm.cpp
@ -0,0 +1,882 @@
+#include "stdafx.h"
+#include "tunsafe_types.h"
+#include "tunsafe_endian.h"
+#include "tunsafe_cpu.h"
+#include "crypto/aesgcm/aes.h"
+#include <assert.h>
+#include <string.h>
+#include <stdio.h>
+//#include <Windows.h>
+#include "crypto/chacha20poly1305.h"
+#define AESNIGCM_ASM 1
+#define AESGCM_ASM 1
+#define AESNI_GCM 1
+
+// AES-GCM is only implemented for x86-64.
+#if WITH_AESGCM
+
+extern "C" {
+void gcm_init_clmul(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_clmul(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_init_avx(aesgcm_u128 Htable[16],const uint64 Xi[2]);
+void gcm_gmult_avx(uint64 Xi[2],const aesgcm_u128 Htable[16]);
+void gcm_ghash_avx(uint64 Xi[2],const aesgcm_u128 Htable[16],const uint8 *inp,size_t len);
+void gcm_gmult_4bit(uint64 Xi[2], const aesgcm_u128 Htable[16]);
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len);
+
+// ivec points to Yi followed by Xi
+// h_and_htable points at h and htable from the static context
+size_t aesni_gcm_encrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+size_t aesni_gcm_decrypt(const uint8 *in,uint8 *out,size_t len,const void *key,uint8 ivec_and_xi[16],uint64 *h_and_htable);
+void aesni_ctr32_encrypt_blocks(const void *in, void *out, size_t blocks, const AesContext *key, const uint8 *ivec);
+void aesni_encrypt(const void *inp, void *out, const AesContext *key);
+void aesni_decrypt(const void *inp, void *out, const AesContext *key);
+int aesni_set_encrypt_key(const unsigned char *inp, int bits, AesContext *key);
+int aesni_set_decrypt_key(const unsigned char *inp, int bits, AesContext *key);
+}  // extern "C"
+
+
+#define GCM_MUL(ctx,Xi)  (*gcm_gmult_p)(ctx->Xi.u,sctx->Htable)
+#define GHASH(ctx,in,len) (*gcm_ghash_p)(ctx->Xi.u,sctx->Htable,in,len)
+#define GHASH_CHUNK       (3*1024)
+
+void CRYPTO_gcm128_aad(AesGcm128TempContext *ctx,const uint8 *aad,size_t len) {
+  size_t i;
+  unsigned int n;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64 alen = ctx->len.u[0];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16])  = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+
+  // AAD must be supplied before any message data has been processed.
+  assert(!ctx->len.u[1]);
+  alen += len;
+//  if (alen>(uint64(1)<<61) || (sizeof(len)==8 && alen<len))
+//    return -1;
+  ctx->len.u[0] = alen;
+
+  n = ctx->ares;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(aad++);
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->ares = n;
+      return;
+    }
+  }
+
+#ifdef GHASH
+  if ((i = (len&(size_t)-16))) {
+    GHASH(ctx,aad,i);
+    aad += i;
+    len -= i;
+  }
+#else
+  while (len>=16) {
+    for (i=0; i<16; ++i) ctx->Xi.c[i] ^= aad[i];
+    GCM_MUL(ctx,Xi);
+    aad += 16;
+    len -= 16;
+  }
+#endif
+  if (len) {
+    n = (unsigned int)len;
+    for (i=0; i<len; ++i) ctx->Xi.c[i] ^= aad[i];
+  }
+
+  ctx->ares = n;
+}
+
+void CRYPTO_gcm128_encrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  uint64        mlen  = ctx->len.u[1];
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16])  = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) = sctx->ghash;
+  mlen += len;
+//  if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len))
+//    return -1;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to encrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      ctx->Xi.c[n] ^= *(out++) = *(in++)^ctx->EKi.c[n];
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt && len >= 0x120) {
+    // |aesni_gcm_encrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
+    size_t bulk = aesni_gcm_encrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u);
+    in += bulk;
+    out += bulk;
+    len -= bulk;
+  }
+#endif
+  ctr = ReadBE32(ctx->Yi.c + 12);
+
+#if defined(STRICT_ALIGNMENT)
+  if (((size_t)in | (size_t)out) % sizeof(size_t) != 0) {
+    for (i = 0; i<len; ++i) {
+      if (n == 0) {
+        aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+        ++ctr;
+        WriteBE32(ctx->Yi.c + 12, ctr);
+      }
+      ctx->Xi.c[n] ^= out[i] = in[i] ^ ctx->EKi.c[n];
+      n = (n + 1) % 16;
+      if (n == 0)
+        GCM_MUL(ctx, Xi);
+    }
+    ctx->mres = n;
+    return;
+  }
+#endif
+  while (len>=GHASH_CHUNK) {
+    aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, GHASH_CHUNK);
+    ctr += GHASH_CHUNK / 16;
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    in += GHASH_CHUNK;
+    out += GHASH_CHUNK;
+    len -= GHASH_CHUNK;
+  }
+  if ((i = (len&(size_t)-16))) {
+    aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c);
+    GHASH(ctx, out, i);
+    ctr += (uint32)(i / 16);
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    out += i;
+    in += i;
+    len -= i;
+  }
+  if (len) {
+    aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+    ++ctr;
+    WriteBE32(ctx->Yi.c+12,ctr);
+    while (len--) {
+      ctx->Xi.c[n] ^= out[n] = in[n] ^ ctx->EKi.c[n];
+      ++n;
+    }
+  }
+  ctx->mres = n;
+}
+
+void CRYPTO_gcm128_decrypt_ctr32(AesGcm128TempContext *ctx, const uint8 *in, uint8 *out, size_t len) {
+  unsigned int n, ctr;
+  size_t i;
+  uint64        mlen  = ctx->len.u[1];
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16])  = sctx->gmult;
+  void (*gcm_ghash_p)(uint64 Xi[2],const aesgcm_u128 Htable[16],  const uint8 *inp,size_t len) = sctx->ghash;
+
+  mlen += len;
+//  if (mlen>((uint64(1)<<36)-32) || (sizeof(len)==8 && mlen<len))
+//    return -1;
+  ctx->len.u[1] = mlen;
+
+  if (ctx->ares) {
+    /* First call to decrypt finalizes GHASH(AAD) */
+    GCM_MUL(ctx,Xi);
+    ctx->ares = 0;
+  }
+
+  n = ctx->mres;
+  if (n) {
+    while (n && len) {
+      uint8 c = *(in++);
+      *(out++) = c^ctx->EKi.c[n];
+      ctx->Xi.c[n] ^= c;
+      --len;
+      n = (n+1)%16;
+    }
+    if (n==0) GCM_MUL(ctx,Xi);
+    else {
+      ctx->mres = n;
+      return;
+    }
+  }
+
+#if defined(AESNI_GCM)
+  if (sctx->use_aesni_gcm_crypt) {
+    // |aesni_gcm_decrypt| may not process all the input given to it. It may
+    // not process *any* of its input if it is deemed too small.
+    size_t bulk = aesni_gcm_decrypt(in, out, len, &sctx->aes, ctx->Yi.c, sctx->H.u);
+    in += bulk;
+    out += bulk;
+    len -= bulk;
+  }
+#endif
+  ctr = ReadBE32(ctx->Yi.c + 12);
+
+#if defined(STRICT_ALIGNMENT)
+  if (((size_t)in|(size_t)out)%sizeof(size_t) != 0) {
+    for (i=0;i<len;++i) {
+      uint8 c;
+      if (n==0) {
+        aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+        ++ctr;
+        WriteBE32(ctx->Yi.c+12,ctr);
+      }
+      c = in[i];
+      out[i] = c^ctx->EKi.c[n];
+      ctx->Xi.c[n] ^= c;
+      n = (n+1)%16;
+      if (n==0)
+        GCM_MUL(ctx,Xi);
+    }
+    ctx->mres = n;
+    return;
+  }
+#endif
+  while (len >= GHASH_CHUNK) {
+    GHASH(ctx, in, GHASH_CHUNK);
+    aesni_ctr32_encrypt_blocks(in, out, GHASH_CHUNK / 16, &sctx->aes, ctx->Yi.c);
+    ctr += GHASH_CHUNK / 16;
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    in += GHASH_CHUNK;
+    out += GHASH_CHUNK;
+    len -= GHASH_CHUNK;
+  }
+  if ((i = (len&(size_t)-16))) {
+    GHASH(ctx, in, i);
+    aesni_ctr32_encrypt_blocks(in, out, i / 16, &sctx->aes, ctx->Yi.c);
+    ctr += (uint32)(i / 16);
+    WriteBE32(ctx->Yi.c + 12, ctr);
+    out += i;
+    in += i;
+    len -= i;
+  }
+  if (len) {
+    aesni_encrypt(ctx->Yi.c, ctx->EKi.c, &sctx->aes);
+    ++ctr;
+    WriteBE32(ctx->Yi.c+12,ctr);
+    while (len--) {
+      uint8 c = in[n];
+      ctx->Xi.c[n] ^= c;
+      out[n] = c^ctx->EKi.c[n];
+      ++n;
+    }
+  }
+  ctx->mres = n;
+}
+
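+// Writes the first |len| bytes of the authentication tag to |tag|.
+// |len| must not exceed 16, the size of ctx->Xi.
+void CRYPTO_gcm128_finish(AesGcm128TempContext *ctx,uint8 *tag, size_t len) {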
+  uint64 alen = ctx->len.u[0]<<3;
+  uint64 clen = ctx->len.u[1]<<3;
+  AesGcm128StaticContext *sctx = ctx->sctx;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16])  = sctx->gmult;
+
+  if (ctx->mres || ctx->ares)
+    GCM_MUL(ctx,Xi);
+
+  alen = ToBE64(alen);
+  clen = ToBE64(clen);
+
+  ctx->Xi.u[0] ^= alen;
+  ctx->Xi.u[1] ^= clen;
+  GCM_MUL(ctx,Xi);
+
+  ctx->Xi.u[0] ^= ctx->EK0.u[0];
+  ctx->Xi.u[1] ^= ctx->EK0.u[1];
+
+  memcpy(tag, ctx->Xi.c,len);
+}
+
+#define REDUCE1BIT(V) do { \
+  if (sizeof(size_t)==8) { \
+    uint64 T = 0xe100000000000000ull & (0-(V.lo&1)); \
+    V.lo  = (V.hi<<63)|(V.lo>>1); \
+    V.hi  = (V.hi>>1 )^T; \
+  } else { \
+    uint32 T = 0xe1000000U & (0-(uint32)(V.lo&1)); \
+    V.lo  = (V.hi<<63)|(V.lo>>1); \
+    V.hi  = (V.hi>>1 )^((uint64)T<<32); \
+  } \
+} while(0)
+
+static void gcm_init_4bit(aesgcm_u128 Htable[16], uint64 H[2]) {
+  aesgcm_u128 V;
+
+  Htable[0].hi = 0;
+  Htable[0].lo = 0;
+  V.hi = H[0];
+  V.lo = H[1];
+
+  Htable[8] = V;
+  REDUCE1BIT(V);
+  Htable[4] = V;
+  REDUCE1BIT(V);
+  Htable[2] = V;
+  REDUCE1BIT(V);
+  Htable[1] = V;
+  Htable[3].hi  = V.hi^Htable[2].hi, Htable[3].lo  = V.lo^Htable[2].lo;
+  V=Htable[4];
+  Htable[5].hi  = V.hi^Htable[1].hi, Htable[5].lo  = V.lo^Htable[1].lo;
+  Htable[6].hi  = V.hi^Htable[2].hi, Htable[6].lo  = V.lo^Htable[2].lo;
+  Htable[7].hi  = V.hi^Htable[3].hi, Htable[7].lo  = V.lo^Htable[3].lo;
+  V=Htable[8];
+  Htable[9].hi  = V.hi^Htable[1].hi, Htable[9].lo  = V.lo^Htable[1].lo;
+  Htable[10].hi = V.hi^Htable[2].hi, Htable[10].lo = V.lo^Htable[2].lo;
+  Htable[11].hi = V.hi^Htable[3].hi, Htable[11].lo = V.lo^Htable[3].lo;
+  Htable[12].hi = V.hi^Htable[4].hi, Htable[12].lo = V.lo^Htable[4].lo;
+  Htable[13].hi = V.hi^Htable[5].hi, Htable[13].lo = V.lo^Htable[5].lo;
+  Htable[14].hi = V.hi^Htable[6].hi, Htable[14].lo = V.lo^Htable[6].lo;
+  Htable[15].hi = V.hi^Htable[7].hi, Htable[15].lo = V.lo^Htable[7].lo;
+}
+
+
+#if !AESGCM_ASM
+#define PACK(s)   ((size_t)(s)<<(sizeof(size_t)*8-16))
+static const size_t rem_4bit[16] = {
+  PACK(0x0000), PACK(0x1C20), PACK(0x3840), PACK(0x2460),
+  PACK(0x7080), PACK(0x6CA0), PACK(0x48C0), PACK(0x54E0),
+  PACK(0xE100), PACK(0xFD20), PACK(0xD940), PACK(0xC560),
+  PACK(0x9180), PACK(0x8DA0), PACK(0xA9C0), PACK(0xB5E0)};
+
+void gcm_gmult_4bit(uint64 Xi[2], const aesgcm_u128 Htable[16]) {
+  aesgcm_u128 Z;
+  int cnt = 15;
+  size_t rem, nlo, nhi;
+
+  nlo  = ((const uint8 *)Xi)[15];
+  nhi  = nlo>>4;
+  nlo &= 0xf;
+
+  Z.hi = Htable[nlo].hi;
+  Z.lo = Htable[nlo].lo;
+
+  while (1) {
+    rem  = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nhi].hi;
+    Z.lo ^= Htable[nhi].lo;
+
+    if (--cnt<0)    break;
+
+    nlo  = ((const uint8 *)Xi)[cnt];
+    nhi  = nlo>>4;
+    nlo &= 0xf;
+
+    rem  = (size_t)Z.lo&0xf;
+    Z.lo = (Z.hi<<60)|(Z.lo>>4);
+    Z.hi = (Z.hi>>4);
+    if (sizeof(size_t)==8)
+      Z.hi ^= rem_4bit[rem];
+    else
+      Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+    Z.hi ^= Htable[nlo].hi;
+    Z.lo ^= Htable[nlo].lo;
+  }
+  Xi[0] = ToBE64(Z.hi);
+  Xi[1] = ToBE64(Z.lo);
+}
+
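+// |len| must be a non-zero multiple of 16: the loop below consumes one
+// 16-byte block per iteration and stops only when |len| reaches exactly 0.
+void gcm_ghash_4bit(uint64 Xi[2],const aesgcm_u128 Htable[16], const uint8 *inp,size_t len) {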
+    aesgcm_u128 Z;
+    int cnt;
+    size_t rem, nlo, nhi;
+
+    do {
+      cnt  = 15;
+      nlo  = ((const uint8 *)Xi)[15];
+      nlo ^= inp[15];
+      nhi  = nlo>>4;
+      nlo &= 0xf;
+
+      Z.hi = Htable[nlo].hi;
+      Z.lo = Htable[nlo].lo;
+
+      while (1) {
+        rem  = (size_t)Z.lo&0xf;
+        Z.lo = (Z.hi<<60)|(Z.lo>>4);
+        Z.hi = (Z.hi>>4);
+        if (sizeof(size_t)==8)
+          Z.hi ^= rem_4bit[rem];
+        else
+          Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+        Z.hi ^= Htable[nhi].hi;
+        Z.lo ^= Htable[nhi].lo;
+
+        if (--cnt<0)    break;
+
+        nlo  = ((const uint8 *)Xi)[cnt];
+        nlo ^= inp[cnt];
+        nhi  = nlo>>4;
+        nlo &= 0xf;
+
+        rem  = (size_t)Z.lo&0xf;
+        Z.lo = (Z.hi<<60)|(Z.lo>>4);
+        Z.hi = (Z.hi>>4);
+        if (sizeof(size_t)==8)
+          Z.hi ^= rem_4bit[rem];
+        else
+          Z.hi ^= (uint64)rem_4bit[rem]<<32;
+
+        Z.hi ^= Htable[nlo].hi;
+        Z.lo ^= Htable[nlo].lo;
+      }
+      Xi[0] = ToBE64(Z.hi);
+      Xi[1] = ToBE64(Z.lo);
+
+    } while (inp+=16, len-=16);
+}
+#endif
+
+void CRYPTO_gcm128_init(AesGcm128StaticContext *ctx, const uint8 *key, int key_size) {
+  memset(ctx,0,sizeof(*ctx));
+  ctx->use_aesni_gcm_crypt = X86_PCAP_MOVBE;
+  aesni_set_encrypt_key(key, key_size, &ctx->aes);
+  aesni_encrypt(ctx->H.c,ctx->H.c, &ctx->aes);
+  ctx->H.u[0] = ToBE64(ctx->H.u[0]);
+  ctx->H.u[1] = ToBE64(ctx->H.u[1]);
+  if (X86_PCAP_AVX) {
+    gcm_init_avx(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_avx;
+    ctx->ghash = gcm_ghash_avx;
+  } else if (X86_PCAP_PCLMULQDQ) {
+    gcm_init_clmul(ctx->Htable,ctx->H.u);
+    ctx->gmult = gcm_gmult_clmul;
+    ctx->ghash = gcm_ghash_clmul;
+  } else {
+    gcm_init_4bit(ctx->Htable, ctx->H.u);
+    ctx->gmult = gcm_gmult_4bit;
+    ctx->ghash = gcm_ghash_4bit;
+  }
+}
+
+void CRYPTO_gcm128_setiv(AesGcm128TempContext *ctx, AesGcm128StaticContext *sctx, const unsigned char *iv, size_t len) {
+  unsigned int ctr;
+  void (*gcm_gmult_p)(uint64 Xi[2],const aesgcm_u128 Htable[16])  = sctx->gmult;
+
+  ctx->sctx = sctx;
+  ctx->Yi.u[0]  = 0;
+  ctx->Yi.u[1]  = 0;
+  ctx->Xi.u[0]  = 0;
+  ctx->Xi.u[1]  = 0;
+  ctx->len.u[0] = 0;  /* AAD length */
+  ctx->len.u[1] = 0;  /* message length */
+  ctx->ares = 0;
+  ctx->mres = 0;
+
+  if (len==12) {
+    memcpy(ctx->Yi.c,iv,12);
+    ctx->Yi.c[15]=1;
+    ctr=1;
+  } else {
+    size_t i;
+    uint64 len0 = len;
+
+    while (len>=16) {
+      for (i=0; i<16; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+      iv += 16;
+      len -= 16;
+    }
+    if (len) {
+      for (i=0; i<len; ++i) ctx->Yi.c[i] ^= iv[i];
+      GCM_MUL(ctx,Yi);
+    }
+    len0 <<= 3;
+    ctx->Yi.u[1]  ^= ToBE64(len0);
+
+    GCM_MUL(ctx,Yi);
+
+    ctr = ToBE32(ctx->Yi.d[3]);
+  }
+
+  aesni_encrypt(ctx->Yi.c, ctx->EK0.c, &sctx->aes);
+  ++ctr;
+  ctx->Yi.d[3] = ToBE32(ctr);
+}
+
+union AesGcmIV {
+  uint32 nonce[3];
+  uint8 nonceb[12];
+};
+
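+// Encrypts |src_len| bytes from |src| into |dst| and appends the 16-byte tag,
+// so |dst| must have room for |src_len| + 16 bytes.
+void aesgcm_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,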
+                    const uint8 *ad, const size_t ad_len,
+                    const uint64 nonce, AesGcm128StaticContext *sctx) {
+  AesGcm128TempContext ctx;
+  AesGcmIV iv;
+
+  WriteLE64(iv.nonce, nonce);
+  iv.nonce[2] = 0;
+
+  CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv));
+  CRYPTO_gcm128_aad(&ctx, ad, ad_len);
+  CRYPTO_gcm128_encrypt_ctr32(&ctx, src, dst, src_len);
+  CRYPTO_gcm128_finish(&ctx, dst + src_len, 16);
+}
+
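+// Decrypts |src_len| bytes from |src| into |dst| and writes the computed tag
+// to |mac|. This function does not verify the tag; the caller must compare
+// |mac| against the received tag before trusting |dst|.
+void aesgcm_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,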
+                            const uint8 *ad, const size_t ad_len,
+                            const uint64 nonce, AesGcm128StaticContext *sctx,
+                            uint8 mac[16]) {
+  AesGcm128TempContext ctx;
+  AesGcmIV iv;
+
+  WriteLE64(iv.nonce, nonce);
+  iv.nonce[2] = 0;
+
+  CRYPTO_gcm128_setiv(&ctx, sctx, iv.nonceb, sizeof(iv));
+  CRYPTO_gcm128_aad(&ctx, ad, ad_len);
+  CRYPTO_gcm128_decrypt_ctr32(&ctx, src, dst, src_len);
+  CRYPTO_gcm128_finish(&ctx, mac, 16);
+}
+
+#if 1
+
+/*
+* GCM test vectors from:
+*
+* http://csrc.nist.gov/groups/STM/cavp/documents/mac/gcmtestvectors.zip
+*/
+#define MAX_TESTS   6
+
+static int key_index[MAX_TESTS] =
+{ 0, 0, 1, 1, 1, 1 };
+
+static uint8 key[MAX_TESTS][32] =
+{
+  { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+  { 0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c,
+  0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08,
+  0xfe, 0xff, 0xe9, 0x92, 0x86, 0x65, 0x73, 0x1c,
+  0x6d, 0x6a, 0x8f, 0x94, 0x67, 0x30, 0x83, 0x08 },  
+};
+
+static size_t iv_len[MAX_TESTS] =
+{ 12, 12, 12, 12, 8, 60 };
+
+static int iv_index[MAX_TESTS] =
+{ 0, 0, 1, 1, 1, 2 };
+
+static uint8 iv[MAX_TESTS][64] =
+{
+  { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x00, 0x00, 0x00, 0x00 },
+  { 0xca, 0xfe, 0xba, 0xbe, 0xfa, 0xce, 0xdb, 0xad,
+  0xde, 0xca, 0xf8, 0x88 },
+  { 0x93, 0x13, 0x22, 0x5d, 0xf8, 0x84, 0x06, 0xe5,
+  0x55, 0x90, 0x9c, 0x5a, 0xff, 0x52, 0x69, 0xaa, 
+  0x6a, 0x7a, 0x95, 0x38, 0x53, 0x4f, 0x7d, 0xa1,
+  0xe4, 0xc3, 0x03, 0xd2, 0xa3, 0x18, 0xa7, 0x28, 
+  0xc3, 0xc0, 0xc9, 0x51, 0x56, 0x80, 0x95, 0x39,
+  0xfc, 0xf0, 0xe2, 0x42, 0x9a, 0x6b, 0x52, 0x54, 
+  0x16, 0xae, 0xdb, 0xf5, 0xa0, 0xde, 0x6a, 0x57,
+  0xa6, 0x37, 0xb3, 0x9b }, 
+};
+
+static size_t add_len[MAX_TESTS] =
+{ 0, 0, 0, 20, 20, 20 };
+
+static int add_index[MAX_TESTS] =
+{ 0, 0, 0, 1, 1, 1 };
+
+static uint8 additional[MAX_TESTS][64] =
+{
+  { 0x00 },
+  { 0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef,
+  0xfe, 0xed, 0xfa, 0xce, 0xde, 0xad, 0xbe, 0xef, 
+  0xab, 0xad, 0xda, 0xd2 },
+};
+
+static size_t pt_len[MAX_TESTS] =
+{ 0, 16, 64, 60, 60, 60 };
+
+static int pt_index[MAX_TESTS] =
+{ 0, 0, 1, 1, 1, 1 };
+
+static uint8 pt[MAX_TESTS][64] =
+{
+  { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
+  { 0xd9, 0x31, 0x32, 0x25, 0xf8, 0x84, 0x06, 0xe5,
+  0xa5, 0x59, 0x09, 0xc5, 0xaf, 0xf5, 0x26, 0x9a,
+  0x86, 0xa7, 0xa9, 0x53, 0x15, 0x34, 0xf7, 0xda,
+  0x2e, 0x4c, 0x30, 0x3d, 0x8a, 0x31, 0x8a, 0x72,
+  0x1c, 0x3c, 0x0c, 0x95, 0x95, 0x68, 0x09, 0x53,
+  0x2f, 0xcf, 0x0e, 0x24, 0x49, 0xa6, 0xb5, 0x25,
+  0xb1, 0x6a, 0xed, 0xf5, 0xaa, 0x0d, 0xe6, 0x57,
+  0xba, 0x63, 0x7b, 0x39, 0x1a, 0xaf, 0xd2, 0x55 },
+};
+
+static uint8 ct[MAX_TESTS * 3][64] =
+{
+  { 0x00 },
+  { 0x03, 0x88, 0xda, 0xce, 0x60, 0xb6, 0xa3, 0x92,
+  0xf3, 0x28, 0xc2, 0xb9, 0x71, 0xb2, 0xfe, 0x78 },
+  { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24,
+  0x4b, 0x72, 0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, 
+  0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0,
+  0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, 
+  0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c,
+  0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, 
+  0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97,
+  0x3d, 0x58, 0xe0, 0x91, 0x47, 0x3f, 0x59, 0x85 },
+  { 0x42, 0x83, 0x1e, 0xc2, 0x21, 0x77, 0x74, 0x24,
+  0x4b, 0x72, 0x21, 0xb7, 0x84, 0xd0, 0xd4, 0x9c, 
+  0xe3, 0xaa, 0x21, 0x2f, 0x2c, 0x02, 0xa4, 0xe0,
+  0x35, 0xc1, 0x7e, 0x23, 0x29, 0xac, 0xa1, 0x2e, 
+  0x21, 0xd5, 0x14, 0xb2, 0x54, 0x66, 0x93, 0x1c,
+  0x7d, 0x8f, 0x6a, 0x5a, 0xac, 0x84, 0xaa, 0x05, 
+  0x1b, 0xa3, 0x0b, 0x39, 0x6a, 0x0a, 0xac, 0x97,
+  0x3d, 0x58, 0xe0, 0x91 },
+  { 0x61, 0x35, 0x3b, 0x4c, 0x28, 0x06, 0x93, 0x4a,
+  0x77, 0x7f, 0xf5, 0x1f, 0xa2, 0x2a, 0x47, 0x55, 
+  0x69, 0x9b, 0x2a, 0x71, 0x4f, 0xcd, 0xc6, 0xf8,
+  0x37, 0x66, 0xe5, 0xf9, 0x7b, 0x6c, 0x74, 0x23, 
+  0x73, 0x80, 0x69, 0x00, 0xe4, 0x9f, 0x24, 0xb2,
+  0x2b, 0x09, 0x75, 0x44, 0xd4, 0x89, 0x6b, 0x42, 
+  0x49, 0x89, 0xb5, 0xe1, 0xeb, 0xac, 0x0f, 0x07,
+  0xc2, 0x3f, 0x45, 0x98 },
+  { 0x8c, 0xe2, 0x49, 0x98, 0x62, 0x56, 0x15, 0xb6,
+  0x03, 0xa0, 0x33, 0xac, 0xa1, 0x3f, 0xb8, 0x94, 
+  0xbe, 0x91, 0x12, 0xa5, 0xc3, 0xa2, 0x11, 0xa8,
+  0xba, 0x26, 0x2a, 0x3c, 0xca, 0x7e, 0x2c, 0xa7, 
+  0x01, 0xe4, 0xa9, 0xa4, 0xfb, 0xa4, 0x3c, 0x90,
+  0xcc, 0xdc, 0xb2, 0x81, 0xd4, 0x8c, 0x7c, 0x6f, 
+  0xd6, 0x28, 0x75, 0xd2, 0xac, 0xa4, 0x17, 0x03,
+  0x4c, 0x34, 0xae, 0xe5 },
+  { 0x00 },
+  { 0x98, 0xe7, 0x24, 0x7c, 0x07, 0xf0, 0xfe, 0x41,
+  0x1c, 0x26, 0x7e, 0x43, 0x84, 0xb0, 0xf6, 0x00 }, 
+  { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41,
+  0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, 
+  0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84,
+  0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, 
+  0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25,
+  0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, 
+  0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9,
+  0xcc, 0xda, 0x27, 0x10, 0xac, 0xad, 0xe2, 0x56 },
+  { 0x39, 0x80, 0xca, 0x0b, 0x3c, 0x00, 0xe8, 0x41,
+  0xeb, 0x06, 0xfa, 0xc4, 0x87, 0x2a, 0x27, 0x57, 
+  0x85, 0x9e, 0x1c, 0xea, 0xa6, 0xef, 0xd9, 0x84,
+  0x62, 0x85, 0x93, 0xb4, 0x0c, 0xa1, 0xe1, 0x9c, 
+  0x7d, 0x77, 0x3d, 0x00, 0xc1, 0x44, 0xc5, 0x25, 
+  0xac, 0x61, 0x9d, 0x18, 0xc8, 0x4a, 0x3f, 0x47, 
+  0x18, 0xe2, 0x44, 0x8b, 0x2f, 0xe3, 0x24, 0xd9,
+  0xcc, 0xda, 0x27, 0x10 }, 
+  { 0x0f, 0x10, 0xf5, 0x99, 0xae, 0x14, 0xa1, 0x54,
+  0xed, 0x24, 0xb3, 0x6e, 0x25, 0x32, 0x4d, 0xb8, 
+  0xc5, 0x66, 0x63, 0x2e, 0xf2, 0xbb, 0xb3, 0x4f,
+  0x83, 0x47, 0x28, 0x0f, 0xc4, 0x50, 0x70, 0x57, 
+  0xfd, 0xdc, 0x29, 0xdf, 0x9a, 0x47, 0x1f, 0x75,
+  0xc6, 0x65, 0x41, 0xd4, 0xd4, 0xda, 0xd1, 0xc9, 
+  0xe9, 0x3a, 0x19, 0xa5, 0x8e, 0x8b, 0x47, 0x3f,
+  0xa0, 0xf0, 0x62, 0xf7 }, 
+  { 0xd2, 0x7e, 0x88, 0x68, 0x1c, 0xe3, 0x24, 0x3c,
+  0x48, 0x30, 0x16, 0x5a, 0x8f, 0xdc, 0xf9, 0xff, 
+  0x1d, 0xe9, 0xa1, 0xd8, 0xe6, 0xb4, 0x47, 0xef,
+  0x6e, 0xf7, 0xb7, 0x98, 0x28, 0x66, 0x6e, 0x45, 
+  0x81, 0xe7, 0x90, 0x12, 0xaf, 0x34, 0xdd, 0xd9,
+  0xe2, 0xf0, 0x37, 0x58, 0x9b, 0x29, 0x2d, 0xb3, 
+  0xe6, 0x7c, 0x03, 0x67, 0x45, 0xfa, 0x22, 0xe7,
+  0xe9, 0xb7, 0x37, 0x3b }, 
+  { 0x00 },
+  { 0xce, 0xa7, 0x40, 0x3d, 0x4d, 0x60, 0x6b, 0x6e, 
+  0x07, 0x4e, 0xc5, 0xd3, 0xba, 0xf3, 0x9d, 0x18 }, 
+  { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, 
+  0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d, 
+  0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, 
+  0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, 
+  0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, 
+  0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 0x38, 
+  0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, 
+  0xbc, 0xc9, 0xf6, 0x62, 0x89, 0x80, 0x15, 0xad }, 
+  { 0x52, 0x2d, 0xc1, 0xf0, 0x99, 0x56, 0x7d, 0x07, 
+  0xf4, 0x7f, 0x37, 0xa3, 0x2a, 0x84, 0x42, 0x7d,  
+  0x64, 0x3a, 0x8c, 0xdc, 0xbf, 0xe5, 0xc0, 0xc9, 
+  0x75, 0x98, 0xa2, 0xbd, 0x25, 0x55, 0xd1, 0xaa, 
+  0x8c, 0xb0, 0x8e, 0x48, 0x59, 0x0d, 0xbb, 0x3d, 
+  0xa7, 0xb0, 0x8b, 0x10, 0x56, 0x82, 0x88, 0x38, 
+  0xc5, 0xf6, 0x1e, 0x63, 0x93, 0xba, 0x7a, 0x0a, 
+  0xbc, 0xc9, 0xf6, 0x62 }, 
+  { 0xc3, 0x76, 0x2d, 0xf1, 0xca, 0x78, 0x7d, 0x32,
+  0xae, 0x47, 0xc1, 0x3b, 0xf1, 0x98, 0x44, 0xcb, 
+  0xaf, 0x1a, 0xe1, 0x4d, 0x0b, 0x97, 0x6a, 0xfa,
+  0xc5, 0x2f, 0xf7, 0xd7, 0x9b, 0xba, 0x9d, 0xe0, 
+  0xfe, 0xb5, 0x82, 0xd3, 0x39, 0x34, 0xa4, 0xf0,
+  0x95, 0x4c, 0xc2, 0x36, 0x3b, 0xc7, 0x3f, 0x78, 
+  0x62, 0xac, 0x43, 0x0e, 0x64, 0xab, 0xe4, 0x99,
+  0xf4, 0x7c, 0x9b, 0x1f }, 
+  { 0x5a, 0x8d, 0xef, 0x2f, 0x0c, 0x9e, 0x53, 0xf1,
+  0xf7, 0x5d, 0x78, 0x53, 0x65, 0x9e, 0x2a, 0x20, 
+  0xee, 0xb2, 0xb2, 0x2a, 0xaf, 0xde, 0x64, 0x19,
+  0xa0, 0x58, 0xab, 0x4f, 0x6f, 0x74, 0x6b, 0xf4, 
+  0x0f, 0xc0, 0xc3, 0xb7, 0x80, 0xf2, 0x44, 0x45,
+  0x2d, 0xa3, 0xeb, 0xf1, 0xc5, 0xd8, 0x2c, 0xde, 
+  0xa2, 0x41, 0x89, 0x97, 0x20, 0x0e, 0xf8, 0x2e,
+  0x44, 0xae, 0x7e, 0x3f }, 
+};
+
+static uint8 tag[MAX_TESTS * 3][16] =
+{
+  { 0x58, 0xe2, 0xfc, 0xce, 0xfa, 0x7e, 0x30, 0x61,
+  0x36, 0x7f, 0x1d, 0x57, 0xa4, 0xe7, 0x45, 0x5a },
+  { 0xab, 0x6e, 0x47, 0xd4, 0x2c, 0xec, 0x13, 0xbd,
+  0xf5, 0x3a, 0x67, 0xb2, 0x12, 0x57, 0xbd, 0xdf },
+  { 0x4d, 0x5c, 0x2a, 0xf3, 0x27, 0xcd, 0x64, 0xa6,
+  0x2c, 0xf3, 0x5a, 0xbd, 0x2b, 0xa6, 0xfa, 0xb4 }, 
+  { 0x5b, 0xc9, 0x4f, 0xbc, 0x32, 0x21, 0xa5, 0xdb,
+  0x94, 0xfa, 0xe9, 0x5a, 0xe7, 0x12, 0x1a, 0x47 },
+  { 0x36, 0x12, 0xd2, 0xe7, 0x9e, 0x3b, 0x07, 0x85,
+  0x56, 0x1b, 0xe1, 0x4a, 0xac, 0xa2, 0xfc, 0xcb },
+  { 0x61, 0x9c, 0xc5, 0xae, 0xff, 0xfe, 0x0b, 0xfa,
+  0x46, 0x2a, 0xf4, 0x3c, 0x16, 0x99, 0xd0, 0x50 },
+  { 0xcd, 0x33, 0xb2, 0x8a, 0xc7, 0x73, 0xf7, 0x4b,
+  0xa0, 0x0e, 0xd1, 0xf3, 0x12, 0x57, 0x24, 0x35 },
+  { 0x2f, 0xf5, 0x8d, 0x80, 0x03, 0x39, 0x27, 0xab,
+  0x8e, 0xf4, 0xd4, 0x58, 0x75, 0x14, 0xf0, 0xfb }, 
+  { 0x99, 0x24, 0xa7, 0xc8, 0x58, 0x73, 0x36, 0xbf,
+  0xb1, 0x18, 0x02, 0x4d, 0xb8, 0x67, 0x4a, 0x14 },
+  { 0x25, 0x19, 0x49, 0x8e, 0x80, 0xf1, 0x47, 0x8f,
+  0x37, 0xba, 0x55, 0xbd, 0x6d, 0x27, 0x61, 0x8c }, 
+  { 0x65, 0xdc, 0xc5, 0x7f, 0xcf, 0x62, 0x3a, 0x24,
+  0x09, 0x4f, 0xcc, 0xa4, 0x0d, 0x35, 0x33, 0xf8 }, 
+  { 0xdc, 0xf5, 0x66, 0xff, 0x29, 0x1c, 0x25, 0xbb,
+  0xb8, 0x56, 0x8f, 0xc3, 0xd3, 0x76, 0xa6, 0xd9 }, 
+  { 0x53, 0x0f, 0x8a, 0xfb, 0xc7, 0x45, 0x36, 0xb9,
+  0xa9, 0x63, 0xb4, 0xf1, 0xc4, 0xcb, 0x73, 0x8b }, 
+  { 0xd0, 0xd1, 0xc8, 0xa7, 0x99, 0x99, 0x6b, 0xf0,
+  0x26, 0x5b, 0x98, 0xb5, 0xd4, 0x8a, 0xb9, 0x19 }, 
+  { 0xb0, 0x94, 0xda, 0xc5, 0xd9, 0x34, 0x71, 0xbd,
+  0xec, 0x1a, 0x50, 0x22, 0x70, 0xe3, 0xcc, 0x6c }, 
+  { 0x76, 0xfc, 0x6e, 0xce, 0x0f, 0x4e, 0x17, 0x68,
+  0xcd, 0xdf, 0x88, 0x53, 0xbb, 0x2d, 0x55, 0x1b }, 
+  { 0x3a, 0x33, 0x7d, 0xbf, 0x46, 0xa7, 0x92, 0xc4,
+  0x5e, 0x45, 0x49, 0x13, 0xfe, 0x2e, 0xa8, 0xf2 }, 
+  { 0xa4, 0x4a, 0x82, 0x66, 0xee, 0x1c, 0x8e, 0xb0,
+  0xc8, 0xb5, 0xd4, 0xcf, 0x5a, 0xe9, 0xf1, 0x9a }, 
+};
+
+int gcm_self_test()
+{
+  uint8 buf[64];
+  uint8 tag_buf[16];
+  int i, j;
+
+  AesGcm128TempContext ctx;
+  AesGcm128StaticContext sctx;
+
+
+  {
+    AesContext aes;
+    uint8  key[16] = {43,126,21,22,40,174,210,166,171,247,21,136,9,207,79,60};
+    uint8   in[16] = {107,193,190,226,46,64,159,150,233,61,126,17,115,147,23,42};
+    uint8   out[16] = {58,215,123,180,13,122,54,96,168,158,202,243,36,102,239,151}, t[16];
+    aesni_set_encrypt_key(key, 128, &aes);
+    aesni_encrypt(in, t, &aes);
+    if (memcmp(t, out,16)) { printf("AES test fail!\n"); return 1; }
+    aesni_set_decrypt_key(key, 128, &aes);
+    aesni_decrypt(out, t, &aes);
+    if (memcmp(t, in,16)) { printf("AES test fail!\n"); return 1; }
+  }
+
+  uint8 correct[] = { 62,85,184,249,224,220,4,77,201,216,202,172,121,7,25,200, };
+  if (0) {
+    uint8 buf[512 + 16];
+    for (size_t i = 0; i < 512; i++)
+      buf[i] = (uint8)(i >> 4);// 0x11;
+    uint8 buf2[512 + 16];
+    for (size_t i = 0; i < 512; i++)
+      buf2[i] = buf[i];
+
+    size_t pp = 0x60;
+
+    CRYPTO_gcm128_init(&sctx, key[0], 128);
+    
+    sctx.use_aesni_gcm_crypt = 1;
+
+    aesgcm_decrypt_get_mac(buf, buf, pp, NULL, 0, 1, &sctx, buf + pp);
+    sctx.use_aesni_gcm_crypt = 0;
+    aesgcm_decrypt_get_mac(buf2, buf2, pp, NULL, 0, 1, &sctx, buf2 + pp);
+    //aesgcm_encrypt(buf, buf, 0x120 + 32, NULL, 0, 1, &sctx);
+
+    for (size_t i = 0; i < 16; i++)
+      printf("%d,", buf[pp + i]);
+    printf("\n");
+    for (size_t i = 0; i < 16; i++)
+      printf("%d,", buf2[pp + i]);
+    printf("\n");
+
+    if (memcmp(buf2 + pp, buf + pp, 16) == 0)
+      printf("CORRECT!!\n");
+    else
+      printf("******** FAIL ************\n");
+//    for(size_t i = 0; i < 16; i++)
+//      printf("%d,", buf[pp +i]);
+    printf("\n");
+  }
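+  // NOTE: this unconditional return makes the test-vector loop below
+  // unreachable; only the single-block AES check above is executed.
+  return 0;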
+
+  for( j = 0; j < 3; j++ ) {
+    int key_len = 128 + 64 * j;
+    for( i = 0; i < MAX_TESTS; i++ ) {
+      CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len);
+      CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]);
+      CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]);
+      CRYPTO_gcm128_encrypt_ctr32(&ctx, pt[pt_index[i]], buf, pt_len[i]);
+      CRYPTO_gcm128_finish(&ctx, tag_buf, 16);
+      if(memcmp( buf, ct[j * 6 + i], pt_len[i] ) != 0 ||
+         memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) {
+        printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "enc"  );
+        return( 1 );
+      }
+
+      CRYPTO_gcm128_init(&sctx, key[key_index[i]], key_len);
+      CRYPTO_gcm128_setiv(&ctx, &sctx, iv[iv_index[i]], iv_len[i]);
+      CRYPTO_gcm128_aad(&ctx, additional[add_index[i]], add_len[i]);
+      CRYPTO_gcm128_decrypt_ctr32(&ctx, ct[j * 6 + i], buf, pt_len[i]);
+      CRYPTO_gcm128_finish(&ctx, tag_buf, 16);
+      if(memcmp( buf, pt[pt_index[i]], pt_len[i] ) != 0 ||
+         memcmp( tag_buf, tag[j * 6 + i], 16 ) != 0 ) {
+        printf( "AES-GCM-%3d #%d (%s): failed\n", key_len, i, "dec"  );
+        return( 1 );
+      }
+    }
+  }
+
+  return( 0 );
+}
+
+//int main() {
+//  gcm_self_test();
+//}
+#endif
+
+#endif  // #if WITH_AESGCM
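The IV handling in `aesgcm_encrypt`, `aesgcm_decrypt_get_mac` and the `len==12` branch of `CRYPTO_gcm128_setiv` above can be sketched in isolation. This is a minimal illustration, not the project's code: `MakeIv` and `CounterBlock` are hypothetical helper names, and the byte loop stands in for `WriteLE64`, assuming it stores the value least-significant byte first.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: 12-byte IV made of the 64-bit nonce in little-endian
// order followed by four zero bytes (mirrors WriteLE64 + iv.nonce[2] = 0).
std::array<uint8_t, 12> MakeIv(uint64_t nonce) {
  std::array<uint8_t, 12> iv{};  // value-initialized, so bytes 8..11 stay 0
  for (int i = 0; i < 8; ++i)
    iv[i] = static_cast<uint8_t>(nonce >> (8 * i));
  return iv;
}

// Hypothetical helper: 16-byte counter block for a 12-byte IV, with the
// 32-bit counter stored big-endian in bytes 12..15.
std::array<uint8_t, 16> CounterBlock(const std::array<uint8_t, 12> &iv,
                                     uint32_t ctr) {
  std::array<uint8_t, 16> y{};
  std::memcpy(y.data(), iv.data(), iv.size());
  y[12] = static_cast<uint8_t>(ctr >> 24);
  y[13] = static_cast<uint8_t>(ctr >> 16);
  y[14] = static_cast<uint8_t>(ctr >> 8);
  y[15] = static_cast<uint8_t>(ctr);
  return y;
}
```

With these helpers, `CounterBlock(iv, 1)` corresponds to the block that `CRYPTO_gcm128_setiv` encrypts into `EK0` (the value XORed into the tag in `CRYPTO_gcm128_finish`), and `CounterBlock(iv, 2)` to the counter left in `Yi` for the first data block.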
--- a/crypto/aesgcm/aesni-gcm-x86_64.pl
+++ b/crypto/aesgcm/aesni-gcm-x86_64.pl
--- a/crypto/aesgcm/aesni-x86.pl
+++ b/crypto/aesgcm/aesni-x86.pl
--- a/crypto/aesgcm/aesni-x86_64.pl
+++ b/crypto/aesgcm/aesni-x86_64.pl
--- a/crypto/aesgcm/aesni_gcm_x64_gas.s
+++ b/crypto/aesgcm/aesni_gcm_x64_gas.s
@ -0,0 +1,831 @@
+.text	
+
+.type	_aesni_ctr32_ghash_6x,@function
+.align	32
+_aesni_ctr32_ghash_6x:
+.cfi_startproc	
+	vmovdqu	32(%r11),%xmm2
+	subq	$6,%rdx
+	vpxor	%xmm4,%xmm4,%xmm4
+	vmovdqu	0-128(%rcx),%xmm15
+	vpaddb	%xmm2,%xmm1,%xmm10
+	vpaddb	%xmm2,%xmm10,%xmm11
+	vpaddb	%xmm2,%xmm11,%xmm12
+	vpaddb	%xmm2,%xmm12,%xmm13
+	vpaddb	%xmm2,%xmm13,%xmm14
+	vpxor	%xmm15,%xmm1,%xmm9
+	vmovdqu	%xmm4,16+8(%rsp)
+	jmp	.Loop6x
+
+.align	32
+.Loop6x:
+	addl	$100663296,%ebx
+	jc	.Lhandle_ctr32
+	vmovdqu	0-32(%r9),%xmm3
+	vpaddb	%xmm2,%xmm14,%xmm1
+	vpxor	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm15,%xmm11,%xmm11
+
+.Lresume_ctr32:
+	vmovdqu	%xmm1,(%r8)
+	vpclmulqdq	$0x10,%xmm3,%xmm7,%xmm5
+	vpxor	%xmm15,%xmm12,%xmm12
+	vmovups	16-128(%rcx),%xmm2
+	vpclmulqdq	$0x01,%xmm3,%xmm7,%xmm6
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+	xorq	%r12,%r12
+	cmpq	%r14,%r15
+
+	vaesenc	%xmm2,%xmm9,%xmm9
+	vmovdqu	48+8(%rsp),%xmm0
+	vpxor	%xmm15,%xmm13,%xmm13
+	vpclmulqdq	$0x00,%xmm3,%xmm7,%xmm1
+	vaesenc	%xmm2,%xmm10,%xmm10
+	vpxor	%xmm15,%xmm14,%xmm14
+	setnc	%r12b
+	vpclmulqdq	$0x11,%xmm3,%xmm7,%xmm7
+	vaesenc	%xmm2,%xmm11,%xmm11
+	vmovdqu	16-32(%r9),%xmm3
+	negq	%r12
+	vaesenc	%xmm2,%xmm12,%xmm12
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm3,%xmm0,%xmm5
+	vpxor	%xmm4,%xmm8,%xmm8
+	vaesenc	%xmm2,%xmm13,%xmm13
+	vpxor	%xmm5,%xmm1,%xmm4
+	andq	$0x60,%r12
+	vmovups	32-128(%rcx),%xmm15
+	vpclmulqdq	$0x10,%xmm3,%xmm0,%xmm1
+	vaesenc	%xmm2,%xmm14,%xmm14
+
+	vpclmulqdq	$0x01,%xmm3,%xmm0,%xmm2
+	leaq	(%r14,%r12,1),%r14
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	16+8(%rsp),%xmm8,%xmm8
+	vpclmulqdq	$0x11,%xmm3,%xmm0,%xmm3
+	vmovdqu	64+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	88(%r14),%r13
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	80(%r14),%r12
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,32+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,40+8(%rsp)
+	vmovdqu	48-32(%r9),%xmm5
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	48-128(%rcx),%xmm15
+	vpxor	%xmm1,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm5,%xmm0,%xmm1
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm2,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm5,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm3,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm5,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpclmulqdq	$0x11,%xmm5,%xmm0,%xmm5
+	vmovdqu	80+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	%xmm1,%xmm4,%xmm4
+	vmovdqu	64-32(%r9),%xmm1
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	64-128(%rcx),%xmm15
+	vpxor	%xmm2,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm1,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm3,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm1,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	72(%r14),%r13
+	vpxor	%xmm5,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm1,%xmm0,%xmm5
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	64(%r14),%r12
+	vpclmulqdq	$0x11,%xmm1,%xmm0,%xmm1
+	vmovdqu	96+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,48+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,56+8(%rsp)
+	vpxor	%xmm2,%xmm4,%xmm4
+	vmovdqu	96-32(%r9),%xmm2
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	80-128(%rcx),%xmm15
+	vpxor	%xmm3,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm2,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm2,%xmm0,%xmm5
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	56(%r14),%r13
+	vpxor	%xmm1,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm2,%xmm0,%xmm1
+	vpxor	112+8(%rsp),%xmm8,%xmm8
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	48(%r14),%r12
+	vpclmulqdq	$0x11,%xmm2,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,64+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,72+8(%rsp)
+	vpxor	%xmm3,%xmm4,%xmm4
+	vmovdqu	112-32(%r9),%xmm3
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	96-128(%rcx),%xmm15
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm5
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm6,%xmm6
+	vpclmulqdq	$0x01,%xmm3,%xmm8,%xmm1
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	40(%r14),%r13
+	vpxor	%xmm2,%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm3,%xmm8,%xmm2
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	32(%r14),%r12
+	vpclmulqdq	$0x11,%xmm3,%xmm8,%xmm8
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,80+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,88+8(%rsp)
+	vpxor	%xmm5,%xmm6,%xmm6
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	%xmm1,%xmm6,%xmm6
+
+	vmovups	112-128(%rcx),%xmm15
+	vpslldq	$8,%xmm6,%xmm5
+	vpxor	%xmm2,%xmm4,%xmm4
+	vmovdqu	16(%r11),%xmm3
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm8,%xmm7,%xmm7
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm5,%xmm4,%xmm4
+	movbeq	24(%r14),%r13
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	16(%r14),%r12
+	vpalignr	$8,%xmm4,%xmm4,%xmm0
+	vpclmulqdq	$0x10,%xmm3,%xmm4,%xmm4
+	movq	%r13,96+8(%rsp)
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r12,104+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vmovups	128-128(%rcx),%xmm1
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vmovups	144-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vpsrldq	$8,%xmm6,%xmm6
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vpxor	%xmm6,%xmm7,%xmm7
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vpxor	%xmm0,%xmm4,%xmm4
+	movbeq	8(%r14),%r13
+	vaesenc	%xmm1,%xmm13,%xmm13
+	movbeq	0(%r14),%r12
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	160-128(%rcx),%xmm1
+	cmpl	$11,%ebp
+	jb	.Lenc_tail
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vaesenc	%xmm1,%xmm13,%xmm13
+	vmovups	176-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	192-128(%rcx),%xmm1
+	je	.Lenc_tail
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vaesenc	%xmm1,%xmm13,%xmm13
+	vmovups	208-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	224-128(%rcx),%xmm1
+	jmp	.Lenc_tail
+
+.align	32
+.Lhandle_ctr32:
+	vmovdqu	(%r11),%xmm0
+	vpshufb	%xmm0,%xmm1,%xmm6
+	vmovdqu	48(%r11),%xmm5
+	vpaddd	64(%r11),%xmm6,%xmm10
+	vpaddd	%xmm5,%xmm6,%xmm11
+	vmovdqu	0-32(%r9),%xmm3
+	vpaddd	%xmm5,%xmm10,%xmm12
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm11,%xmm13
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm15,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm12,%xmm14
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vpxor	%xmm15,%xmm11,%xmm11
+	vpaddd	%xmm5,%xmm13,%xmm1
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vpshufb	%xmm0,%xmm1,%xmm1
+	jmp	.Lresume_ctr32
+
+.align	32
+.Lenc_tail:
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vmovdqu	%xmm7,16+8(%rsp)
+	vpalignr	$8,%xmm4,%xmm4,%xmm8
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpclmulqdq	$0x10,%xmm3,%xmm4,%xmm4
+	vpxor	0(%rdi),%xmm1,%xmm2
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpxor	16(%rdi),%xmm1,%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vpxor	32(%rdi),%xmm1,%xmm5
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	48(%rdi),%xmm1,%xmm6
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	64(%rdi),%xmm1,%xmm7
+	vpxor	80(%rdi),%xmm1,%xmm3
+	vmovdqu	(%r8),%xmm1
+
+	vaesenclast	%xmm2,%xmm9,%xmm9
+	vmovdqu	32(%r11),%xmm2
+	vaesenclast	%xmm0,%xmm10,%xmm10
+	vpaddb	%xmm2,%xmm1,%xmm0
+	movq	%r13,112+8(%rsp)
+	leaq	96(%rdi),%rdi
+	vaesenclast	%xmm5,%xmm11,%xmm11
+	vpaddb	%xmm2,%xmm0,%xmm5
+	movq	%r12,120+8(%rsp)
+	leaq	96(%rsi),%rsi
+	vmovdqu	0-128(%rcx),%xmm15
+	vaesenclast	%xmm6,%xmm12,%xmm12
+	vpaddb	%xmm2,%xmm5,%xmm6
+	vaesenclast	%xmm7,%xmm13,%xmm13
+	vpaddb	%xmm2,%xmm6,%xmm7
+	vaesenclast	%xmm3,%xmm14,%xmm14
+	vpaddb	%xmm2,%xmm7,%xmm3
+
+	addq	$0x60,%r10
+	subq	$0x6,%rdx
+	jc	.L6x_done
+
+	vmovups	%xmm9,-96(%rsi)
+	vpxor	%xmm15,%xmm1,%xmm9
+	vmovups	%xmm10,-80(%rsi)
+	vmovdqa	%xmm0,%xmm10
+	vmovups	%xmm11,-64(%rsi)
+	vmovdqa	%xmm5,%xmm11
+	vmovups	%xmm12,-48(%rsi)
+	vmovdqa	%xmm6,%xmm12
+	vmovups	%xmm13,-32(%rsi)
+	vmovdqa	%xmm7,%xmm13
+	vmovups	%xmm14,-16(%rsi)
+	vmovdqa	%xmm3,%xmm14
+	vmovdqu	32+8(%rsp),%xmm7
+	jmp	.Loop6x
+
+.L6x_done:
+	vpxor	16+8(%rsp),%xmm8,%xmm8
+	vpxor	%xmm4,%xmm8,%xmm8
+
+	ret
+.cfi_endproc	
+.size	_aesni_ctr32_ghash_6x,.-_aesni_ctr32_ghash_6x
+.globl	aesni_gcm_decrypt
+.type	aesni_gcm_decrypt,@function
+.align	32
+aesni_gcm_decrypt:
+.cfi_startproc	
+	xorq	%r10,%r10
+
+
+
+	cmpq	$0x60,%rdx
+	jb	.Lgcm_dec_abort
+
+	leaq	(%rsp),%rax
+.cfi_def_cfa_register	%rax
+	pushq	%rbx
+.cfi_offset	%rbx,-16
+	pushq	%rbp
+.cfi_offset	%rbp,-24
+	pushq	%r12
+.cfi_offset	%r12,-32
+	pushq	%r13
+.cfi_offset	%r13,-40
+	pushq	%r14
+.cfi_offset	%r14,-48
+	pushq	%r15
+.cfi_offset	%r15,-56
+	vzeroupper
+
+	vmovdqu	(%r8),%xmm1
+	addq	$-128,%rsp
+	movl	12(%r8),%ebx
+	leaq	.Lbswap_mask(%rip),%r11
+	leaq	-128(%rcx),%r14
+	movq	$0xf80,%r15
+	vmovdqu	16(%r8),%xmm8
+	andq	$-128,%rsp
+	vmovdqu	(%r11),%xmm0
+	leaq	128(%rcx),%rcx
+	leaq	16+32(%r9),%r9
+	movl	240-128(%rcx),%ebp
+	vpshufb	%xmm0,%xmm8,%xmm8
+
+	andq	%r15,%r14
+	andq	%rsp,%r15
+	subq	%r14,%r15
+	jc	.Ldec_no_key_aliasing
+	cmpq	$768,%r15
+	jnc	.Ldec_no_key_aliasing
+	subq	%r15,%rsp
+.Ldec_no_key_aliasing:
+
+	vmovdqu	80(%rdi),%xmm7
+	leaq	(%rdi),%r14
+	vmovdqu	64(%rdi),%xmm4
+
+
+
+
+
+
+
+	leaq	-192(%rdi,%rdx,1),%r15
+
+	vmovdqu	48(%rdi),%xmm5
+	shrq	$4,%rdx
+	xorq	%r10,%r10
+	vmovdqu	32(%rdi),%xmm6
+	vpshufb	%xmm0,%xmm7,%xmm7
+	vmovdqu	16(%rdi),%xmm2
+	vpshufb	%xmm0,%xmm4,%xmm4
+	vmovdqu	(%rdi),%xmm3
+	vpshufb	%xmm0,%xmm5,%xmm5
+	vmovdqu	%xmm4,48(%rsp)
+	vpshufb	%xmm0,%xmm6,%xmm6
+	vmovdqu	%xmm5,64(%rsp)
+	vpshufb	%xmm0,%xmm2,%xmm2
+	vmovdqu	%xmm6,80(%rsp)
+	vpshufb	%xmm0,%xmm3,%xmm3
+	vmovdqu	%xmm2,96(%rsp)
+	vmovdqu	%xmm3,112(%rsp)
+
+	call	_aesni_ctr32_ghash_6x
+
+	vmovups	%xmm9,-96(%rsi)
+	vmovups	%xmm10,-80(%rsi)
+	vmovups	%xmm11,-64(%rsi)
+	vmovups	%xmm12,-48(%rsi)
+	vmovups	%xmm13,-32(%rsi)
+	vmovups	%xmm14,-16(%rsi)
+
+	vpshufb	(%r11),%xmm8,%xmm8
+	vmovdqu	%xmm8,16(%r8)
+
+	vzeroupper
+	movq	-48(%rax),%r15
+.cfi_restore	%r15
+	movq	-40(%rax),%r14
+.cfi_restore	%r14
+	movq	-32(%rax),%r13
+.cfi_restore	%r13
+	movq	-24(%rax),%r12
+.cfi_restore	%r12
+	movq	-16(%rax),%rbp
+.cfi_restore	%rbp
+	movq	-8(%rax),%rbx
+.cfi_restore	%rbx
+	leaq	(%rax),%rsp
+.cfi_def_cfa_register	%rsp
+.Lgcm_dec_abort:
+	movq	%r10,%rax
+	ret
+.cfi_endproc	
+.size	aesni_gcm_decrypt,.-aesni_gcm_decrypt
+.type	_aesni_ctr32_6x,@function
+.align	32
+_aesni_ctr32_6x:
+.cfi_startproc	
+	vmovdqu	0-128(%rcx),%xmm4
+	vmovdqu	32(%r11),%xmm2
+	leaq	-1(%rbp),%r13
+	vmovups	16-128(%rcx),%xmm15
+	leaq	32-128(%rcx),%r12
+	vpxor	%xmm4,%xmm1,%xmm9
+	addl	$100663296,%ebx
+	jc	.Lhandle_ctr32_2
+	vpaddb	%xmm2,%xmm1,%xmm10
+	vpaddb	%xmm2,%xmm10,%xmm11
+	vpxor	%xmm4,%xmm10,%xmm10
+	vpaddb	%xmm2,%xmm11,%xmm12
+	vpxor	%xmm4,%xmm11,%xmm11
+	vpaddb	%xmm2,%xmm12,%xmm13
+	vpxor	%xmm4,%xmm12,%xmm12
+	vpaddb	%xmm2,%xmm13,%xmm14
+	vpxor	%xmm4,%xmm13,%xmm13
+	vpaddb	%xmm2,%xmm14,%xmm1
+	vpxor	%xmm4,%xmm14,%xmm14
+	jmp	.Loop_ctr32
+
+.align	16
+.Loop_ctr32:
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vmovups	(%r12),%xmm15
+	leaq	16(%r12),%r12
+	decl	%r13d
+	jnz	.Loop_ctr32
+
+	vmovdqu	(%r12),%xmm3
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	0(%rdi),%xmm3,%xmm4
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	16(%rdi),%xmm3,%xmm5
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpxor	32(%rdi),%xmm3,%xmm6
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vpxor	48(%rdi),%xmm3,%xmm8
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	64(%rdi),%xmm3,%xmm2
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	80(%rdi),%xmm3,%xmm3
+	leaq	96(%rdi),%rdi
+
+	vaesenclast	%xmm4,%xmm9,%xmm9
+	vaesenclast	%xmm5,%xmm10,%xmm10
+	vaesenclast	%xmm6,%xmm11,%xmm11
+	vaesenclast	%xmm8,%xmm12,%xmm12
+	vaesenclast	%xmm2,%xmm13,%xmm13
+	vaesenclast	%xmm3,%xmm14,%xmm14
+	vmovups	%xmm9,0(%rsi)
+	vmovups	%xmm10,16(%rsi)
+	vmovups	%xmm11,32(%rsi)
+	vmovups	%xmm12,48(%rsi)
+	vmovups	%xmm13,64(%rsi)
+	vmovups	%xmm14,80(%rsi)
+	leaq	96(%rsi),%rsi
+
+	ret
+.align	32
+.Lhandle_ctr32_2:
+	vpshufb	%xmm0,%xmm1,%xmm6
+	vmovdqu	48(%r11),%xmm5
+	vpaddd	64(%r11),%xmm6,%xmm10
+	vpaddd	%xmm5,%xmm6,%xmm11
+	vpaddd	%xmm5,%xmm10,%xmm12
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm11,%xmm13
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm4,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm12,%xmm14
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vpxor	%xmm4,%xmm11,%xmm11
+	vpaddd	%xmm5,%xmm13,%xmm1
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vpxor	%xmm4,%xmm12,%xmm12
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vpxor	%xmm4,%xmm13,%xmm13
+	vpshufb	%xmm0,%xmm1,%xmm1
+	vpxor	%xmm4,%xmm14,%xmm14
+	jmp	.Loop_ctr32
+.cfi_endproc	
+.size	_aesni_ctr32_6x,.-_aesni_ctr32_6x
+
+.globl	aesni_gcm_encrypt
+.type	aesni_gcm_encrypt,@function
+.align	32
+aesni_gcm_encrypt:
+.cfi_startproc	
+	xorq	%r10,%r10
+
+
+
+
+	cmpq	$288,%rdx
+	jb	.Lgcm_enc_abort
+
+	leaq	(%rsp),%rax
+.cfi_def_cfa_register	%rax
+	pushq	%rbx
+.cfi_offset	%rbx,-16
+	pushq	%rbp
+.cfi_offset	%rbp,-24
+	pushq	%r12
+.cfi_offset	%r12,-32
+	pushq	%r13
+.cfi_offset	%r13,-40
+	pushq	%r14
+.cfi_offset	%r14,-48
+	pushq	%r15
+.cfi_offset	%r15,-56
+	vzeroupper
+
+	vmovdqu	(%r8),%xmm1
+	addq	$-128,%rsp
+	movl	12(%r8),%ebx
+	leaq	.Lbswap_mask(%rip),%r11
+	leaq	-128(%rcx),%r14
+	movq	$0xf80,%r15
+	leaq	128(%rcx),%rcx
+	vmovdqu	(%r11),%xmm0
+	andq	$-128,%rsp
+	movl	240-128(%rcx),%ebp
+
+	andq	%r15,%r14
+	andq	%rsp,%r15
+	subq	%r14,%r15
+	jc	.Lenc_no_key_aliasing
+	cmpq	$768,%r15
+	jnc	.Lenc_no_key_aliasing
+	subq	%r15,%rsp
+.Lenc_no_key_aliasing:
+
+	leaq	(%rsi),%r14
+
+
+
+
+
+
+
+
+	leaq	-192(%rsi,%rdx,1),%r15
+
+	shrq	$4,%rdx
+
+	call	_aesni_ctr32_6x
+
+	vpshufb	%xmm0,%xmm9,%xmm8
+	vpshufb	%xmm0,%xmm10,%xmm2
+	vmovdqu	%xmm8,112(%rsp)
+	vpshufb	%xmm0,%xmm11,%xmm4
+	vmovdqu	%xmm2,96(%rsp)
+	vpshufb	%xmm0,%xmm12,%xmm5
+	vmovdqu	%xmm4,80(%rsp)
+	vpshufb	%xmm0,%xmm13,%xmm6
+	vmovdqu	%xmm5,64(%rsp)
+	vpshufb	%xmm0,%xmm14,%xmm7
+	vmovdqu	%xmm6,48(%rsp)
+
+	call	_aesni_ctr32_6x
+
+	vmovdqu	16(%r8),%xmm8
+	leaq	16+32(%r9),%r9
+	subq	$12,%rdx
+	movq	$192,%r10
+	vpshufb	%xmm0,%xmm8,%xmm8
+
+	call	_aesni_ctr32_ghash_6x
+	vmovdqu	32(%rsp),%xmm7
+	vmovdqu	(%r11),%xmm0
+	vmovdqu	0-32(%r9),%xmm3
+	vpunpckhqdq	%xmm7,%xmm7,%xmm1
+	vmovdqu	32-32(%r9),%xmm15
+	vmovups	%xmm9,-96(%rsi)
+	vpshufb	%xmm0,%xmm9,%xmm9
+	vpxor	%xmm7,%xmm1,%xmm1
+	vmovups	%xmm10,-80(%rsi)
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vmovups	%xmm11,-64(%rsi)
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vmovups	%xmm12,-48(%rsi)
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vmovups	%xmm13,-32(%rsi)
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vmovups	%xmm14,-16(%rsi)
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vmovdqu	%xmm9,16(%rsp)
+	vmovdqu	48(%rsp),%xmm6
+	vmovdqu	16-32(%r9),%xmm0
+	vpunpckhqdq	%xmm6,%xmm6,%xmm2
+	vpclmulqdq	$0x00,%xmm3,%xmm7,%xmm5
+	vpxor	%xmm6,%xmm2,%xmm2
+	vpclmulqdq	$0x11,%xmm3,%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm15,%xmm1,%xmm1
+
+	vmovdqu	64(%rsp),%xmm9
+	vpclmulqdq	$0x00,%xmm0,%xmm6,%xmm4
+	vmovdqu	48-32(%r9),%xmm3
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm9,%xmm9,%xmm5
+	vpclmulqdq	$0x11,%xmm0,%xmm6,%xmm6
+	vpxor	%xmm9,%xmm5,%xmm5
+	vpxor	%xmm7,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm15,%xmm2,%xmm2
+	vmovdqu	80-32(%r9),%xmm15
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vmovdqu	80(%rsp),%xmm1
+	vpclmulqdq	$0x00,%xmm3,%xmm9,%xmm7
+	vmovdqu	64-32(%r9),%xmm0
+	vpxor	%xmm4,%xmm7,%xmm7
+	vpunpckhqdq	%xmm1,%xmm1,%xmm4
+	vpclmulqdq	$0x11,%xmm3,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm4,%xmm4
+	vpxor	%xmm6,%xmm9,%xmm9
+	vpclmulqdq	$0x00,%xmm15,%xmm5,%xmm5
+	vpxor	%xmm2,%xmm5,%xmm5
+
+	vmovdqu	96(%rsp),%xmm2
+	vpclmulqdq	$0x00,%xmm0,%xmm1,%xmm6
+	vmovdqu	96-32(%r9),%xmm3
+	vpxor	%xmm7,%xmm6,%xmm6
+	vpunpckhqdq	%xmm2,%xmm2,%xmm7
+	vpclmulqdq	$0x11,%xmm0,%xmm1,%xmm1
+	vpxor	%xmm2,%xmm7,%xmm7
+	vpxor	%xmm9,%xmm1,%xmm1
+	vpclmulqdq	$0x10,%xmm15,%xmm4,%xmm4
+	vmovdqu	128-32(%r9),%xmm15
+	vpxor	%xmm5,%xmm4,%xmm4
+
+	vpxor	112(%rsp),%xmm8,%xmm8
+	vpclmulqdq	$0x00,%xmm3,%xmm2,%xmm5
+	vmovdqu	112-32(%r9),%xmm0
+	vpunpckhqdq	%xmm8,%xmm8,%xmm9
+	vpxor	%xmm6,%xmm5,%xmm5
+	vpclmulqdq	$0x11,%xmm3,%xmm2,%xmm2
+	vpxor	%xmm8,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm2,%xmm2
+	vpclmulqdq	$0x00,%xmm15,%xmm7,%xmm7
+	vpxor	%xmm4,%xmm7,%xmm4
+
+	vpclmulqdq	$0x00,%xmm0,%xmm8,%xmm6
+	vmovdqu	0-32(%r9),%xmm3
+	vpunpckhqdq	%xmm14,%xmm14,%xmm1
+	vpclmulqdq	$0x11,%xmm0,%xmm8,%xmm8
+	vpxor	%xmm14,%xmm1,%xmm1
+	vpxor	%xmm5,%xmm6,%xmm5
+	vpclmulqdq	$0x10,%xmm15,%xmm9,%xmm9
+	vmovdqu	32-32(%r9),%xmm15
+	vpxor	%xmm2,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm9,%xmm6
+
+	vmovdqu	16-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm7,%xmm9
+	vpclmulqdq	$0x00,%xmm3,%xmm14,%xmm4
+	vpxor	%xmm9,%xmm6,%xmm6
+	vpunpckhqdq	%xmm13,%xmm13,%xmm2
+	vpclmulqdq	$0x11,%xmm3,%xmm14,%xmm14
+	vpxor	%xmm13,%xmm2,%xmm2
+	vpslldq	$8,%xmm6,%xmm9
+	vpclmulqdq	$0x00,%xmm15,%xmm1,%xmm1
+	vpxor	%xmm9,%xmm5,%xmm8
+	vpsrldq	$8,%xmm6,%xmm6
+	vpxor	%xmm6,%xmm7,%xmm7
+
+	vpclmulqdq	$0x00,%xmm0,%xmm13,%xmm5
+	vmovdqu	48-32(%r9),%xmm3
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpunpckhqdq	%xmm12,%xmm12,%xmm9
+	vpclmulqdq	$0x11,%xmm0,%xmm13,%xmm13
+	vpxor	%xmm12,%xmm9,%xmm9
+	vpxor	%xmm14,%xmm13,%xmm13
+	vpalignr	$8,%xmm8,%xmm8,%xmm14
+	vpclmulqdq	$0x10,%xmm15,%xmm2,%xmm2
+	vmovdqu	80-32(%r9),%xmm15
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vpclmulqdq	$0x00,%xmm3,%xmm12,%xmm4
+	vmovdqu	64-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm11,%xmm11,%xmm1
+	vpclmulqdq	$0x11,%xmm3,%xmm12,%xmm12
+	vpxor	%xmm11,%xmm1,%xmm1
+	vpxor	%xmm13,%xmm12,%xmm12
+	vxorps	16(%rsp),%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm15,%xmm9,%xmm9
+	vpxor	%xmm2,%xmm9,%xmm9
+
+	vpclmulqdq	$0x10,16(%r11),%xmm8,%xmm8
+	vxorps	%xmm14,%xmm8,%xmm8
+
+	vpclmulqdq	$0x00,%xmm0,%xmm11,%xmm5
+	vmovdqu	96-32(%r9),%xmm3
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpunpckhqdq	%xmm10,%xmm10,%xmm2
+	vpclmulqdq	$0x11,%xmm0,%xmm11,%xmm11
+	vpxor	%xmm10,%xmm2,%xmm2
+	vpalignr	$8,%xmm8,%xmm8,%xmm14
+	vpxor	%xmm12,%xmm11,%xmm11
+	vpclmulqdq	$0x10,%xmm15,%xmm1,%xmm1
+	vmovdqu	128-32(%r9),%xmm15
+	vpxor	%xmm9,%xmm1,%xmm1
+
+	vxorps	%xmm7,%xmm14,%xmm14
+	vpclmulqdq	$0x10,16(%r11),%xmm8,%xmm8
+	vxorps	%xmm14,%xmm8,%xmm8
+
+	vpclmulqdq	$0x00,%xmm3,%xmm10,%xmm4
+	vmovdqu	112-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm8,%xmm8,%xmm9
+	vpclmulqdq	$0x11,%xmm3,%xmm10,%xmm10
+	vpxor	%xmm8,%xmm9,%xmm9
+	vpxor	%xmm11,%xmm10,%xmm10
+	vpclmulqdq	$0x00,%xmm15,%xmm2,%xmm2
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vpclmulqdq	$0x00,%xmm0,%xmm8,%xmm5
+	vpclmulqdq	$0x11,%xmm0,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpclmulqdq	$0x10,%xmm15,%xmm9,%xmm6
+	vpxor	%xmm10,%xmm7,%xmm7
+	vpxor	%xmm2,%xmm6,%xmm6
+
+	vpxor	%xmm5,%xmm7,%xmm4
+	vpxor	%xmm4,%xmm6,%xmm6
+	vpslldq	$8,%xmm6,%xmm1
+	vmovdqu	16(%r11),%xmm3
+	vpsrldq	$8,%xmm6,%xmm6
+	vpxor	%xmm1,%xmm5,%xmm8
+	vpxor	%xmm6,%xmm7,%xmm7
+
+	vpalignr	$8,%xmm8,%xmm8,%xmm2
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm8
+	vpxor	%xmm2,%xmm8,%xmm8
+
+	vpalignr	$8,%xmm8,%xmm8,%xmm2
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm8
+	vpxor	%xmm7,%xmm2,%xmm2
+	vpxor	%xmm2,%xmm8,%xmm8
+	vpshufb	(%r11),%xmm8,%xmm8
+	vmovdqu	%xmm8,16(%r8)
+
+	vzeroupper
+	movq	-48(%rax),%r15
+.cfi_restore	%r15
+	movq	-40(%rax),%r14
+.cfi_restore	%r14
+	movq	-32(%rax),%r13
+.cfi_restore	%r13
+	movq	-24(%rax),%r12
+.cfi_restore	%r12
+	movq	-16(%rax),%rbp
+.cfi_restore	%rbp
+	movq	-8(%rax),%rbx
+.cfi_restore	%rbx
+	leaq	(%rax),%rsp
+.cfi_def_cfa_register	%rsp
+.Lgcm_enc_abort:
+	movq	%r10,%rax
+	ret
+.cfi_endproc	
+.size	aesni_gcm_encrypt,.-aesni_gcm_encrypt
+.align	64
+.Lbswap_mask:
+.byte	15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
+.Lpoly:
+.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2
+.Lone_msb:
+.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
+.Ltwo_lsb:
+.byte	2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+.Lone_lsb:
+.byte	1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+.byte	65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
+.align	64
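Note on the counter handling in the listing above: `addl $100663296,%ebx` followed by `jc .Lhandle_ctr32_2` appears to test whether the low byte of the big-endian 32-bit counter can absorb six more increments without carrying. The immediate equals 6 << 24, and `%ebx` was loaded little-endian from the last four bytes of the counter block, so the counter's least significant byte sits in the top byte of `%ebx`. The Python sketch below models that check under those assumptions; the function names are hypothetical and do not appear in the source.

```python
import struct

STEP = 6 << 24  # 100663296, the immediate used by `addl $100663296,%ebx`


def carry_after_six(counter_block: bytes) -> bool:
    """Model `movl 12(%r8),%ebx; addl $STEP,%ebx; jc ...` on a 16-byte block."""
    # Little-endian load of bytes 12..15, as the x86 `movl` does.
    (ebx,) = struct.unpack_from("<I", counter_block, 12)
    # The carry flag is set when the 32-bit addition overflows.
    return ebx + STEP > 0xFFFFFFFF


def low_byte_overflows(counter_block: bytes) -> bool:
    """Equivalent check stated directly on the counter's low byte."""
    return counter_block[15] + 6 > 0xFF
```

When the check is false, the byte-wise `vpaddb` with the constant at `32(%r11)` (`.Lone_msb`) is enough to step the counter. Otherwise the code takes the byte-swapped `vpaddd` path in `.Lhandle_ctr32_2`.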
--- a/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s
+++ b/crypto/aesgcm/aesni_gcm_x64_gas_macosx.s
@ -0,0 +1,831 @@
+.text	
+
+
+.p2align	5
+_aesni_ctr32_ghash_6x:
+
+	vmovdqu	32(%r11),%xmm2
+	subq	$6,%rdx
+	vpxor	%xmm4,%xmm4,%xmm4
+	vmovdqu	0-128(%rcx),%xmm15
+	vpaddb	%xmm2,%xmm1,%xmm10
+	vpaddb	%xmm2,%xmm10,%xmm11
+	vpaddb	%xmm2,%xmm11,%xmm12
+	vpaddb	%xmm2,%xmm12,%xmm13
+	vpaddb	%xmm2,%xmm13,%xmm14
+	vpxor	%xmm15,%xmm1,%xmm9
+	vmovdqu	%xmm4,16+8(%rsp)
+	jmp	L$oop6x
+
+.p2align	5
+L$oop6x:
+	addl	$100663296,%ebx
+	jc	L$handle_ctr32
+	vmovdqu	0-32(%r9),%xmm3
+	vpaddb	%xmm2,%xmm14,%xmm1
+	vpxor	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm15,%xmm11,%xmm11
+
+L$resume_ctr32:
+	vmovdqu	%xmm1,(%r8)
+	vpclmulqdq	$0x10,%xmm3,%xmm7,%xmm5
+	vpxor	%xmm15,%xmm12,%xmm12
+	vmovups	16-128(%rcx),%xmm2
+	vpclmulqdq	$0x01,%xmm3,%xmm7,%xmm6
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+	xorq	%r12,%r12
+	cmpq	%r14,%r15
+
+	vaesenc	%xmm2,%xmm9,%xmm9
+	vmovdqu	48+8(%rsp),%xmm0
+	vpxor	%xmm15,%xmm13,%xmm13
+	vpclmulqdq	$0x00,%xmm3,%xmm7,%xmm1
+	vaesenc	%xmm2,%xmm10,%xmm10
+	vpxor	%xmm15,%xmm14,%xmm14
+	setnc	%r12b
+	vpclmulqdq	$0x11,%xmm3,%xmm7,%xmm7
+	vaesenc	%xmm2,%xmm11,%xmm11
+	vmovdqu	16-32(%r9),%xmm3
+	negq	%r12
+	vaesenc	%xmm2,%xmm12,%xmm12
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm3,%xmm0,%xmm5
+	vpxor	%xmm4,%xmm8,%xmm8
+	vaesenc	%xmm2,%xmm13,%xmm13
+	vpxor	%xmm5,%xmm1,%xmm4
+	andq	$0x60,%r12
+	vmovups	32-128(%rcx),%xmm15
+	vpclmulqdq	$0x10,%xmm3,%xmm0,%xmm1
+	vaesenc	%xmm2,%xmm14,%xmm14
+
+	vpclmulqdq	$0x01,%xmm3,%xmm0,%xmm2
+	leaq	(%r14,%r12,1),%r14
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	16+8(%rsp),%xmm8,%xmm8
+	vpclmulqdq	$0x11,%xmm3,%xmm0,%xmm3
+	vmovdqu	64+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	88(%r14),%r13
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	80(%r14),%r12
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,32+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,40+8(%rsp)
+	vmovdqu	48-32(%r9),%xmm5
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	48-128(%rcx),%xmm15
+	vpxor	%xmm1,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm5,%xmm0,%xmm1
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm2,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm5,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm3,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm5,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpclmulqdq	$0x11,%xmm5,%xmm0,%xmm5
+	vmovdqu	80+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	%xmm1,%xmm4,%xmm4
+	vmovdqu	64-32(%r9),%xmm1
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	64-128(%rcx),%xmm15
+	vpxor	%xmm2,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm1,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm3,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm1,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	72(%r14),%r13
+	vpxor	%xmm5,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm1,%xmm0,%xmm5
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	64(%r14),%r12
+	vpclmulqdq	$0x11,%xmm1,%xmm0,%xmm1
+	vmovdqu	96+8(%rsp),%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,48+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,56+8(%rsp)
+	vpxor	%xmm2,%xmm4,%xmm4
+	vmovdqu	96-32(%r9),%xmm2
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	80-128(%rcx),%xmm15
+	vpxor	%xmm3,%xmm6,%xmm6
+	vpclmulqdq	$0x00,%xmm2,%xmm0,%xmm3
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm2,%xmm0,%xmm5
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	56(%r14),%r13
+	vpxor	%xmm1,%xmm7,%xmm7
+	vpclmulqdq	$0x01,%xmm2,%xmm0,%xmm1
+	vpxor	112+8(%rsp),%xmm8,%xmm8
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	48(%r14),%r12
+	vpclmulqdq	$0x11,%xmm2,%xmm0,%xmm2
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,64+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,72+8(%rsp)
+	vpxor	%xmm3,%xmm4,%xmm4
+	vmovdqu	112-32(%r9),%xmm3
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vmovups	96-128(%rcx),%xmm15
+	vpxor	%xmm5,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm5
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm6,%xmm6
+	vpclmulqdq	$0x01,%xmm3,%xmm8,%xmm1
+	vaesenc	%xmm15,%xmm10,%xmm10
+	movbeq	40(%r14),%r13
+	vpxor	%xmm2,%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm3,%xmm8,%xmm2
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	32(%r14),%r12
+	vpclmulqdq	$0x11,%xmm3,%xmm8,%xmm8
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r13,80+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	movq	%r12,88+8(%rsp)
+	vpxor	%xmm5,%xmm6,%xmm6
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	%xmm1,%xmm6,%xmm6
+
+	vmovups	112-128(%rcx),%xmm15
+	vpslldq	$8,%xmm6,%xmm5
+	vpxor	%xmm2,%xmm4,%xmm4
+	vmovdqu	16(%r11),%xmm3
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	%xmm8,%xmm7,%xmm7
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	%xmm5,%xmm4,%xmm4
+	movbeq	24(%r14),%r13
+	vaesenc	%xmm15,%xmm11,%xmm11
+	movbeq	16(%r14),%r12
+	vpalignr	$8,%xmm4,%xmm4,%xmm0
+	vpclmulqdq	$0x10,%xmm3,%xmm4,%xmm4
+	movq	%r13,96+8(%rsp)
+	vaesenc	%xmm15,%xmm12,%xmm12
+	movq	%r12,104+8(%rsp)
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vmovups	128-128(%rcx),%xmm1
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vmovups	144-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vpsrldq	$8,%xmm6,%xmm6
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vpxor	%xmm6,%xmm7,%xmm7
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vpxor	%xmm0,%xmm4,%xmm4
+	movbeq	8(%r14),%r13
+	vaesenc	%xmm1,%xmm13,%xmm13
+	movbeq	0(%r14),%r12
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	160-128(%rcx),%xmm1
+	cmpl	$11,%ebp
+	jb	L$enc_tail
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vaesenc	%xmm1,%xmm13,%xmm13
+	vmovups	176-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	192-128(%rcx),%xmm1
+	je	L$enc_tail
+
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+
+	vaesenc	%xmm1,%xmm9,%xmm9
+	vaesenc	%xmm1,%xmm10,%xmm10
+	vaesenc	%xmm1,%xmm11,%xmm11
+	vaesenc	%xmm1,%xmm12,%xmm12
+	vaesenc	%xmm1,%xmm13,%xmm13
+	vmovups	208-128(%rcx),%xmm15
+	vaesenc	%xmm1,%xmm14,%xmm14
+	vmovups	224-128(%rcx),%xmm1
+	jmp	L$enc_tail
+
+.p2align	5
+L$handle_ctr32:
+	vmovdqu	(%r11),%xmm0
+	vpshufb	%xmm0,%xmm1,%xmm6
+	vmovdqu	48(%r11),%xmm5
+	vpaddd	64(%r11),%xmm6,%xmm10
+	vpaddd	%xmm5,%xmm6,%xmm11
+	vmovdqu	0-32(%r9),%xmm3
+	vpaddd	%xmm5,%xmm10,%xmm12
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm11,%xmm13
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm15,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm12,%xmm14
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vpxor	%xmm15,%xmm11,%xmm11
+	vpaddd	%xmm5,%xmm13,%xmm1
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vpshufb	%xmm0,%xmm1,%xmm1
+	jmp	L$resume_ctr32
+
+.p2align	5
+L$enc_tail:
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vmovdqu	%xmm7,16+8(%rsp)
+	vpalignr	$8,%xmm4,%xmm4,%xmm8
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpclmulqdq	$0x10,%xmm3,%xmm4,%xmm4
+	vpxor	0(%rdi),%xmm1,%xmm2
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpxor	16(%rdi),%xmm1,%xmm0
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vpxor	32(%rdi),%xmm1,%xmm5
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	48(%rdi),%xmm1,%xmm6
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	64(%rdi),%xmm1,%xmm7
+	vpxor	80(%rdi),%xmm1,%xmm3
+	vmovdqu	(%r8),%xmm1
+
+	vaesenclast	%xmm2,%xmm9,%xmm9
+	vmovdqu	32(%r11),%xmm2
+	vaesenclast	%xmm0,%xmm10,%xmm10
+	vpaddb	%xmm2,%xmm1,%xmm0
+	movq	%r13,112+8(%rsp)
+	leaq	96(%rdi),%rdi
+	vaesenclast	%xmm5,%xmm11,%xmm11
+	vpaddb	%xmm2,%xmm0,%xmm5
+	movq	%r12,120+8(%rsp)
+	leaq	96(%rsi),%rsi
+	vmovdqu	0-128(%rcx),%xmm15
+	vaesenclast	%xmm6,%xmm12,%xmm12
+	vpaddb	%xmm2,%xmm5,%xmm6
+	vaesenclast	%xmm7,%xmm13,%xmm13
+	vpaddb	%xmm2,%xmm6,%xmm7
+	vaesenclast	%xmm3,%xmm14,%xmm14
+	vpaddb	%xmm2,%xmm7,%xmm3
+
+	addq	$0x60,%r10
+	subq	$0x6,%rdx
+	jc	L$6x_done
+
+	vmovups	%xmm9,-96(%rsi)
+	vpxor	%xmm15,%xmm1,%xmm9
+	vmovups	%xmm10,-80(%rsi)
+	vmovdqa	%xmm0,%xmm10
+	vmovups	%xmm11,-64(%rsi)
+	vmovdqa	%xmm5,%xmm11
+	vmovups	%xmm12,-48(%rsi)
+	vmovdqa	%xmm6,%xmm12
+	vmovups	%xmm13,-32(%rsi)
+	vmovdqa	%xmm7,%xmm13
+	vmovups	%xmm14,-16(%rsi)
+	vmovdqa	%xmm3,%xmm14
+	vmovdqu	32+8(%rsp),%xmm7
+	jmp	L$oop6x
+
+L$6x_done:
+	vpxor	16+8(%rsp),%xmm8,%xmm8
+	vpxor	%xmm4,%xmm8,%xmm8
+
+	ret
+
+
+.globl	_aesni_gcm_decrypt
+
+.p2align	5
+_aesni_gcm_decrypt:
+
+	xorq	%r10,%r10
+
+
+
+	cmpq	$0x60,%rdx
+	jb	L$gcm_dec_abort
+
+	leaq	(%rsp),%rax
+
+	pushq	%rbx
+
+	pushq	%rbp
+
+	pushq	%r12
+
+	pushq	%r13
+
+	pushq	%r14
+
+	pushq	%r15
+
+	vzeroupper
+
+	vmovdqu	(%r8),%xmm1
+	addq	$-128,%rsp
+	movl	12(%r8),%ebx
+	leaq	L$bswap_mask(%rip),%r11
+	leaq	-128(%rcx),%r14
+	movq	$0xf80,%r15
+	vmovdqu	16(%r8),%xmm8
+	andq	$-128,%rsp
+	vmovdqu	(%r11),%xmm0
+	leaq	128(%rcx),%rcx
+	leaq	16+32(%r9),%r9
+	movl	240-128(%rcx),%ebp
+	vpshufb	%xmm0,%xmm8,%xmm8
+
+	andq	%r15,%r14
+	andq	%rsp,%r15
+	subq	%r14,%r15
+	jc	L$dec_no_key_aliasing
+	cmpq	$768,%r15
+	jnc	L$dec_no_key_aliasing
+	subq	%r15,%rsp
+L$dec_no_key_aliasing:
+
+	vmovdqu	80(%rdi),%xmm7
+	leaq	(%rdi),%r14
+	vmovdqu	64(%rdi),%xmm4
+
+
+
+
+
+
+
+	leaq	-192(%rdi,%rdx,1),%r15
+
+	vmovdqu	48(%rdi),%xmm5
+	shrq	$4,%rdx
+	xorq	%r10,%r10
+	vmovdqu	32(%rdi),%xmm6
+	vpshufb	%xmm0,%xmm7,%xmm7
+	vmovdqu	16(%rdi),%xmm2
+	vpshufb	%xmm0,%xmm4,%xmm4
+	vmovdqu	(%rdi),%xmm3
+	vpshufb	%xmm0,%xmm5,%xmm5
+	vmovdqu	%xmm4,48(%rsp)
+	vpshufb	%xmm0,%xmm6,%xmm6
+	vmovdqu	%xmm5,64(%rsp)
+	vpshufb	%xmm0,%xmm2,%xmm2
+	vmovdqu	%xmm6,80(%rsp)
+	vpshufb	%xmm0,%xmm3,%xmm3
+	vmovdqu	%xmm2,96(%rsp)
+	vmovdqu	%xmm3,112(%rsp)
+
+	call	_aesni_ctr32_ghash_6x
+
+	vmovups	%xmm9,-96(%rsi)
+	vmovups	%xmm10,-80(%rsi)
+	vmovups	%xmm11,-64(%rsi)
+	vmovups	%xmm12,-48(%rsi)
+	vmovups	%xmm13,-32(%rsi)
+	vmovups	%xmm14,-16(%rsi)
+
+	vpshufb	(%r11),%xmm8,%xmm8
+	vmovdqu	%xmm8,16(%r8)
+
+	vzeroupper
+	movq	-48(%rax),%r15
+
+	movq	-40(%rax),%r14
+
+	movq	-32(%rax),%r13
+
+	movq	-24(%rax),%r12
+
+	movq	-16(%rax),%rbp
+
+	movq	-8(%rax),%rbx
+
+	leaq	(%rax),%rsp
+
+L$gcm_dec_abort:
+	movq	%r10,%rax
+	ret
+
+
+
+.p2align	5
+_aesni_ctr32_6x:
+
+	vmovdqu	0-128(%rcx),%xmm4
+	vmovdqu	32(%r11),%xmm2
+	leaq	-1(%rbp),%r13
+	vmovups	16-128(%rcx),%xmm15
+	leaq	32-128(%rcx),%r12
+	vpxor	%xmm4,%xmm1,%xmm9
+	addl	$100663296,%ebx
+	jc	L$handle_ctr32_2
+	vpaddb	%xmm2,%xmm1,%xmm10
+	vpaddb	%xmm2,%xmm10,%xmm11
+	vpxor	%xmm4,%xmm10,%xmm10
+	vpaddb	%xmm2,%xmm11,%xmm12
+	vpxor	%xmm4,%xmm11,%xmm11
+	vpaddb	%xmm2,%xmm12,%xmm13
+	vpxor	%xmm4,%xmm12,%xmm12
+	vpaddb	%xmm2,%xmm13,%xmm14
+	vpxor	%xmm4,%xmm13,%xmm13
+	vpaddb	%xmm2,%xmm14,%xmm1
+	vpxor	%xmm4,%xmm14,%xmm14
+	jmp	L$oop_ctr32
+
+.p2align	4
+L$oop_ctr32:
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vmovups	(%r12),%xmm15
+	leaq	16(%r12),%r12
+	decl	%r13d
+	jnz	L$oop_ctr32
+
+	vmovdqu	(%r12),%xmm3
+	vaesenc	%xmm15,%xmm9,%xmm9
+	vpxor	0(%rdi),%xmm3,%xmm4
+	vaesenc	%xmm15,%xmm10,%xmm10
+	vpxor	16(%rdi),%xmm3,%xmm5
+	vaesenc	%xmm15,%xmm11,%xmm11
+	vpxor	32(%rdi),%xmm3,%xmm6
+	vaesenc	%xmm15,%xmm12,%xmm12
+	vpxor	48(%rdi),%xmm3,%xmm8
+	vaesenc	%xmm15,%xmm13,%xmm13
+	vpxor	64(%rdi),%xmm3,%xmm2
+	vaesenc	%xmm15,%xmm14,%xmm14
+	vpxor	80(%rdi),%xmm3,%xmm3
+	leaq	96(%rdi),%rdi
+
+	vaesenclast	%xmm4,%xmm9,%xmm9
+	vaesenclast	%xmm5,%xmm10,%xmm10
+	vaesenclast	%xmm6,%xmm11,%xmm11
+	vaesenclast	%xmm8,%xmm12,%xmm12
+	vaesenclast	%xmm2,%xmm13,%xmm13
+	vaesenclast	%xmm3,%xmm14,%xmm14
+	vmovups	%xmm9,0(%rsi)
+	vmovups	%xmm10,16(%rsi)
+	vmovups	%xmm11,32(%rsi)
+	vmovups	%xmm12,48(%rsi)
+	vmovups	%xmm13,64(%rsi)
+	vmovups	%xmm14,80(%rsi)
+	leaq	96(%rsi),%rsi
+
+	ret
+.p2align	5
+L$handle_ctr32_2:
+	vpshufb	%xmm0,%xmm1,%xmm6
+	vmovdqu	48(%r11),%xmm5
+	vpaddd	64(%r11),%xmm6,%xmm10
+	vpaddd	%xmm5,%xmm6,%xmm11
+	vpaddd	%xmm5,%xmm10,%xmm12
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm11,%xmm13
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vpxor	%xmm4,%xmm10,%xmm10
+	vpaddd	%xmm5,%xmm12,%xmm14
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vpxor	%xmm4,%xmm11,%xmm11
+	vpaddd	%xmm5,%xmm13,%xmm1
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vpxor	%xmm4,%xmm12,%xmm12
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vpxor	%xmm4,%xmm13,%xmm13
+	vpshufb	%xmm0,%xmm1,%xmm1
+	vpxor	%xmm4,%xmm14,%xmm14
+	jmp	L$oop_ctr32
+
+
+
+.globl	_aesni_gcm_encrypt
+
+.p2align	5
+_aesni_gcm_encrypt:
+
+	xorq	%r10,%r10
+
+
+
+
+	cmpq	$288,%rdx
+	jb	L$gcm_enc_abort
+
+	leaq	(%rsp),%rax
+
+	pushq	%rbx
+
+	pushq	%rbp
+
+	pushq	%r12
+
+	pushq	%r13
+
+	pushq	%r14
+
+	pushq	%r15
+
+	vzeroupper
+
+	vmovdqu	(%r8),%xmm1
+	addq	$-128,%rsp
+	movl	12(%r8),%ebx
+	leaq	L$bswap_mask(%rip),%r11
+	leaq	-128(%rcx),%r14
+	movq	$0xf80,%r15
+	leaq	128(%rcx),%rcx
+	vmovdqu	(%r11),%xmm0
+	andq	$-128,%rsp
+	movl	240-128(%rcx),%ebp
+
+	andq	%r15,%r14
+	andq	%rsp,%r15
+	subq	%r14,%r15
+	jc	L$enc_no_key_aliasing
+	cmpq	$768,%r15
+	jnc	L$enc_no_key_aliasing
+	subq	%r15,%rsp
+L$enc_no_key_aliasing:
+
+	leaq	(%rsi),%r14
+
+
+
+
+
+
+
+
+	leaq	-192(%rsi,%rdx,1),%r15
+
+	shrq	$4,%rdx
+
+	call	_aesni_ctr32_6x
+
+	vpshufb	%xmm0,%xmm9,%xmm8
+	vpshufb	%xmm0,%xmm10,%xmm2
+	vmovdqu	%xmm8,112(%rsp)
+	vpshufb	%xmm0,%xmm11,%xmm4
+	vmovdqu	%xmm2,96(%rsp)
+	vpshufb	%xmm0,%xmm12,%xmm5
+	vmovdqu	%xmm4,80(%rsp)
+	vpshufb	%xmm0,%xmm13,%xmm6
+	vmovdqu	%xmm5,64(%rsp)
+	vpshufb	%xmm0,%xmm14,%xmm7
+	vmovdqu	%xmm6,48(%rsp)
+
+	call	_aesni_ctr32_6x
+
+	vmovdqu	16(%r8),%xmm8
+	leaq	16+32(%r9),%r9
+	subq	$12,%rdx
+	movq	$192,%r10
+	vpshufb	%xmm0,%xmm8,%xmm8
+
+	call	_aesni_ctr32_ghash_6x
+	vmovdqu	32(%rsp),%xmm7
+	vmovdqu	(%r11),%xmm0
+	vmovdqu	0-32(%r9),%xmm3
+	vpunpckhqdq	%xmm7,%xmm7,%xmm1
+	vmovdqu	32-32(%r9),%xmm15
+	vmovups	%xmm9,-96(%rsi)
+	vpshufb	%xmm0,%xmm9,%xmm9
+	vpxor	%xmm7,%xmm1,%xmm1
+	vmovups	%xmm10,-80(%rsi)
+	vpshufb	%xmm0,%xmm10,%xmm10
+	vmovups	%xmm11,-64(%rsi)
+	vpshufb	%xmm0,%xmm11,%xmm11
+	vmovups	%xmm12,-48(%rsi)
+	vpshufb	%xmm0,%xmm12,%xmm12
+	vmovups	%xmm13,-32(%rsi)
+	vpshufb	%xmm0,%xmm13,%xmm13
+	vmovups	%xmm14,-16(%rsi)
+	vpshufb	%xmm0,%xmm14,%xmm14
+	vmovdqu	%xmm9,16(%rsp)
+	vmovdqu	48(%rsp),%xmm6
+	vmovdqu	16-32(%r9),%xmm0
+	vpunpckhqdq	%xmm6,%xmm6,%xmm2
+	vpclmulqdq	$0x00,%xmm3,%xmm7,%xmm5
+	vpxor	%xmm6,%xmm2,%xmm2
+	vpclmulqdq	$0x11,%xmm3,%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm15,%xmm1,%xmm1
+
+	vmovdqu	64(%rsp),%xmm9
+	vpclmulqdq	$0x00,%xmm0,%xmm6,%xmm4
+	vmovdqu	48-32(%r9),%xmm3
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm9,%xmm9,%xmm5
+	vpclmulqdq	$0x11,%xmm0,%xmm6,%xmm6
+	vpxor	%xmm9,%xmm5,%xmm5
+	vpxor	%xmm7,%xmm6,%xmm6
+	vpclmulqdq	$0x10,%xmm15,%xmm2,%xmm2
+	vmovdqu	80-32(%r9),%xmm15
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vmovdqu	80(%rsp),%xmm1
+	vpclmulqdq	$0x00,%xmm3,%xmm9,%xmm7
+	vmovdqu	64-32(%r9),%xmm0
+	vpxor	%xmm4,%xmm7,%xmm7
+	vpunpckhqdq	%xmm1,%xmm1,%xmm4
+	vpclmulqdq	$0x11,%xmm3,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm4,%xmm4
+	vpxor	%xmm6,%xmm9,%xmm9
+	vpclmulqdq	$0x00,%xmm15,%xmm5,%xmm5
+	vpxor	%xmm2,%xmm5,%xmm5
+
+	vmovdqu	96(%rsp),%xmm2
+	vpclmulqdq	$0x00,%xmm0,%xmm1,%xmm6
+	vmovdqu	96-32(%r9),%xmm3
+	vpxor	%xmm7,%xmm6,%xmm6
+	vpunpckhqdq	%xmm2,%xmm2,%xmm7
+	vpclmulqdq	$0x11,%xmm0,%xmm1,%xmm1
+	vpxor	%xmm2,%xmm7,%xmm7
+	vpxor	%xmm9,%xmm1,%xmm1
+	vpclmulqdq	$0x10,%xmm15,%xmm4,%xmm4
+	vmovdqu	128-32(%r9),%xmm15
+	vpxor	%xmm5,%xmm4,%xmm4
+
+	vpxor	112(%rsp),%xmm8,%xmm8
+	vpclmulqdq	$0x00,%xmm3,%xmm2,%xmm5
+	vmovdqu	112-32(%r9),%xmm0
+	vpunpckhqdq	%xmm8,%xmm8,%xmm9
+	vpxor	%xmm6,%xmm5,%xmm5
+	vpclmulqdq	$0x11,%xmm3,%xmm2,%xmm2
+	vpxor	%xmm8,%xmm9,%xmm9
+	vpxor	%xmm1,%xmm2,%xmm2
+	vpclmulqdq	$0x00,%xmm15,%xmm7,%xmm7
+	vpxor	%xmm4,%xmm7,%xmm4
+
+	vpclmulqdq	$0x00,%xmm0,%xmm8,%xmm6
+	vmovdqu	0-32(%r9),%xmm3
+	vpunpckhqdq	%xmm14,%xmm14,%xmm1
+	vpclmulqdq	$0x11,%xmm0,%xmm8,%xmm8
+	vpxor	%xmm14,%xmm1,%xmm1
+	vpxor	%xmm5,%xmm6,%xmm5
+	vpclmulqdq	$0x10,%xmm15,%xmm9,%xmm9
+	vmovdqu	32-32(%r9),%xmm15
+	vpxor	%xmm2,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm9,%xmm6
+
+	vmovdqu	16-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm7,%xmm9
+	vpclmulqdq	$0x00,%xmm3,%xmm14,%xmm4
+	vpxor	%xmm9,%xmm6,%xmm6
+	vpunpckhqdq	%xmm13,%xmm13,%xmm2
+	vpclmulqdq	$0x11,%xmm3,%xmm14,%xmm14
+	vpxor	%xmm13,%xmm2,%xmm2
+	vpslldq	$8,%xmm6,%xmm9
+	vpclmulqdq	$0x00,%xmm15,%xmm1,%xmm1
+	vpxor	%xmm9,%xmm5,%xmm8
+	vpsrldq	$8,%xmm6,%xmm6
+	vpxor	%xmm6,%xmm7,%xmm7
+
+	vpclmulqdq	$0x00,%xmm0,%xmm13,%xmm5
+	vmovdqu	48-32(%r9),%xmm3
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpunpckhqdq	%xmm12,%xmm12,%xmm9
+	vpclmulqdq	$0x11,%xmm0,%xmm13,%xmm13
+	vpxor	%xmm12,%xmm9,%xmm9
+	vpxor	%xmm14,%xmm13,%xmm13
+	vpalignr	$8,%xmm8,%xmm8,%xmm14
+	vpclmulqdq	$0x10,%xmm15,%xmm2,%xmm2
+	vmovdqu	80-32(%r9),%xmm15
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vpclmulqdq	$0x00,%xmm3,%xmm12,%xmm4
+	vmovdqu	64-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm11,%xmm11,%xmm1
+	vpclmulqdq	$0x11,%xmm3,%xmm12,%xmm12
+	vpxor	%xmm11,%xmm1,%xmm1
+	vpxor	%xmm13,%xmm12,%xmm12
+	vxorps	16(%rsp),%xmm7,%xmm7
+	vpclmulqdq	$0x00,%xmm15,%xmm9,%xmm9
+	vpxor	%xmm2,%xmm9,%xmm9
+
+	vpclmulqdq	$0x10,16(%r11),%xmm8,%xmm8
+	vxorps	%xmm14,%xmm8,%xmm8
+
+	vpclmulqdq	$0x00,%xmm0,%xmm11,%xmm5
+	vmovdqu	96-32(%r9),%xmm3
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpunpckhqdq	%xmm10,%xmm10,%xmm2
+	vpclmulqdq	$0x11,%xmm0,%xmm11,%xmm11
+	vpxor	%xmm10,%xmm2,%xmm2
+	vpalignr	$8,%xmm8,%xmm8,%xmm14
+	vpxor	%xmm12,%xmm11,%xmm11
+	vpclmulqdq	$0x10,%xmm15,%xmm1,%xmm1
+	vmovdqu	128-32(%r9),%xmm15
+	vpxor	%xmm9,%xmm1,%xmm1
+
+	vxorps	%xmm7,%xmm14,%xmm14
+	vpclmulqdq	$0x10,16(%r11),%xmm8,%xmm8
+	vxorps	%xmm14,%xmm8,%xmm8
+
+	vpclmulqdq	$0x00,%xmm3,%xmm10,%xmm4
+	vmovdqu	112-32(%r9),%xmm0
+	vpxor	%xmm5,%xmm4,%xmm4
+	vpunpckhqdq	%xmm8,%xmm8,%xmm9
+	vpclmulqdq	$0x11,%xmm3,%xmm10,%xmm10
+	vpxor	%xmm8,%xmm9,%xmm9
+	vpxor	%xmm11,%xmm10,%xmm10
+	vpclmulqdq	$0x00,%xmm15,%xmm2,%xmm2
+	vpxor	%xmm1,%xmm2,%xmm2
+
+	vpclmulqdq	$0x00,%xmm0,%xmm8,%xmm5
+	vpclmulqdq	$0x11,%xmm0,%xmm8,%xmm7
+	vpxor	%xmm4,%xmm5,%xmm5
+	vpclmulqdq	$0x10,%xmm15,%xmm9,%xmm6
+	vpxor	%xmm10,%xmm7,%xmm7
+	vpxor	%xmm2,%xmm6,%xmm6
+
+	vpxor	%xmm5,%xmm7,%xmm4
+	vpxor	%xmm4,%xmm6,%xmm6
+	vpslldq	$8,%xmm6,%xmm1
+	vmovdqu	16(%r11),%xmm3
+	vpsrldq	$8,%xmm6,%xmm6
+	vpxor	%xmm1,%xmm5,%xmm8
+	vpxor	%xmm6,%xmm7,%xmm7
+
+	vpalignr	$8,%xmm8,%xmm8,%xmm2
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm8
+	vpxor	%xmm2,%xmm8,%xmm8
+
+	vpalignr	$8,%xmm8,%xmm8,%xmm2
+	vpclmulqdq	$0x10,%xmm3,%xmm8,%xmm8
+	vpxor	%xmm7,%xmm2,%xmm2
+	vpxor	%xmm2,%xmm8,%xmm8
+	vpshufb	(%r11),%xmm8,%xmm8
+	vmovdqu	%xmm8,16(%r8)
+
+	vzeroupper
+	movq	-48(%rax),%r15
+
+	movq	-40(%rax),%r14
+
+	movq	-32(%rax),%r13
+
+	movq	-24(%rax),%r12
+
+	movq	-16(%rax),%rbp
+
+	movq	-8(%rax),%rbx
+
+	leaq	(%rax),%rsp
+
+L$gcm_enc_abort:
+	movq	%r10,%rax
+	ret
+
+
+.p2align	6
+L$bswap_mask:
+.byte	15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
+L$poly:
+.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2
+L$one_msb:
+.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
+L$two_lsb:
+.byte	2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+L$one_lsb:
+.byte	1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
+.byte	65,69,83,45,78,73,32,71,67,77,32,109,111,100,117,108,101,32,102,111,114,32,120,56,54,95,54,52,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
+.p2align	6
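Note on the constant table that closes the file above: `L$bswap_mask` is used as a `vpshufb` control that reverses the 16 bytes of a block, and the long unlabeled `.byte` run is a NUL-terminated ASCII identification string. The Python sketch below illustrates both readings; the helper names are hypothetical, and the shuffle model assumes no mask byte has its high bit set, which holds for this mask.

```python
BSWAP_MASK = bytes([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

IDENT = bytes([
    65, 69, 83, 45, 78, 73, 32, 71, 67, 77, 32, 109, 111, 100, 117, 108,
    101, 32, 102, 111, 114, 32, 120, 56, 54, 95, 54, 52, 44, 32, 67, 82,
    89, 80, 84, 79, 71, 65, 77, 83, 32, 98, 121, 32, 60, 97, 112, 112,
    114, 111, 64, 111, 112, 101, 110, 115, 115, 108, 46, 111, 114, 103,
    62, 0,
])


def pshufb(block: bytes, mask: bytes) -> bytes:
    """Model vpshufb for masks without the high bit set: out[i] = block[mask[i]]."""
    return bytes(block[m & 0x0F] for m in mask)


def decode_ident(raw: bytes) -> str:
    """Strip the trailing NUL terminator and decode the bytes as ASCII."""
    return raw.rstrip(b"\x00").decode("ascii")
```

Applying `pshufb` with `BSWAP_MASK` twice returns the original block, which is why the same mask converts between byte orders in both directions.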
--- a/crypto/aesgcm/aesni_gcm_x64_nasm.asm
+++ b/crypto/aesgcm/aesni_gcm_x64_nasm.asm
--- a/crypto/aesgcm/aesni_x64_gas.s
+++ b/crypto/aesgcm/aesni_x64_gas.s
--- a/crypto/aesgcm/aesni_x64_gas_macosx.s
+++ b/crypto/aesgcm/aesni_x64_gas_macosx.s
--- a/crypto/aesgcm/aesni_x64_nasm.asm
+++ b/crypto/aesgcm/aesni_x64_nasm.asm
--- a/crypto/aesgcm/ghash-x86.pl
+++ b/crypto/aesgcm/ghash-x86.pl
--- a/crypto/aesgcm/ghash-x86_64.pl
+++ b/crypto/aesgcm/ghash-x86_64.pl
--- a/crypto/aesgcm/ghash_x64_gas.s
+++ b/crypto/aesgcm/ghash_x64_gas.s
--- a/crypto/aesgcm/ghash_x64_gas_macosx.s
+++ b/crypto/aesgcm/ghash_x64_gas_macosx.s
--- a/crypto/aesgcm/ghash_x64_nasm.asm
+++ b/crypto/aesgcm/ghash_x64_nasm.asm
--- a/crypto/aesgcm/ghashp8-ppc.pl
+++ b/crypto/aesgcm/ghashp8-ppc.pl
@ -0,0 +1,670 @@
+#! /usr/bin/env perl
+# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License").  You may not use
+# this file except in compliance with the License.  You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
+#
+# ====================================================================
+# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+#
+# GHASH for PowerISA v2.07.
+#
+# July 2014
+#
+# Accurate performance measurements are problematic, because the setup
+# is always virtualized, with a possibly throttled processor.
+# Relative comparison is therefore more informative. This initial
+# version is ~2.1x slower than hardware-assisted AES-128-CTR, ~12x
+# faster than "4-bit" integer-only compiler-generated 64-bit code.
+# "Initial version" means that there is room for further improvement.
+
+# May 2016
+#
+# 2x aggregated reduction improves performance by 50% (resulting
+# performance on POWER8 is 1 cycle per processed byte), and 4x
+# aggregated reduction - by 170% or 2.7x (resulting in 0.55 cpb).
+
+$flavour=shift;
+$output =shift;
+
+if ($flavour =~ /64/) {
+	$SIZE_T=8;
+	$LRSAVE=2*$SIZE_T;
+	$STU="stdu";
+	$POP="ld";
+	$PUSH="std";
+	$UCMP="cmpld";
+	$SHRI="srdi";
+} elsif ($flavour =~ /32/) {
+	$SIZE_T=4;
+	$LRSAVE=$SIZE_T;
+	$STU="stwu";
+	$POP="lwz";
+	$PUSH="stw";
+	$UCMP="cmplw";
+	$SHRI="srwi";
+} else { die "nonsense $flavour"; }
+
+$sp="r1";
+$FRAME=6*$SIZE_T+13*16;	# 13*16 is for v20-v31 offload
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}ppc-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../../../perlasm/ppc-xlate.pl" and -f $xlate) or
+die "can't locate ppc-xlate.pl";
+
+open STDOUT,"| $^X $xlate $flavour $output" or die "can't call $xlate: $!";
+
+my ($Xip,$Htbl,$inp,$len)=map("r$_",(3..6));	# argument block
+
+my ($Xl,$Xm,$Xh,$IN)=map("v$_",(0..3));
+my ($zero,$t0,$t1,$t2,$xC2,$H,$Hh,$Hl,$lemask)=map("v$_",(4..12));
+my ($Xl1,$Xm1,$Xh1,$IN1,$H2,$H2h,$H2l)=map("v$_",(13..19));
+my $vrsave="r12";
+
+$code=<<___;
+.machine	"any"
+
+.text
+
+.globl	.gcm_init_p8
+.align	5
+.gcm_init_p8:
+	li		r0,-4096
+	li		r8,0x10
+	mfspr		$vrsave,256
+	li		r9,0x20
+	mtspr		256,r0
+	li		r10,0x30
+	lvx_u		$H,0,r4			# load H
+
+	vspltisb	$xC2,-16		# 0xf0
+	vspltisb	$t0,1			# one
+	vaddubm		$xC2,$xC2,$xC2		# 0xe0
+	vxor		$zero,$zero,$zero
+	vor		$xC2,$xC2,$t0		# 0xe1
+	vsldoi		$xC2,$xC2,$zero,15	# 0xe1...
+	vsldoi		$t1,$zero,$t0,1		# ...1
+	vaddubm		$xC2,$xC2,$xC2		# 0xc2...
+	vspltisb	$t2,7
+	vor		$xC2,$xC2,$t1		# 0xc2....01
+	vspltb		$t1,$H,0		# most significant byte
+	vsl		$H,$H,$t0		# H<<=1
+	vsrab		$t1,$t1,$t2		# broadcast carry bit
+	vand		$t1,$t1,$xC2
+	vxor		$IN,$H,$t1		# twisted H
+
+	vsldoi		$H,$IN,$IN,8		# twist even more ...
+	vsldoi		$xC2,$zero,$xC2,8	# 0xc2.0
+	vsldoi		$Hl,$zero,$H,8		# ... and split
+	vsldoi		$Hh,$H,$zero,8
+
+	stvx_u		$xC2,0,r3		# save pre-computed table
+	stvx_u		$Hl,r8,r3
+	li		r8,0x40
+	stvx_u		$H, r9,r3
+	li		r9,0x50
+	stvx_u		$Hh,r10,r3
+	li		r10,0x60
+
+	vpmsumd		$Xl,$IN,$Hl		# H.lo·H.lo
+	vpmsumd		$Xm,$IN,$H		# H.hi·H.lo+H.lo·H.hi
+	vpmsumd		$Xh,$IN,$Hh		# H.hi·H.hi
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	vxor		$t1,$t1,$Xh
+	vxor		$IN1,$Xl,$t1
+
+	vsldoi		$H2,$IN1,$IN1,8
+	vsldoi		$H2l,$zero,$H2,8
+	vsldoi		$H2h,$H2,$zero,8
+
+	stvx_u		$H2l,r8,r3		# save H^2
+	li		r8,0x70
+	stvx_u		$H2,r9,r3
+	li		r9,0x80
+	stvx_u		$H2h,r10,r3
+	li		r10,0x90
+___
+{
+my ($t4,$t5,$t6) = ($Hl,$H,$Hh);
+$code.=<<___;
+	vpmsumd		$Xl,$IN,$H2l		# H.lo·H^2.lo
+	 vpmsumd	$Xl1,$IN1,$H2l		# H^2.lo·H^2.lo
+	vpmsumd		$Xm,$IN,$H2		# H.hi·H^2.lo+H.lo·H^2.hi
+	 vpmsumd	$Xm1,$IN1,$H2		# H^2.hi·H^2.lo+H^2.lo·H^2.hi
+	vpmsumd		$Xh,$IN,$H2h		# H.hi·H^2.hi
+	 vpmsumd	$Xh1,$IN1,$H2h		# H^2.hi·H^2.hi
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+	 vpmsumd	$t6,$Xl1,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	 vsldoi		$t4,$Xm1,$zero,8
+	 vsldoi		$t5,$zero,$Xm1,8
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+	 vxor		$Xl1,$Xl1,$t4
+	 vxor		$Xh1,$Xh1,$t5
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	 vsldoi		$Xl1,$Xl1,$Xl1,8
+	vxor		$Xl,$Xl,$t2
+	 vxor		$Xl1,$Xl1,$t6
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	 vsldoi		$t5,$Xl1,$Xl1,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	 vpmsumd	$Xl1,$Xl1,$xC2
+	vxor		$t1,$t1,$Xh
+	 vxor		$t5,$t5,$Xh1
+	vxor		$Xl,$Xl,$t1
+	 vxor		$Xl1,$Xl1,$t5
+
+	vsldoi		$H,$Xl,$Xl,8
+	 vsldoi		$H2,$Xl1,$Xl1,8
+	vsldoi		$Hl,$zero,$H,8
+	vsldoi		$Hh,$H,$zero,8
+	 vsldoi		$H2l,$zero,$H2,8
+	 vsldoi		$H2h,$H2,$zero,8
+
+	stvx_u		$Hl,r8,r3		# save H^3
+	li		r8,0xa0
+	stvx_u		$H,r9,r3
+	li		r9,0xb0
+	stvx_u		$Hh,r10,r3
+	li		r10,0xc0
+	 stvx_u		$H2l,r8,r3		# save H^4
+	 stvx_u		$H2,r9,r3
+	 stvx_u		$H2h,r10,r3
+
+	mtspr		256,$vrsave
+	blr
+	.long		0
+	.byte		0,12,0x14,0,0,0,2,0
+	.long		0
+.size	.gcm_init_p8,.-.gcm_init_p8
+___
+}
+$code.=<<___;
+.globl	.gcm_gmult_p8
+.align	5
+.gcm_gmult_p8:
+	lis		r0,0xfff8
+	li		r8,0x10
+	mfspr		$vrsave,256
+	li		r9,0x20
+	mtspr		256,r0
+	li		r10,0x30
+	lvx_u		$IN,0,$Xip		# load Xi
+
+	lvx_u		$Hl,r8,$Htbl		# load pre-computed table
+	 le?lvsl	$lemask,r0,r0
+	lvx_u		$H, r9,$Htbl
+	 le?vspltisb	$t0,0x07
+	lvx_u		$Hh,r10,$Htbl
+	 le?vxor	$lemask,$lemask,$t0
+	lvx_u		$xC2,0,$Htbl
+	 le?vperm	$IN,$IN,$IN,$lemask
+	vxor		$zero,$zero,$zero
+
+	vpmsumd		$Xl,$IN,$Hl		# H.lo·Xi.lo
+	vpmsumd		$Xm,$IN,$H		# H.hi·Xi.lo+H.lo·Xi.hi
+	vpmsumd		$Xh,$IN,$Hh		# H.hi·Xi.hi
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	vxor		$t1,$t1,$Xh
+	vxor		$Xl,$Xl,$t1
+
+	le?vperm	$Xl,$Xl,$Xl,$lemask
+	stvx_u		$Xl,0,$Xip		# write out Xi
+
+	mtspr		256,$vrsave
+	blr
+	.long		0
+	.byte		0,12,0x14,0,0,0,2,0
+	.long		0
+.size	.gcm_gmult_p8,.-.gcm_gmult_p8
+
+.globl	.gcm_ghash_p8
+.align	5
+.gcm_ghash_p8:
+	li		r0,-4096
+	li		r8,0x10
+	mfspr		$vrsave,256
+	li		r9,0x20
+	mtspr		256,r0
+	li		r10,0x30
+	lvx_u		$Xl,0,$Xip		# load Xi
+
+	lvx_u		$Hl,r8,$Htbl		# load pre-computed table
+	li		r8,0x40
+	 le?lvsl	$lemask,r0,r0
+	lvx_u		$H, r9,$Htbl
+	li		r9,0x50
+	 le?vspltisb	$t0,0x07
+	lvx_u		$Hh,r10,$Htbl
+	li		r10,0x60
+	 le?vxor	$lemask,$lemask,$t0
+	lvx_u		$xC2,0,$Htbl
+	 le?vperm	$Xl,$Xl,$Xl,$lemask
+	vxor		$zero,$zero,$zero
+
+	${UCMP}i	$len,64
+	bge		Lgcm_ghash_p8_4x
+
+	lvx_u		$IN,0,$inp
+	addi		$inp,$inp,16
+	subic.		$len,$len,16
+	 le?vperm	$IN,$IN,$IN,$lemask
+	vxor		$IN,$IN,$Xl
+	beq		Lshort
+
+	lvx_u		$H2l,r8,$Htbl		# load H^2
+	li		r8,16
+	lvx_u		$H2, r9,$Htbl
+	add		r9,$inp,$len		# end of input
+	lvx_u		$H2h,r10,$Htbl
+	be?b		Loop_2x
+
+.align	5
+Loop_2x:
+	lvx_u		$IN1,0,$inp
+	le?vperm	$IN1,$IN1,$IN1,$lemask
+
+	 subic		$len,$len,32
+	vpmsumd		$Xl,$IN,$H2l		# H^2.lo·Xi.lo
+	 vpmsumd	$Xl1,$IN1,$Hl		# H.lo·Xi+1.lo
+	 subfe		r0,r0,r0		# borrow?-1:0
+	vpmsumd		$Xm,$IN,$H2		# H^2.hi·Xi.lo+H^2.lo·Xi.hi
+	 vpmsumd	$Xm1,$IN1,$H		# H.hi·Xi+1.lo+H.lo·Xi+1.hi
+	 and		r0,r0,$len
+	vpmsumd		$Xh,$IN,$H2h		# H^2.hi·Xi.hi
+	 vpmsumd	$Xh1,$IN1,$Hh		# H.hi·Xi+1.hi
+	 add		$inp,$inp,r0
+
+	vxor		$Xl,$Xl,$Xl1
+	vxor		$Xm,$Xm,$Xm1
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	 vxor		$Xh,$Xh,$Xh1
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+	 lvx_u		$IN,r8,$inp
+	 addi		$inp,$inp,32
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	 le?vperm	$IN,$IN,$IN,$lemask
+	vxor		$t1,$t1,$Xh
+	vxor		$IN,$IN,$t1
+	vxor		$IN,$IN,$Xl
+	$UCMP		r9,$inp
+	bgt		Loop_2x			# done yet?
+
+	cmplwi		$len,0
+	bne		Leven
+
+Lshort:
+	vpmsumd		$Xl,$IN,$Hl		# H.lo·Xi.lo
+	vpmsumd		$Xm,$IN,$H		# H.hi·Xi.lo+H.lo·Xi.hi
+	vpmsumd		$Xh,$IN,$Hh		# H.hi·Xi.hi
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	vxor		$t1,$t1,$Xh
+
+Leven:
+	vxor		$Xl,$Xl,$t1
+	le?vperm	$Xl,$Xl,$Xl,$lemask
+	stvx_u		$Xl,0,$Xip		# write out Xi
+
+	mtspr		256,$vrsave
+	blr
+	.long		0
+	.byte		0,12,0x14,0,0,0,4,0
+	.long		0
+___
+{
+my ($Xl3,$Xm2,$IN2,$H3l,$H3,$H3h,
+    $Xh3,$Xm3,$IN3,$H4l,$H4,$H4h) = map("v$_",(20..31));
+my $IN0=$IN;
+my ($H21l,$H21h,$loperm,$hiperm) = ($Hl,$Hh,$H2l,$H2h);
+
+$code.=<<___;
+.align	5
+.gcm_ghash_p8_4x:
+Lgcm_ghash_p8_4x:
+	$STU		$sp,-$FRAME($sp)
+	li		r10,`15+6*$SIZE_T`
+	li		r11,`31+6*$SIZE_T`
+	stvx		v20,r10,$sp
+	addi		r10,r10,32
+	stvx		v21,r11,$sp
+	addi		r11,r11,32
+	stvx		v22,r10,$sp
+	addi		r10,r10,32
+	stvx		v23,r11,$sp
+	addi		r11,r11,32
+	stvx		v24,r10,$sp
+	addi		r10,r10,32
+	stvx		v25,r11,$sp
+	addi		r11,r11,32
+	stvx		v26,r10,$sp
+	addi		r10,r10,32
+	stvx		v27,r11,$sp
+	addi		r11,r11,32
+	stvx		v28,r10,$sp
+	addi		r10,r10,32
+	stvx		v29,r11,$sp
+	addi		r11,r11,32
+	stvx		v30,r10,$sp
+	li		r10,0x60
+	stvx		v31,r11,$sp
+	li		r0,-1
+	stw		$vrsave,`$FRAME-4`($sp)	# save vrsave
+	mtspr		256,r0			# preserve all AltiVec registers
+
+	lvsl		$t0,0,r8		# 0x0001..0e0f
+	#lvx_u		$H2l,r8,$Htbl		# load H^2
+	li		r8,0x70
+	lvx_u		$H2, r9,$Htbl
+	li		r9,0x80
+	vspltisb	$t1,8			# 0x0808..0808
+	#lvx_u		$H2h,r10,$Htbl
+	li		r10,0x90
+	lvx_u		$H3l,r8,$Htbl		# load H^3
+	li		r8,0xa0
+	lvx_u		$H3, r9,$Htbl
+	li		r9,0xb0
+	lvx_u		$H3h,r10,$Htbl
+	li		r10,0xc0
+	lvx_u		$H4l,r8,$Htbl		# load H^4
+	li		r8,0x10
+	lvx_u		$H4, r9,$Htbl
+	li		r9,0x20
+	lvx_u		$H4h,r10,$Htbl
+	li		r10,0x30
+
+	vsldoi		$t2,$zero,$t1,8		# 0x0000..0808
+	vaddubm		$hiperm,$t0,$t2		# 0x0001..1617
+	vaddubm		$loperm,$t1,$hiperm	# 0x0809..1e1f
+
+	$SHRI		$len,$len,4		# this allows using the sign bit
+						# as carry
+	lvx_u		$IN0,0,$inp		# load input
+	lvx_u		$IN1,r8,$inp
+	subic.		$len,$len,8
+	lvx_u		$IN2,r9,$inp
+	lvx_u		$IN3,r10,$inp
+	addi		$inp,$inp,0x40
+	le?vperm	$IN0,$IN0,$IN0,$lemask
+	le?vperm	$IN1,$IN1,$IN1,$lemask
+	le?vperm	$IN2,$IN2,$IN2,$lemask
+	le?vperm	$IN3,$IN3,$IN3,$lemask
+
+	vxor		$Xh,$IN0,$Xl
+
+	 vpmsumd	$Xl1,$IN1,$H3l
+	 vpmsumd	$Xm1,$IN1,$H3
+	 vpmsumd	$Xh1,$IN1,$H3h
+
+	 vperm		$H21l,$H2,$H,$hiperm
+	 vperm		$t0,$IN2,$IN3,$loperm
+	 vperm		$H21h,$H2,$H,$loperm
+	 vperm		$t1,$IN2,$IN3,$hiperm
+	 vpmsumd	$Xm2,$IN2,$H2		# H^2.lo·Xi+2.hi+H^2.hi·Xi+2.lo
+	 vpmsumd	$Xl3,$t0,$H21l		# H^2.lo·Xi+2.lo+H.lo·Xi+3.lo
+	 vpmsumd	$Xm3,$IN3,$H		# H.hi·Xi+3.lo  +H.lo·Xi+3.hi
+	 vpmsumd	$Xh3,$t1,$H21h		# H^2.hi·Xi+2.hi+H.hi·Xi+3.hi
+
+	 vxor		$Xm2,$Xm2,$Xm1
+	 vxor		$Xl3,$Xl3,$Xl1
+	 vxor		$Xm3,$Xm3,$Xm2
+	 vxor		$Xh3,$Xh3,$Xh1
+
+	blt		Ltail_4x
+
+Loop_4x:
+	lvx_u		$IN0,0,$inp
+	lvx_u		$IN1,r8,$inp
+	subic.		$len,$len,4
+	lvx_u		$IN2,r9,$inp
+	lvx_u		$IN3,r10,$inp
+	addi		$inp,$inp,0x40
+	le?vperm	$IN1,$IN1,$IN1,$lemask
+	le?vperm	$IN2,$IN2,$IN2,$lemask
+	le?vperm	$IN3,$IN3,$IN3,$lemask
+	le?vperm	$IN0,$IN0,$IN0,$lemask
+
+	vpmsumd		$Xl,$Xh,$H4l		# H^4.lo·Xi.lo
+	vpmsumd		$Xm,$Xh,$H4		# H^4.hi·Xi.lo+H^4.lo·Xi.hi
+	vpmsumd		$Xh,$Xh,$H4h		# H^4.hi·Xi.hi
+	 vpmsumd	$Xl1,$IN1,$H3l
+	 vpmsumd	$Xm1,$IN1,$H3
+	 vpmsumd	$Xh1,$IN1,$H3h
+
+	vxor		$Xl,$Xl,$Xl3
+	vxor		$Xm,$Xm,$Xm3
+	vxor		$Xh,$Xh,$Xh3
+	 vperm		$t0,$IN2,$IN3,$loperm
+	 vperm		$t1,$IN2,$IN3,$hiperm
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+	 vpmsumd	$Xl3,$t0,$H21l		# H.lo·Xi+3.lo  +H^2.lo·Xi+2.lo
+	 vpmsumd	$Xh3,$t1,$H21h		# H.hi·Xi+3.hi  +H^2.hi·Xi+2.hi
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	 vpmsumd	$Xm2,$IN2,$H2		# H^2.hi·Xi+2.lo+H^2.lo·Xi+2.hi
+	 vpmsumd	$Xm3,$IN3,$H		# H.hi·Xi+3.lo  +H.lo·Xi+3.hi
+	vpmsumd		$Xl,$Xl,$xC2
+
+	 vxor		$Xl3,$Xl3,$Xl1
+	 vxor		$Xh3,$Xh3,$Xh1
+	vxor		$Xh,$Xh,$IN0
+	 vxor		$Xm2,$Xm2,$Xm1
+	vxor		$Xh,$Xh,$t1
+	 vxor		$Xm3,$Xm3,$Xm2
+	vxor		$Xh,$Xh,$Xl
+	bge		Loop_4x
+
+Ltail_4x:
+	vpmsumd		$Xl,$Xh,$H4l		# H^4.lo·Xi.lo
+	vpmsumd		$Xm,$Xh,$H4		# H^4.hi·Xi.lo+H^4.lo·Xi.hi
+	vpmsumd		$Xh,$Xh,$H4h		# H^4.hi·Xi.hi
+
+	vxor		$Xl,$Xl,$Xl3
+	vxor		$Xm,$Xm,$Xm3
+
+	vpmsumd		$t2,$Xl,$xC2		# 1st reduction phase
+
+	vsldoi		$t0,$Xm,$zero,8
+	vsldoi		$t1,$zero,$Xm,8
+	 vxor		$Xh,$Xh,$Xh3
+	vxor		$Xl,$Xl,$t0
+	vxor		$Xh,$Xh,$t1
+
+	vsldoi		$Xl,$Xl,$Xl,8
+	vxor		$Xl,$Xl,$t2
+
+	vsldoi		$t1,$Xl,$Xl,8		# 2nd reduction phase
+	vpmsumd		$Xl,$Xl,$xC2
+	vxor		$t1,$t1,$Xh
+	vxor		$Xl,$Xl,$t1
+
+	addic.		$len,$len,4
+	beq		Ldone_4x
+
+	lvx_u		$IN0,0,$inp
+	${UCMP}i	$len,2
+	li		$len,-4
+	blt		Lone
+	lvx_u		$IN1,r8,$inp
+	beq		Ltwo
+
+Lthree:
+	lvx_u		$IN2,r9,$inp
+	le?vperm	$IN0,$IN0,$IN0,$lemask
+	le?vperm	$IN1,$IN1,$IN1,$lemask
+	le?vperm	$IN2,$IN2,$IN2,$lemask
+
+	vxor		$Xh,$IN0,$Xl
+	vmr		$H4l,$H3l
+	vmr		$H4, $H3
+	vmr		$H4h,$H3h
+
+	vperm		$t0,$IN1,$IN2,$loperm
+	vperm		$t1,$IN1,$IN2,$hiperm
+	vpmsumd		$Xm2,$IN1,$H2		# H^2.lo·Xi+1.hi+H^2.hi·Xi+1.lo
+	vpmsumd		$Xm3,$IN2,$H		# H.hi·Xi+2.lo  +H.lo·Xi+2.hi
+	vpmsumd		$Xl3,$t0,$H21l		# H^2.lo·Xi+1.lo+H.lo·Xi+2.lo
+	vpmsumd		$Xh3,$t1,$H21h		# H^2.hi·Xi+1.hi+H.hi·Xi+2.hi
+
+	vxor		$Xm3,$Xm3,$Xm2
+	b		Ltail_4x
+
+.align	4
+Ltwo:
+	le?vperm	$IN0,$IN0,$IN0,$lemask
+	le?vperm	$IN1,$IN1,$IN1,$lemask
+
+	vxor		$Xh,$IN0,$Xl
+	vperm		$t0,$zero,$IN1,$loperm
+	vperm		$t1,$zero,$IN1,$hiperm
+
+	vsldoi		$H4l,$zero,$H2,8
+	vmr		$H4, $H2
+	vsldoi		$H4h,$H2,$zero,8
+
+	vpmsumd		$Xl3,$t0, $H21l		# H.lo·Xi+1.lo
+	vpmsumd		$Xm3,$IN1,$H		# H.hi·Xi+1.lo+H.lo·Xi+1.hi
+	vpmsumd		$Xh3,$t1, $H21h		# H.hi·Xi+1.hi
+
+	b		Ltail_4x
+
+.align	4
+Lone:
+	le?vperm	$IN0,$IN0,$IN0,$lemask
+
+	vsldoi		$H4l,$zero,$H,8
+	vmr		$H4, $H
+	vsldoi		$H4h,$H,$zero,8
+
+	vxor		$Xh,$IN0,$Xl
+	vxor		$Xl3,$Xl3,$Xl3
+	vxor		$Xm3,$Xm3,$Xm3
+	vxor		$Xh3,$Xh3,$Xh3
+
+	b		Ltail_4x
+
+Ldone_4x:
+	le?vperm	$Xl,$Xl,$Xl,$lemask
+	stvx_u		$Xl,0,$Xip		# write out Xi
+
+	li		r10,`15+6*$SIZE_T`
+	li		r11,`31+6*$SIZE_T`
+	mtspr		256,$vrsave
+	lvx		v20,r10,$sp
+	addi		r10,r10,32
+	lvx		v21,r11,$sp
+	addi		r11,r11,32
+	lvx		v22,r10,$sp
+	addi		r10,r10,32
+	lvx		v23,r11,$sp
+	addi		r11,r11,32
+	lvx		v24,r10,$sp
+	addi		r10,r10,32
+	lvx		v25,r11,$sp
+	addi		r11,r11,32
+	lvx		v26,r10,$sp
+	addi		r10,r10,32
+	lvx		v27,r11,$sp
+	addi		r11,r11,32
+	lvx		v28,r10,$sp
+	addi		r10,r10,32
+	lvx		v29,r11,$sp
+	addi		r11,r11,32
+	lvx		v30,r10,$sp
+	lvx		v31,r11,$sp
+	addi		$sp,$sp,$FRAME
+	blr
+	.long		0
+	.byte		0,12,0x04,0,0x80,0,4,0
+	.long		0
+___
+}
+$code.=<<___;
+.size	.gcm_ghash_p8,.-.gcm_ghash_p8
+
+.asciz  "GHASH for PowerISA 2.07, CRYPTOGAMS by <appro\@openssl.org>"
+.align  2
+___
+
+foreach (split("\n",$code)) {
+	s/\`([^\`]*)\`/eval $1/geo;
+
+	if ($flavour =~ /le$/o) {	# little-endian
+	    s/le\?//o		or
+	    s/be\?/#be#/o;
+	} else {
+	    s/le\?/#le#/o	or
+	    s/be\?//o;
+	}
+	print $_,"\n";
+}
+
+close STDOUT; # enforce flush
--- a/crypto/aesgcm/ghashv8-armx.pl
+++ b/crypto/aesgcm/ghashv8-armx.pl
@ -0,0 +1,430 @@
+#! /usr/bin/env perl
+# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License").  You may not use
+# this file except in compliance with the License.  You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
+#
+# ====================================================================
+# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+#
+# GHASH for ARMv8 Crypto Extension, 64-bit polynomial multiplication.
+#
+# June 2014
+#
+# Initial version was developed in tight cooperation with Ard
+# Biesheuvel <ard.biesheuvel@linaro.org> from bits-n-pieces from
+# other assembly modules. Just like aesv8-armx.pl this module
+# supports both AArch32 and AArch64 execution modes.
+#
+# July 2014
+#
+# Implement 2x aggregated reduction [see ghash-x86.pl for background
+# information].
+#
+# Current performance in cycles per processed byte:
+#
+#		PMULL[2]	32-bit NEON(*)
+# Apple A7	0.92		5.62
+# Cortex-A53	1.01		8.39
+# Cortex-A57	1.17		7.61
+# Denver	0.71		6.02
+# Mongoose	1.10		8.06
+#
+# (*)	presented for reference/comparison purposes;
+
+$flavour = shift;
+$output  = shift;
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+( $xlate="${dir}../../../perlasm/arm-xlate.pl" and -f $xlate) or
+die "can't locate arm-xlate.pl";
+
+open OUT,"| \"$^X\" $xlate $flavour $output";
+*STDOUT=*OUT;
+
+$Xi="x0";	# argument block
+$Htbl="x1";
+$inp="x2";
+$len="x3";
+
+$inc="x12";
+
+{
+my ($Xl,$Xm,$Xh,$IN)=map("q$_",(0..3));
+my ($t0,$t1,$t2,$xC2,$H,$Hhl,$H2)=map("q$_",(8..14));
+
+$code=<<___;
+#include <openssl/arm_arch.h>
+
+.text
+___
+$code.=".arch	armv8-a+crypto\n"	if ($flavour =~ /64/);
+$code.=<<___				if ($flavour !~ /64/);
+.fpu	neon
+.code	32
+#undef	__thumb2__
+___
+
+################################################################################
+# void gcm_init_v8(u128 Htable[16],const u64 H[2]);
+#
+# input:	128-bit H - secret parameter E(K,0^128)
+# output:	precomputed table filled with degrees of twisted H;
+#		H is twisted to handle reverse bitness of GHASH;
+#		only a few of the 16 slots of Htable[16] are used;
+#		data is opaque to the outside world (which allows the
+#		code to be optimized independently);
+#
+$code.=<<___;
+.global	gcm_init_v8
+.type	gcm_init_v8,%function
+.align	4
+gcm_init_v8:
+	vld1.64		{$t1},[x1]		@ load input H
+	vmov.i8		$xC2,#0xe1
+	vshl.i64	$xC2,$xC2,#57		@ 0xc2.0
+	vext.8		$IN,$t1,$t1,#8
+	vshr.u64	$t2,$xC2,#63
+	vdup.32		$t1,${t1}[1]
+	vext.8		$t0,$t2,$xC2,#8		@ t0=0xc2....01
+	vshr.u64	$t2,$IN,#63
+	vshr.s32	$t1,$t1,#31		@ broadcast carry bit
+	vand		$t2,$t2,$t0
+	vshl.i64	$IN,$IN,#1
+	vext.8		$t2,$t2,$t2,#8
+	vand		$t0,$t0,$t1
+	vorr		$IN,$IN,$t2		@ H<<<=1
+	veor		$H,$IN,$t0		@ twisted H
+	vst1.64		{$H},[x0],#16		@ store Htable[0]
+
+	@ calculate H^2
+	vext.8		$t0,$H,$H,#8		@ Karatsuba pre-processing
+	vpmull.p64	$Xl,$H,$H
+	veor		$t0,$t0,$H
+	vpmull2.p64	$Xh,$H,$H
+	vpmull.p64	$Xm,$t0,$t0
+
+	vext.8		$t1,$Xl,$Xh,#8		@ Karatsuba post-processing
+	veor		$t2,$Xl,$Xh
+	veor		$Xm,$Xm,$t1
+	veor		$Xm,$Xm,$t2
+	vpmull.p64	$t2,$Xl,$xC2		@ 1st phase
+
+	vmov		$Xh#lo,$Xm#hi		@ Xh|Xm - 256-bit result
+	vmov		$Xm#hi,$Xl#lo		@ Xm is rotated Xl
+	veor		$Xl,$Xm,$t2
+
+	vext.8		$t2,$Xl,$Xl,#8		@ 2nd phase
+	vpmull.p64	$Xl,$Xl,$xC2
+	veor		$t2,$t2,$Xh
+	veor		$H2,$Xl,$t2
+
+	vext.8		$t1,$H2,$H2,#8		@ Karatsuba pre-processing
+	veor		$t1,$t1,$H2
+	vext.8		$Hhl,$t0,$t1,#8		@ pack Karatsuba pre-processed
+	vst1.64		{$Hhl-$H2},[x0]		@ store Htable[1..2]
+
+	ret
+.size	gcm_init_v8,.-gcm_init_v8
+___
+################################################################################
+# void gcm_gmult_v8(u64 Xi[2],const u128 Htable[16]);
+#
+# input:	Xi - current hash value;
+#		Htable - table precomputed in gcm_init_v8;
+# output:	Xi - next hash value Xi;
+#
+$code.=<<___;
+.global	gcm_gmult_v8
+.type	gcm_gmult_v8,%function
+.align	4
+gcm_gmult_v8:
+	vld1.64		{$t1},[$Xi]		@ load Xi
+	vmov.i8		$xC2,#0xe1
+	vld1.64		{$H-$Hhl},[$Htbl]	@ load twisted H, ...
+	vshl.u64	$xC2,$xC2,#57
+#ifndef __ARMEB__
+	vrev64.8	$t1,$t1
+#endif
+	vext.8		$IN,$t1,$t1,#8
+
+	vpmull.p64	$Xl,$H,$IN		@ H.lo·Xi.lo
+	veor		$t1,$t1,$IN		@ Karatsuba pre-processing
+	vpmull2.p64	$Xh,$H,$IN		@ H.hi·Xi.hi
+	vpmull.p64	$Xm,$Hhl,$t1		@ (H.lo+H.hi)·(Xi.lo+Xi.hi)
+
+	vext.8		$t1,$Xl,$Xh,#8		@ Karatsuba post-processing
+	veor		$t2,$Xl,$Xh
+	veor		$Xm,$Xm,$t1
+	veor		$Xm,$Xm,$t2
+	vpmull.p64	$t2,$Xl,$xC2		@ 1st phase of reduction
+
+	vmov		$Xh#lo,$Xm#hi		@ Xh|Xm - 256-bit result
+	vmov		$Xm#hi,$Xl#lo		@ Xm is rotated Xl
+	veor		$Xl,$Xm,$t2
+
+	vext.8		$t2,$Xl,$Xl,#8		@ 2nd phase of reduction
+	vpmull.p64	$Xl,$Xl,$xC2
+	veor		$t2,$t2,$Xh
+	veor		$Xl,$Xl,$t2
+
+#ifndef __ARMEB__
+	vrev64.8	$Xl,$Xl
+#endif
+	vext.8		$Xl,$Xl,$Xl,#8
+	vst1.64		{$Xl},[$Xi]		@ write out Xi
+
+	ret
+.size	gcm_gmult_v8,.-gcm_gmult_v8
+___
+################################################################################
+# void gcm_ghash_v8(u64 Xi[2],const u128 Htable[16],const u8 *inp,size_t len);
+#
+# input:	table precomputed in gcm_init_v8;
+#		current hash value Xi;
+#		pointer to input data;
+#		length of input data in bytes, but divisible by block size;
+# output:	next hash value Xi;
+#
+$code.=<<___;
+.global	gcm_ghash_v8
+.type	gcm_ghash_v8,%function
+.align	4
+gcm_ghash_v8:
+___
+$code.=<<___		if ($flavour !~ /64/);
+	vstmdb		sp!,{d8-d15}		@ 32-bit ABI says so
+___
+$code.=<<___;
+	vld1.64		{$Xl},[$Xi]		@ load [rotated] Xi
+						@ "[rotated]" means that
+						@ loaded value would have
+						@ to be rotated in order to
+						@ make it appear as in
+						@ algorithm specification
+	subs		$len,$len,#32		@ see if $len is 32 or larger
+	mov		$inc,#16		@ $inc is used as post-
+						@ increment for input pointer;
+						@ as loop is modulo-scheduled
+						@ $inc is zeroed just in time
+						@ to preclude overstepping
+						@ inp[len], which means that
+						@ last block[s] are actually
+						@ loaded twice, but last
+						@ copy is not processed
+	vld1.64		{$H-$Hhl},[$Htbl],#32	@ load twisted H, ..., H^2
+	vmov.i8		$xC2,#0xe1
+	vld1.64		{$H2},[$Htbl]
+	cclr		$inc,eq			@ is it time to zero $inc?
+	vext.8		$Xl,$Xl,$Xl,#8		@ rotate Xi
+	vld1.64		{$t0},[$inp],#16	@ load [rotated] I[0]
+	vshl.u64	$xC2,$xC2,#57		@ compose 0xc2.0 constant
+#ifndef __ARMEB__
+	vrev64.8	$t0,$t0
+	vrev64.8	$Xl,$Xl
+#endif
+	vext.8		$IN,$t0,$t0,#8		@ rotate I[0]
+	b.lo		.Lodd_tail_v8		@ $len was less than 32
+___
+{ my ($Xln,$Xmn,$Xhn,$In) = map("q$_",(4..7));
+	#######
+	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
+	#	[(H*Ii+1) + (H*Xi+1)] mod P =
+	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
+	#
+$code.=<<___;
+	vld1.64		{$t1},[$inp],$inc	@ load [rotated] I[1]
+#ifndef __ARMEB__
+	vrev64.8	$t1,$t1
+#endif
+	vext.8		$In,$t1,$t1,#8
+	veor		$IN,$IN,$Xl		@ I[i]^=Xi
+	vpmull.p64	$Xln,$H,$In		@ H·Ii+1
+	veor		$t1,$t1,$In		@ Karatsuba pre-processing
+	vpmull2.p64	$Xhn,$H,$In
+	b		.Loop_mod2x_v8
+
+.align	4
+.Loop_mod2x_v8:
+	vext.8		$t2,$IN,$IN,#8
+	subs		$len,$len,#32		@ is there more data?
+	vpmull.p64	$Xl,$H2,$IN		@ H^2.lo·Xi.lo
+	cclr		$inc,lo			@ is it time to zero $inc?
+
+	 vpmull.p64	$Xmn,$Hhl,$t1
+	veor		$t2,$t2,$IN		@ Karatsuba pre-processing
+	vpmull2.p64	$Xh,$H2,$IN		@ H^2.hi·Xi.hi
+	veor		$Xl,$Xl,$Xln		@ accumulate
+	vpmull2.p64	$Xm,$Hhl,$t2		@ (H^2.lo+H^2.hi)·(Xi.lo+Xi.hi)
+	 vld1.64	{$t0},[$inp],$inc	@ load [rotated] I[i+2]
+
+	veor		$Xh,$Xh,$Xhn
+	 cclr		$inc,eq			@ is it time to zero $inc?
+	veor		$Xm,$Xm,$Xmn
+
+	vext.8		$t1,$Xl,$Xh,#8		@ Karatsuba post-processing
+	veor		$t2,$Xl,$Xh
+	veor		$Xm,$Xm,$t1
+	 vld1.64	{$t1},[$inp],$inc	@ load [rotated] I[i+3]
+#ifndef __ARMEB__
+	 vrev64.8	$t0,$t0
+#endif
+	veor		$Xm,$Xm,$t2
+	vpmull.p64	$t2,$Xl,$xC2		@ 1st phase of reduction
+
+#ifndef __ARMEB__
+	 vrev64.8	$t1,$t1
+#endif
+	vmov		$Xh#lo,$Xm#hi		@ Xh|Xm - 256-bit result
+	vmov		$Xm#hi,$Xl#lo		@ Xm is rotated Xl
+	 vext.8		$In,$t1,$t1,#8
+	 vext.8		$IN,$t0,$t0,#8
+	veor		$Xl,$Xm,$t2
+	 vpmull.p64	$Xln,$H,$In		@ H·Ii+1
+	veor		$IN,$IN,$Xh		@ accumulate $IN early
+
+	vext.8		$t2,$Xl,$Xl,#8		@ 2nd phase of reduction
+	vpmull.p64	$Xl,$Xl,$xC2
+	veor		$IN,$IN,$t2
+	 veor		$t1,$t1,$In		@ Karatsuba pre-processing
+	veor		$IN,$IN,$Xl
+	 vpmull2.p64	$Xhn,$H,$In
+	b.hs		.Loop_mod2x_v8		@ there were at least 32 more bytes
+
+	veor		$Xh,$Xh,$t2
+	vext.8		$IN,$t0,$t0,#8		@ re-construct $IN
+	adds		$len,$len,#32		@ re-construct $len
+	veor		$Xl,$Xl,$Xh		@ re-construct $Xl
+	b.eq		.Ldone_v8		@ is $len zero?
+___
+}
+$code.=<<___;
+.Lodd_tail_v8:
+	vext.8		$t2,$Xl,$Xl,#8
+	veor		$IN,$IN,$Xl		@ inp^=Xi
+	veor		$t1,$t0,$t2		@ $t1 is rotated inp^Xi
+
+	vpmull.p64	$Xl,$H,$IN		@ H.lo·Xi.lo
+	veor		$t1,$t1,$IN		@ Karatsuba pre-processing
+	vpmull2.p64	$Xh,$H,$IN		@ H.hi·Xi.hi
+	vpmull.p64	$Xm,$Hhl,$t1		@ (H.lo+H.hi)·(Xi.lo+Xi.hi)
+
+	vext.8		$t1,$Xl,$Xh,#8		@ Karatsuba post-processing
+	veor		$t2,$Xl,$Xh
+	veor		$Xm,$Xm,$t1
+	veor		$Xm,$Xm,$t2
+	vpmull.p64	$t2,$Xl,$xC2		@ 1st phase of reduction
+
+	vmov		$Xh#lo,$Xm#hi		@ Xh|Xm - 256-bit result
+	vmov		$Xm#hi,$Xl#lo		@ Xm is rotated Xl
+	veor		$Xl,$Xm,$t2
+
+	vext.8		$t2,$Xl,$Xl,#8		@ 2nd phase of reduction
+	vpmull.p64	$Xl,$Xl,$xC2
+	veor		$t2,$t2,$Xh
+	veor		$Xl,$Xl,$t2
+
+.Ldone_v8:
+#ifndef __ARMEB__
+	vrev64.8	$Xl,$Xl
+#endif
+	vext.8		$Xl,$Xl,$Xl,#8
+	vst1.64		{$Xl},[$Xi]		@ write out Xi
+
+___
+$code.=<<___		if ($flavour !~ /64/);
+	vldmia		sp!,{d8-d15}		@ 32-bit ABI says so
+___
+$code.=<<___;
+	ret
+.size	gcm_ghash_v8,.-gcm_ghash_v8
+___
+}
+$code.=<<___;
+.asciz  "GHASH for ARMv8, CRYPTOGAMS by <appro\@openssl.org>"
+.align  2
+___
+
+if ($flavour =~ /64/) {			######## 64-bit code
+    sub unvmov {
+	my $arg=shift;
+
+	$arg =~ m/q([0-9]+)#(lo|hi),\s*q([0-9]+)#(lo|hi)/o &&
+	sprintf	"ins	v%d.d[%d],v%d.d[%d]",$1,($2 eq "lo")?0:1,$3,($4 eq "lo")?0:1;
+    }
+    foreach(split("\n",$code)) {
+	s/cclr\s+([wx])([^,]+),\s*([a-z]+)/csel	$1$2,$1zr,$1$2,$3/o	or
+	s/vmov\.i8/movi/o		or	# fix up legacy mnemonics
+	s/vmov\s+(.*)/unvmov($1)/geo	or
+	s/vext\.8/ext/o			or
+	s/vshr\.s/sshr\.s/o		or
+	s/vshr/ushr/o			or
+	s/^(\s+)v/$1/o			or	# strip off v prefix
+	s/\bbx\s+lr\b/ret/o;
+
+	s/\bq([0-9]+)\b/"v".($1<8?$1:$1+8).".16b"/geo;	# old->new registers
+	s/@\s/\/\//o;				# old->new style commentary
+
+	# fix up remaining legacy suffixes
+	s/\.[ui]?8(\s)/$1/o;
+	s/\.[uis]?32//o and s/\.16b/\.4s/go;
+	m/\.p64/o and s/\.16b/\.1q/o;		# 1st pmull argument
+	m/l\.p64/o and s/\.16b/\.1d/go;		# 2nd and 3rd pmull arguments
+	s/\.[uisp]?64//o and s/\.16b/\.2d/go;
+	s/\.[42]([sd])\[([0-3])\]/\.$1\[$2\]/o;
+
+	print $_,"\n";
+    }
+} else {				######## 32-bit code
+    sub unvdup32 {
+	my $arg=shift;
+
+	$arg =~ m/q([0-9]+),\s*q([0-9]+)\[([0-3])\]/o &&
+	sprintf	"vdup.32	q%d,d%d[%d]",$1,2*$2+($3>>1),$3&1;
+    }
+    sub unvpmullp64 {
+	my ($mnemonic,$arg)=@_;
+
+	if ($arg =~ m/q([0-9]+),\s*q([0-9]+),\s*q([0-9]+)/o) {
+	    my $word = 0xf2a00e00|(($1&7)<<13)|(($1&8)<<19)
+				 |(($2&7)<<17)|(($2&8)<<4)
+				 |(($3&7)<<1) |(($3&8)<<2);
+	    $word |= 0x00010001	 if ($mnemonic =~ "2");
+	    # since ARMv7 instructions are always encoded little-endian.
+	    # correct solution is to use .inst directive, but older
+	    # assemblers don't implement it:-(
+	    sprintf ".byte\t0x%02x,0x%02x,0x%02x,0x%02x\t@ %s %s",
+			$word&0xff,($word>>8)&0xff,
+			($word>>16)&0xff,($word>>24)&0xff,
+			$mnemonic,$arg;
+	}
+    }
+
+    foreach(split("\n",$code)) {
+	s/\b[wx]([0-9]+)\b/r$1/go;		# new->old registers
+	s/\bv([0-9])\.[12468]+[bsd]\b/q$1/go;	# new->old registers
+	s/\/\/\s?/@ /o;				# new->old style commentary
+
+	# fix up remaining new-style suffixes
+	s/\],#[0-9]+/]!/o;
+
+	s/cclr\s+([^,]+),\s*([a-z]+)/mov$2	$1,#0/o			or
+	s/vdup\.32\s+(.*)/unvdup32($1)/geo				or
+	s/v?(pmull2?)\.p64\s+(.*)/unvpmullp64($1,$2)/geo		or
+	s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo	or
+	s/^(\s+)b\./$1b/o						or
+	s/^(\s+)ret/$1bx\tlr/o;
+
+	print $_,"\n";
+    }
+}
+
+close STDOUT; # enforce flush
--- a/crypto/blake2s-load-sse2.h
+++ b/crypto/blake2s-load-sse2.h
@ -0,0 +1,60 @@
+/*
+   BLAKE2 reference source code package - optimized C implementations
+
+   Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#ifndef BLAKE2S_LOAD_SSE2_H
+#define BLAKE2S_LOAD_SSE2_H
+
+#define LOAD_MSG_0_1(buf) buf = _mm_set_epi32(m6,m4,m2,m0)
+#define LOAD_MSG_0_2(buf) buf = _mm_set_epi32(m7,m5,m3,m1)
+#define LOAD_MSG_0_3(buf) buf = _mm_set_epi32(m14,m12,m10,m8)
+#define LOAD_MSG_0_4(buf) buf = _mm_set_epi32(m15,m13,m11,m9)
+#define LOAD_MSG_1_1(buf) buf = _mm_set_epi32(m13,m9,m4,m14)
+#define LOAD_MSG_1_2(buf) buf = _mm_set_epi32(m6,m15,m8,m10)
+#define LOAD_MSG_1_3(buf) buf = _mm_set_epi32(m5,m11,m0,m1)
+#define LOAD_MSG_1_4(buf) buf = _mm_set_epi32(m3,m7,m2,m12)
+#define LOAD_MSG_2_1(buf) buf = _mm_set_epi32(m15,m5,m12,m11)
+#define LOAD_MSG_2_2(buf) buf = _mm_set_epi32(m13,m2,m0,m8)
+#define LOAD_MSG_2_3(buf) buf = _mm_set_epi32(m9,m7,m3,m10)
+#define LOAD_MSG_2_4(buf) buf = _mm_set_epi32(m4,m1,m6,m14)
+#define LOAD_MSG_3_1(buf) buf = _mm_set_epi32(m11,m13,m3,m7)
+#define LOAD_MSG_3_2(buf) buf = _mm_set_epi32(m14,m12,m1,m9)
+#define LOAD_MSG_3_3(buf) buf = _mm_set_epi32(m15,m4,m5,m2)
+#define LOAD_MSG_3_4(buf) buf = _mm_set_epi32(m8,m0,m10,m6)
+#define LOAD_MSG_4_1(buf) buf = _mm_set_epi32(m10,m2,m5,m9)
+#define LOAD_MSG_4_2(buf) buf = _mm_set_epi32(m15,m4,m7,m0)
+#define LOAD_MSG_4_3(buf) buf = _mm_set_epi32(m3,m6,m11,m14)
+#define LOAD_MSG_4_4(buf) buf = _mm_set_epi32(m13,m8,m12,m1)
+#define LOAD_MSG_5_1(buf) buf = _mm_set_epi32(m8,m0,m6,m2)
+#define LOAD_MSG_5_2(buf) buf = _mm_set_epi32(m3,m11,m10,m12)
+#define LOAD_MSG_5_3(buf) buf = _mm_set_epi32(m1,m15,m7,m4)
+#define LOAD_MSG_5_4(buf) buf = _mm_set_epi32(m9,m14,m5,m13)
+#define LOAD_MSG_6_1(buf) buf = _mm_set_epi32(m4,m14,m1,m12)
+#define LOAD_MSG_6_2(buf) buf = _mm_set_epi32(m10,m13,m15,m5)
+#define LOAD_MSG_6_3(buf) buf = _mm_set_epi32(m8,m9,m6,m0)
+#define LOAD_MSG_6_4(buf) buf = _mm_set_epi32(m11,m2,m3,m7)
+#define LOAD_MSG_7_1(buf) buf = _mm_set_epi32(m3,m12,m7,m13)
+#define LOAD_MSG_7_2(buf) buf = _mm_set_epi32(m9,m1,m14,m11)
+#define LOAD_MSG_7_3(buf) buf = _mm_set_epi32(m2,m8,m15,m5)
+#define LOAD_MSG_7_4(buf) buf = _mm_set_epi32(m10,m6,m4,m0)
+#define LOAD_MSG_8_1(buf) buf = _mm_set_epi32(m0,m11,m14,m6)
+#define LOAD_MSG_8_2(buf) buf = _mm_set_epi32(m8,m3,m9,m15)
+#define LOAD_MSG_8_3(buf) buf = _mm_set_epi32(m10,m1,m13,m12)
+#define LOAD_MSG_8_4(buf) buf = _mm_set_epi32(m5,m4,m7,m2)
+#define LOAD_MSG_9_1(buf) buf = _mm_set_epi32(m1,m7,m8,m10)
+#define LOAD_MSG_9_2(buf) buf = _mm_set_epi32(m5,m6,m4,m2)
+#define LOAD_MSG_9_3(buf) buf = _mm_set_epi32(m13,m3,m9,m15)
+#define LOAD_MSG_9_4(buf) buf = _mm_set_epi32(m0,m12,m14,m11)
+
+
+#endif
--- a/crypto/blake2s-load-sse41.h
+++ b/crypto/blake2s-load-sse41.h
@ -0,0 +1,229 @@
+/*
+   BLAKE2 reference source code package - optimized C implementations
+
+   Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#ifndef BLAKE2S_LOAD_SSE41_H
+#define BLAKE2S_LOAD_SSE41_H
+
+#define LOAD_MSG_0_1(buf) \
+buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(2,0,2,0)));
+
+#define LOAD_MSG_0_2(buf) \
+buf = TOI(_mm_shuffle_ps(TOF(m0), TOF(m1), _MM_SHUFFLE(3,1,3,1)));
+
+#define LOAD_MSG_0_3(buf) \
+buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(2,0,2,0)));
+
+#define LOAD_MSG_0_4(buf) \
+buf = TOI(_mm_shuffle_ps(TOF(m2), TOF(m3), _MM_SHUFFLE(3,1,3,1)));
+
+#define LOAD_MSG_1_1(buf) \
+t0 = _mm_blend_epi16(m1, m2, 0x0C); \
+t1 = _mm_slli_si128(m3, 4); \
+t2 = _mm_blend_epi16(t0, t1, 0xF0); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,0,3));
+
+#define LOAD_MSG_1_2(buf) \
+t0 = _mm_shuffle_epi32(m2,_MM_SHUFFLE(0,0,2,0)); \
+t1 = _mm_blend_epi16(m1,m3,0xC0); \
+t2 = _mm_blend_epi16(t0, t1, 0xF0); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1));
+
+#define LOAD_MSG_1_3(buf) \
+t0 = _mm_slli_si128(m1, 4); \
+t1 = _mm_blend_epi16(m2, t0, 0x30); \
+t2 = _mm_blend_epi16(m0, t1, 0xF0); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1));
+
+#define LOAD_MSG_1_4(buf) \
+t0 = _mm_unpackhi_epi32(m0,m1); \
+t1 = _mm_slli_si128(m3, 4); \
+t2 = _mm_blend_epi16(t0, t1, 0x0C); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,3,0,1));
+
+#define LOAD_MSG_2_1(buf) \
+t0 = _mm_unpackhi_epi32(m2,m3); \
+t1 = _mm_blend_epi16(m3,m1,0x0C); \
+t2 = _mm_blend_epi16(t0, t1, 0x0F); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2));
+
+#define LOAD_MSG_2_2(buf) \
+t0 = _mm_unpacklo_epi32(m2,m0); \
+t1 = _mm_blend_epi16(t0, m0, 0xF0); \
+t2 = _mm_slli_si128(m3, 8); \
+buf = _mm_blend_epi16(t1, t2, 0xC0);
+
+#define LOAD_MSG_2_3(buf) \
+t0 = _mm_blend_epi16(m0, m2, 0x3C); \
+t1 = _mm_srli_si128(m1, 12); \
+t2 = _mm_blend_epi16(t0,t1,0x03); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,3,2));
+
+#define LOAD_MSG_2_4(buf) \
+t0 = _mm_slli_si128(m3, 4); \
+t1 = _mm_blend_epi16(m0, m1, 0x33); \
+t2 = _mm_blend_epi16(t1, t0, 0xC0); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(0,1,2,3));
+
+#define LOAD_MSG_3_1(buf) \
+t0 = _mm_unpackhi_epi32(m0,m1); \
+t1 = _mm_unpackhi_epi32(t0, m2); \
+t2 = _mm_blend_epi16(t1, m3, 0x0C); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(3,1,0,2));
+
+#define LOAD_MSG_3_2(buf) \
+t0 = _mm_slli_si128(m2, 8); \
+t1 = _mm_blend_epi16(m3,m0,0x0C); \
+t2 = _mm_blend_epi16(t1, t0, 0xC0); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3));
+
+#define LOAD_MSG_3_3(buf) \
+t0 = _mm_blend_epi16(m0,m1,0x0F); \
+t1 = _mm_blend_epi16(t0, m3, 0xC0); \
+buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2));
+
+#define LOAD_MSG_3_4(buf) \
+t0 = _mm_unpacklo_epi32(m0,m2); \
+t1 = _mm_unpackhi_epi32(m1,m2); \
+buf = _mm_unpacklo_epi64(t1,t0);
+
+#define LOAD_MSG_4_1(buf) \
+t0 = _mm_unpacklo_epi64(m1,m2); \
+t1 = _mm_unpackhi_epi64(m0,m2); \
+t2 = _mm_blend_epi16(t0,t1,0x33); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,0,1,3));
+
+#define LOAD_MSG_4_2(buf) \
+t0 = _mm_unpackhi_epi64(m1,m3); \
+t1 = _mm_unpacklo_epi64(m0,m1); \
+buf = _mm_blend_epi16(t0,t1,0x33);
+
+#define LOAD_MSG_4_3(buf) \
+t0 = _mm_unpackhi_epi64(m3,m1); \
+t1 = _mm_unpackhi_epi64(m2,m0); \
+buf = _mm_blend_epi16(t1,t0,0x33);
+
+#define LOAD_MSG_4_4(buf) \
+t0 = _mm_blend_epi16(m0,m2,0x03); \
+t1 = _mm_slli_si128(t0, 8); \
+t2 = _mm_blend_epi16(t1,m3,0x0F); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,0,3));
+
+#define LOAD_MSG_5_1(buf) \
+t0 = _mm_unpackhi_epi32(m0,m1); \
+t1 = _mm_unpacklo_epi32(m0,m2); \
+buf = _mm_unpacklo_epi64(t0,t1);
+
+#define LOAD_MSG_5_2(buf) \
+t0 = _mm_srli_si128(m2, 4); \
+t1 = _mm_blend_epi16(m0,m3,0x03); \
+buf = _mm_blend_epi16(t1,t0,0x3C);
+
+#define LOAD_MSG_5_3(buf) \
+t0 = _mm_blend_epi16(m1,m0,0x0C); \
+t1 = _mm_srli_si128(m3, 4); \
+t2 = _mm_blend_epi16(t0,t1,0x30); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,2,3,0));
+
+#define LOAD_MSG_5_4(buf) \
+t0 = _mm_unpacklo_epi64(m1,m2); \
+t1 = _mm_shuffle_epi32(m3, _MM_SHUFFLE(0,2,0,1)); \
+buf = _mm_blend_epi16(t0,t1,0x33);
+
+#define LOAD_MSG_6_1(buf) \
+t0 = _mm_slli_si128(m1, 12); \
+t1 = _mm_blend_epi16(m0,m3,0x33); \
+buf = _mm_blend_epi16(t1,t0,0xC0);
+
+#define LOAD_MSG_6_2(buf) \
+t0 = _mm_blend_epi16(m3,m2,0x30); \
+t1 = _mm_srli_si128(m1, 4); \
+t2 = _mm_blend_epi16(t0,t1,0x03); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(2,1,3,0));
+
+#define LOAD_MSG_6_3(buf) \
+t0 = _mm_unpacklo_epi64(m0,m2); \
+t1 = _mm_srli_si128(m1, 4); \
+buf = _mm_shuffle_epi32(_mm_blend_epi16(t0,t1,0x0C), _MM_SHUFFLE(2,3,1,0));
+
+#define LOAD_MSG_6_4(buf) \
+t0 = _mm_unpackhi_epi32(m1,m2); \
+t1 = _mm_unpackhi_epi64(m0,t0); \
+buf = _mm_shuffle_epi32(t1, _MM_SHUFFLE(3,0,1,2));
+
+#define LOAD_MSG_7_1(buf) \
+t0 = _mm_unpackhi_epi32(m0,m1); \
+t1 = _mm_blend_epi16(t0,m3,0x0F); \
+buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(2,0,3,1));
+
+#define LOAD_MSG_7_2(buf) \
+t0 = _mm_blend_epi16(m2,m3,0x30); \
+t1 = _mm_srli_si128(m0,4); \
+t2 = _mm_blend_epi16(t0,t1,0x03); \
+buf = _mm_shuffle_epi32(t2, _MM_SHUFFLE(1,0,2,3));
+
+#define LOAD_MSG_7_3(buf) \
+t0 = _mm_unpackhi_epi64(m0,m3); \
+t1 = _mm_unpacklo_epi64(m1,m2); \
+t2 = _mm_blend_epi16(t0,t1,0x3C); \
+buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,2,3,1));
+
+#define LOAD_MSG_7_4(buf) \
+t0 = _mm_unpacklo_epi32(m0,m1); \
+t1 = _mm_unpackhi_epi32(m1,m2); \
+buf = _mm_unpacklo_epi64(t0,t1);
+
+#define LOAD_MSG_8_1(buf) \
+t0 = _mm_unpackhi_epi32(m1,m3); \
+t1 = _mm_unpacklo_epi64(t0,m0); \
+t2 = _mm_blend_epi16(t1,m2,0xC0); \
+buf = _mm_shufflehi_epi16(t2,_MM_SHUFFLE(1,0,3,2));
+
+#define LOAD_MSG_8_2(buf) \
+t0 = _mm_unpackhi_epi32(m0,m3); \
+t1 = _mm_blend_epi16(m2,t0,0xF0); \
+buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(0,2,1,3));
+
+#define LOAD_MSG_8_3(buf) \
+t0 = _mm_blend_epi16(m2,m0,0x0C); \
+t1 = _mm_slli_si128(t0,4); \
+buf = _mm_blend_epi16(t1,m3,0x0F);
+
+#define LOAD_MSG_8_4(buf) \
+t0 = _mm_blend_epi16(m1,m0,0x30); \
+buf = _mm_shuffle_epi32(t0,_MM_SHUFFLE(1,0,3,2));
+
+#define LOAD_MSG_9_1(buf) \
+t0 = _mm_blend_epi16(m0,m2,0x03); \
+t1 = _mm_blend_epi16(m1,m2,0x30); \
+t2 = _mm_blend_epi16(t1,t0,0x0F); \
+buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(1,3,0,2));
+
+#define LOAD_MSG_9_2(buf) \
+t0 = _mm_slli_si128(m0,4); \
+t1 = _mm_blend_epi16(m1,t0,0xC0); \
+buf = _mm_shuffle_epi32(t1,_MM_SHUFFLE(1,2,0,3));
+
+#define LOAD_MSG_9_3(buf) \
+t0 = _mm_unpackhi_epi32(m0,m3); \
+t1 = _mm_unpacklo_epi32(m2,m3); \
+t2 = _mm_unpackhi_epi64(t0,t1); \
+buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(3,0,2,1));
+
+#define LOAD_MSG_9_4(buf) \
+t0 = _mm_blend_epi16(m3,m2,0xC0); \
+t1 = _mm_unpacklo_epi32(m0,m3); \
+t2 = _mm_blend_epi16(t0,t1,0x0F); \
+buf = _mm_shuffle_epi32(t2,_MM_SHUFFLE(0,1,2,3));
+
+#endif
--- a/crypto/blake2s-load-xop.h
+++ b/crypto/blake2s-load-xop.h
@ -0,0 +1,191 @@
+/*
+   BLAKE2 reference source code package - optimized C implementations
+
+   Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#ifndef BLAKE2S_LOAD_XOP_H
+#define BLAKE2S_LOAD_XOP_H
+
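+/* TOB(x) expands to the four byte selectors for 32-bit word x of the 32-byte
+   pair formed by the two permute sources, e.g. TOB(2) == 0x0B0A0908.
+   Words 0..3 come from the first source, words 4..7 from the second. */
+#define TOB(x) ((x)*4*0x01010101 + 0x03020100) /* ..or not TOB */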
+
+#if 0
+/* Basic VPPERM emulation, for testing purposes */
+static __m128i _mm_perm_epi8(const __m128i src1, const __m128i src2, const __m128i sel)
+{
+   const __m128i sixteen = _mm_set1_epi8(16);
+   const __m128i t0 = _mm_shuffle_epi8(src1, sel);
+   const __m128i s1 = _mm_shuffle_epi8(src2, _mm_sub_epi8(sel, sixteen));
+   const __m128i mask = _mm_or_si128(_mm_cmpeq_epi8(sel, sixteen),
+                                     _mm_cmpgt_epi8(sel, sixteen)); /* (>=16) = 0xff : 00 */
+   return _mm_blendv_epi8(t0, s1, mask);
+}
+#endif
+
+#define LOAD_MSG_0_1(buf) \
+buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) );
+
+#define LOAD_MSG_0_2(buf) \
+buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) );
+
+#define LOAD_MSG_0_3(buf) \
+buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(2),TOB(0)) );
+
+#define LOAD_MSG_0_4(buf) \
+buf = _mm_perm_epi8(m2, m3, _mm_set_epi32(TOB(7),TOB(5),TOB(3),TOB(1)) );
+
+#define LOAD_MSG_1_1(buf) \
+t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(5),TOB(0),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) );
+
+#define LOAD_MSG_1_2(buf) \
+t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(2),TOB(0),TOB(4),TOB(6)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_1_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(0),TOB(0),TOB(1)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_1_4(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(7),TOB(2),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) );
+
+#define LOAD_MSG_2_1(buf) \
+t0 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(0),TOB(1),TOB(0),TOB(7)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(4),TOB(0)) );
+
+#define LOAD_MSG_2_2(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(2),TOB(0),TOB(4)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_2_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(7),TOB(3),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(6)) );
+
+#define LOAD_MSG_2_4(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(1),TOB(6),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) );
+
+#define LOAD_MSG_3_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(3),TOB(7)) ); \
+t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_3_2(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(1),TOB(5)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(6),TOB(4),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_3_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(5),TOB(2)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_3_4(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \
+buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(6),TOB(0)) );
+
+#define LOAD_MSG_4_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(5),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(5)) );
+
+#define LOAD_MSG_4_2(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(4),TOB(7),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_4_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(6),TOB(0),TOB(0)) ); \
+t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(6)) );
+
+#define LOAD_MSG_4_4(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(4),TOB(0),TOB(1)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(4),TOB(0)) );
+
+#define LOAD_MSG_5_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(2)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_5_2(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(6),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(4)) );
+
+#define LOAD_MSG_5_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(0),TOB(7),TOB(4)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_5_4(buf) \
+t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(5),TOB(0),TOB(1),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(5)) );
+
+#define LOAD_MSG_6_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(4),TOB(0),TOB(1),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(6),TOB(1),TOB(4)) );
+
+#define LOAD_MSG_6_2(buf) \
+t1 = _mm_perm_epi8(m1, m2, _mm_set_epi32(TOB(6),TOB(0),TOB(0),TOB(1)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(5),TOB(7),TOB(0)) );
+
+#define LOAD_MSG_6_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(6),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(4),TOB(5),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_6_4(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(2),TOB(3),TOB(7)) ); \
+buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(7),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_7_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(3),TOB(0),TOB(7),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(5)) );
+
+#define LOAD_MSG_7_2(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(5),TOB(1),TOB(0),TOB(7)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) );
+
+#define LOAD_MSG_7_3(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(2),TOB(0),TOB(0),TOB(5)) ); \
+t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(4),TOB(1),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(7),TOB(0)) );
+
+#define LOAD_MSG_7_4(buf) \
+t1 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(6),TOB(4),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m2, _mm_set_epi32(TOB(6),TOB(2),TOB(1),TOB(0)) );
+
+#define LOAD_MSG_8_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(6)) ); \
+t0 = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(7),TOB(1),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(6),TOB(0)) );
+
+#define LOAD_MSG_8_2(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(4),TOB(3),TOB(5),TOB(0)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(1),TOB(7)) );
+
+#define LOAD_MSG_8_3(buf) \
+t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(6),TOB(1),TOB(0),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(3),TOB(2),TOB(5),TOB(4)) );
+
+#define LOAD_MSG_8_4(buf) \
+buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(4),TOB(7),TOB(2)) );
+
+#define LOAD_MSG_9_1(buf) \
+t0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(1),TOB(7),TOB(0),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m2, _mm_set_epi32(TOB(3),TOB(2),TOB(4),TOB(6)) );
+
+#define LOAD_MSG_9_2(buf) \
+buf = _mm_perm_epi8(m0, m1, _mm_set_epi32(TOB(5),TOB(6),TOB(4),TOB(2)) );
+
+#define LOAD_MSG_9_3(buf) \
+t0 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(3),TOB(5),TOB(0)) ); \
+buf = _mm_perm_epi8(t0, m3, _mm_set_epi32(TOB(5),TOB(2),TOB(1),TOB(7)) );
+
+#define LOAD_MSG_9_4(buf) \
+t1 = _mm_perm_epi8(m0, m2, _mm_set_epi32(TOB(0),TOB(0),TOB(0),TOB(7)) ); \
+buf = _mm_perm_epi8(t1, m3, _mm_set_epi32(TOB(3),TOB(4),TOB(6),TOB(0)) );
+
+#endif
--- a/crypto/blake2s-round.h
+++ b/crypto/blake2s-round.h
@ -0,0 +1,88 @@
+/*
+   BLAKE2 reference source code package - optimized C implementations
+
+   Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#ifndef BLAKE2S_ROUND_H
+#define BLAKE2S_ROUND_H
+
+#define LOADU(p)  _mm_loadu_si128( (const __m128i *)(p) )
+#define STOREU(p,r) _mm_storeu_si128((__m128i *)(p), r)
+
+#define TOF(reg) _mm_castsi128_ps((reg))
+#define TOI(reg) _mm_castps_si128((reg))
+
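+/* __builtin_expect is a GCC/Clang builtin; fall back to a plain expression elsewhere. */
+#if defined(__GNUC__)
+#define LIKELY(x) __builtin_expect((x),1)
+#else
+#define LIKELY(x) (x)
+#endif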
+
+
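+/* Microarchitecture-specific macros */
+/* _mm_roti_epi32(r, c) follows the XOP convention: a negative c rotates each
+   32-bit lane right by -c bits. The SSSE3 variant expects the byte-shuffle
+   masks r8 and r16 to be in scope wherever the macro is expanded. */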
+#ifndef HAVE_XOP
+#ifdef HAVE_SSSE3
+#define _mm_roti_epi32(r, c) ( \
+                (8==-(c)) ? _mm_shuffle_epi8(r,r8) \
+              : (16==-(c)) ? _mm_shuffle_epi8(r,r16) \
+              : _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) )) )
+#else
+#define _mm_roti_epi32(r, c) _mm_xor_si128(_mm_srli_epi32( (r), -(c) ),_mm_slli_epi32( (r), 32-(-(c)) ))
+#endif
+#else
+/* ... */
+#endif
+
+
+#define G1(row1,row2,row3,row4,buf) \
+  row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \
+  row4 = _mm_xor_si128( row4, row1 ); \
+  row4 = _mm_roti_epi32(row4, -16); \
+  row3 = _mm_add_epi32( row3, row4 );   \
+  row2 = _mm_xor_si128( row2, row3 ); \
+  row2 = _mm_roti_epi32(row2, -12);
+
+#define G2(row1,row2,row3,row4,buf) \
+  row1 = _mm_add_epi32( _mm_add_epi32( row1, buf), row2 ); \
+  row4 = _mm_xor_si128( row4, row1 ); \
+  row4 = _mm_roti_epi32(row4, -8); \
+  row3 = _mm_add_epi32( row3, row4 );   \
+  row2 = _mm_xor_si128( row2, row3 ); \
+  row2 = _mm_roti_epi32(row2, -7);
+
+#define DIAGONALIZE(row1,row2,row3,row4) \
+  row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(2,1,0,3) ); \
+  row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \
+  row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(0,3,2,1) );
+
+#define UNDIAGONALIZE(row1,row2,row3,row4) \
+  row4 = _mm_shuffle_epi32( row4, _MM_SHUFFLE(0,3,2,1) ); \
+  row3 = _mm_shuffle_epi32( row3, _MM_SHUFFLE(1,0,3,2) ); \
+  row2 = _mm_shuffle_epi32( row2, _MM_SHUFFLE(2,1,0,3) );
+
+#if defined(HAVE_XOP)
+#include "blake2s-load-xop.h"
+#elif defined(HAVE_SSE41)
+#include "blake2s-load-sse41.h"
+#else
+#include "blake2s-load-sse2.h"
+#endif
+
+#define ROUND(r)  \
+  LOAD_MSG_ ##r ##_1(buf1); \
+  G1(row1,row2,row3,row4,buf1); \
+  LOAD_MSG_ ##r ##_2(buf2); \
+  G2(row1,row2,row3,row4,buf2); \
+  DIAGONALIZE(row1,row2,row3,row4); \
+  LOAD_MSG_ ##r ##_3(buf3); \
+  G1(row1,row2,row3,row4,buf3); \
+  LOAD_MSG_ ##r ##_4(buf4); \
+  G2(row1,row2,row3,row4,buf4); \
+  UNDIAGONALIZE(row1,row2,row3,row4);
+
+#endif
--- a/crypto/blake2s.cpp
+++ b/crypto/blake2s.cpp
@ -0,0 +1,446 @@
+/*
+BLAKE2 reference source code package - reference C implementations
+
+Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+your option.  The terms of these licenses can be found at:
+
+- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+- OpenSSL license   : https://www.openssl.org/source/license.html
+- Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+More information about the BLAKE2 hash function can be found at
+https://blake2.net.
+*/
+
+#include "stdafx.h"
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include <assert.h>
+#include "tunsafe_types.h"
+#include "blake2s.h"
+#include "crypto_ops.h"
+
+void blake2s_compress_sse(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]);
+
+#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
+#if   defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+#elif defined(__GNUC__)
+#define BLAKE2_INLINE __inline__
+#else
+#define BLAKE2_INLINE
+#endif
+#else
+#define BLAKE2_INLINE inline
+#endif
+
+static BLAKE2_INLINE uint32_t load32(const void *src) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  uint32_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = (const uint8_t *)src;
+  return ((uint32_t)(p[0]) << 0) |
+    ((uint32_t)(p[1]) << 8) |
+    ((uint32_t)(p[2]) << 16) |
+    ((uint32_t)(p[3]) << 24);
+#endif
+}
+
+static BLAKE2_INLINE uint16_t load16(const void *src) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  uint16_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = (const uint8_t *)src;
+  return ((uint16_t)(p[0]) << 0) |
+    ((uint16_t)(p[1]) << 8);
+#endif
+}
+
+static BLAKE2_INLINE void store16(void *dst, uint16_t w) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = (uint8_t *)dst;
+  *p++ = (uint8_t)w; w >>= 8;
+  *p++ = (uint8_t)w;
+#endif
+}
+
+static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = (uint8_t *)dst;
+  p[0] = (uint8_t)(w >> 0);
+  p[1] = (uint8_t)(w >> 8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+#endif
+}
+
+static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
+  return (w >> c) | (w << (32 - c));
+}
+
+static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
+  return (w >> c) | (w << (64 - c));
+}
+
+static const uint32_t blake2s_IV[8] = {
+  0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
+  0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
+};
+
+static const uint8_t blake2s_sigma[10][16] =
+{
+  {0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15} ,
+  {14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3} ,
+  {11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4} ,
+  {7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8} ,
+  {9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13} ,
+  {2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9} ,
+  {12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11} ,
+  {13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10} ,
+  {6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5} ,
+  {10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0} ,
+};
+
+static void blake2s_set_lastnode(blake2s_state *S) {
+  S->f[1] = (uint32_t)-1;
+}
+
+/* Some helper functions, not necessarily useful */
+static int blake2s_is_lastblock(const blake2s_state *S) {
+  return S->f[0] != 0;
+}
+
+static void blake2s_set_lastblock(blake2s_state *S) {
+  if (S->last_node) blake2s_set_lastnode(S);
+
+  S->f[0] = (uint32_t)-1;
+}
+
+static void blake2s_increment_counter(blake2s_state *S, const uint32_t inc) {
+  S->t[0] += inc;
+  S->t[1] += (S->t[0] < inc);
+}
+
+void blake2s_init_with_len(blake2s_state *S, size_t outlen, size_t keylen) {
+  memset(S, 0, sizeof(blake2s_state));
+
+  blake2s_param *P = &S->param;
+  size_t i;
+
+  /* Move interval verification here? */
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  S->outlen = (uint8_t)outlen;
+  P->key_length = (uint8_t)keylen;
+  P->fanout = 1;
+  P->depth = 1;
+  //  store32(&P.leaf_length, 0);
+  //  store32(&P.node_offset, 0);
+  //  store16(&P.xof_length, 0);
+  //  P.node_depth = 0;
+  //  P.inner_length = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  //  memset(P.salt, 0, sizeof(P.salt));
+  //  memset(P.personal, 0, sizeof(P.personal));
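+  /* param and h share storage (see the union in blake2s_state), so this XORs
+     the parameter block written above with the IV. */
+  for (i = 0; i < 8; ++i)
+    S->h[i] = load32(&S->h[i]) ^ blake2s_IV[i];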
+
+}
+
+/* Sequential blake2s initialization */
+void blake2s_init(blake2s_state *S, size_t outlen) {
+  blake2s_init_with_len(S, outlen, 0);
+}
+
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen) {
+  uint8_t block[BLAKE2S_BLOCKBYTES];
+
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+  assert(key && keylen && keylen <= BLAKE2S_KEYBYTES);
+
+  blake2s_init_with_len(S, outlen, keylen);
+
+  memset(block, 0, BLAKE2S_BLOCKBYTES);
+  memcpy(block, key, keylen);
+  blake2s_update(S, block, BLAKE2S_BLOCKBYTES);
+  memzero_crypto(block, BLAKE2S_BLOCKBYTES); /* Burn the key from stack */
+}
+
+#define G(r,i,a,b,c,d)                      \
+  do {                                      \
+    a = a + b + m[blake2s_sigma[r][2*i+0]]; \
+    d = rotr32(d ^ a, 16);                  \
+    c = c + d;                              \
+    b = rotr32(b ^ c, 12);                  \
+    a = a + b + m[blake2s_sigma[r][2*i+1]]; \
+    d = rotr32(d ^ a, 8);                   \
+    c = c + d;                              \
+    b = rotr32(b ^ c, 7);                   \
+  } while(0)
+
+#define ROUND(r)                    \
+  do {                              \
+    G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
+    G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
+    G(r,2,v[ 2],v[ 6],v[10],v[14]); \
+    G(r,3,v[ 3],v[ 7],v[11],v[15]); \
+    G(r,4,v[ 0],v[ 5],v[10],v[15]); \
+    G(r,5,v[ 1],v[ 6],v[11],v[12]); \
+    G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
+    G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
+  } while(0)
+
+static void blake2s_compress(blake2s_state *S, const uint8_t in[BLAKE2S_BLOCKBYTES]) {
+  uint32_t m[16];
+  uint32_t v[16];
+  size_t i;
+
+  for (i = 0; i < 16; ++i) {
+    m[i] = load32(in + i * sizeof(m[i]));
+  }
+
+  for (i = 0; i < 8; ++i) {
+    v[i] = S->h[i];
+  }
+
+  v[8] = blake2s_IV[0];
+  v[9] = blake2s_IV[1];
+  v[10] = blake2s_IV[2];
+  v[11] = blake2s_IV[3];
+  v[12] = S->t[0] ^ blake2s_IV[4];
+  v[13] = S->t[1] ^ blake2s_IV[5];
+  v[14] = S->f[0] ^ blake2s_IV[6];
+  v[15] = S->f[1] ^ blake2s_IV[7];
+
+  ROUND(0);
+  ROUND(1);
+  ROUND(2);
+  ROUND(3);
+  ROUND(4);
+  ROUND(5);
+  ROUND(6);
+  ROUND(7);
+  ROUND(8);
+  ROUND(9);
+
+  for (i = 0; i < 8; ++i) {
+    S->h[i] = S->h[i] ^ v[i] ^ v[i + 8];
+  }
+}
+
+#undef G
+#undef ROUND
+
+static inline void blake2s_compress_impl(blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]) {
+#if defined(ARCH_CPU_X86_64)
+  blake2s_compress_sse(S, block);
+#else
+  blake2s_compress(S, block);
+#endif
+}
+
+void blake2s_update(blake2s_state *S, const void *pin, size_t inlen) {
+  const unsigned char * in = (const unsigned char *)pin;
+  if (inlen > 0) {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
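+    /* Strictly greater: input that exactly fills the buffer stays buffered so
+       blake2s_final can compress it with the last-block flag set. */
+    if (inlen > fill) {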
+      S->buflen = 0;
+      memcpy(S->buf + left, in, fill); /* Fill buffer */
+      blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+      blake2s_compress_impl(S, S->buf); /* Compress */
+      in += fill; inlen -= fill;
+      while (inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress_impl(S, in);
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy(S->buf + S->buflen, in, inlen);
+    S->buflen += inlen;
+  }
+}
+
+void blake2s_final(blake2s_state *S, void *out, size_t outlen) {
+  size_t i;
+
+  assert(out != NULL && outlen >= S->outlen);
+  assert(!blake2s_is_lastblock(S));
+
+  blake2s_increment_counter(S, (uint32_t)S->buflen);
+  blake2s_set_lastblock(S);
+  memset(S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen); /* Padding */
+  blake2s_compress_impl(S, S->buf);
+
+  for (i = 0; i < 8; ++i) /* Serialize h to little-endian bytes in place */
+    store32(&S->h[i], S->h[i]);
+
+  memcpy(out, S->h, outlen);
+}
+
+SAFEBUFFERS void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen) {
+  blake2s_state S;
+
+  /* Verify parameters */
+  assert(!((NULL == in && inlen > 0)));
+  assert(out);
+  assert(!(NULL == key && keylen > 0));
+  assert(!(!outlen || outlen > BLAKE2S_OUTBYTES));
+  assert(!(keylen > BLAKE2S_KEYBYTES));
+
+  if (keylen > 0) {
+    blake2s_init_key(&S, outlen, key, keylen);
+  } else {
+    blake2s_init(&S, outlen);
+  }
+  blake2s_update(&S, (const uint8_t *)in, inlen);
+  blake2s_final(&S, out, outlen);
+}
+
+SAFEBUFFERS void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen) {
+  blake2s_state b2s;
+  uint64_t temp[BLAKE2S_OUTBYTES / 8];
+  uint64_t key_temp[BLAKE2S_BLOCKBYTES / 8] = { 0 };
+
+  if (keylen > BLAKE2S_BLOCKBYTES) {
+    blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+    blake2s_update(&b2s, key, keylen);
+    blake2s_final(&b2s, key_temp, BLAKE2S_OUTBYTES);
+  } else {
+    memcpy(key_temp, key, keylen);
+  }
+
+  /* Inner hash: H((K ^ ipad) || message) */
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, in, inlen);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  /* Outer hash: XORing with (opad ^ ipad) turns K ^ ipad into K ^ opad */
+  for (size_t i = 0; i < BLAKE2S_BLOCKBYTES / 8; i++)
+    key_temp[i] ^= 0x5c5c5c5c5c5c5c5cull ^ 0x3636363636363636ull;
+
+  blake2s_init(&b2s, BLAKE2S_OUTBYTES);
+  blake2s_update(&b2s, key_temp, BLAKE2S_BLOCKBYTES);
+  blake2s_update(&b2s, temp, BLAKE2S_OUTBYTES);
+  blake2s_final(&b2s, temp, BLAKE2S_OUTBYTES);
+
+  memcpy(out, temp, outlen);
+  memzero_crypto(key_temp, sizeof(key_temp));
+  memzero_crypto(temp, sizeof(temp));
+}
+
+SAFEBUFFERS
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size) {
+  struct {
+    uint8 prk[BLAKE2S_OUTBYTES];
+    uint8 temp[BLAKE2S_OUTBYTES + 1];
+  } t;
+  // Extract step: secret (t.prk) = HMAC(key, data)
+  blake2s_hmac(t.prk, BLAKE2S_OUTBYTES, data, data_size, key, key_size);
+  // first-key = HMAC(secret, 0x1)
+  t.temp[0] = 0x1;
+  blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, 1, t.prk, BLAKE2S_OUTBYTES);
+  memcpy(dst1, t.temp, dst1_size);
+  if (dst2 != NULL) {
+    // second-key = HMAC(secret, first-key || 0x2)
+    t.temp[BLAKE2S_OUTBYTES] = 0x2;
+    blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk,  BLAKE2S_OUTBYTES);
+    memcpy(dst2, t.temp, dst2_size);
+    if (dst3 != NULL) {
+      // third-key = HMAC(secret, second-key || 0x3)
+      t.temp[BLAKE2S_OUTBYTES] = 0x3;
+      blake2s_hmac(t.temp, BLAKE2S_OUTBYTES, t.temp, BLAKE2S_OUTBYTES + 1, t.prk, BLAKE2S_OUTBYTES);
+      memcpy(dst3, t.temp, dst3_size);
+    }
+  }
+  memzero_crypto(&t, sizeof(t));
+}
+
+
+#if defined(SUPERCOP)
+int crypto_hash(unsigned char *out, unsigned char *in, unsigned long long inlen) {
+  /* blake2s() returns void in this port, so report success explicitly. */
+  blake2s(out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0);
+  return 0;
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main(void) {
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for (i = 0; i < BLAKE2S_KEYBYTES; ++i)
+    key[i] = (uint8_t)i;
+
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i)
+    buf[i] = (uint8_t)i;
+
+  /* Test simple API */
+  for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s(hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES);
+
+    if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API */
+  for (step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+
+      /* The streaming functions return void in this port, so there is no
+         error code to check. */
+      blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES);
+
+      while (mlen >= step) {
+        blake2s_update(&S, p, step);
+        mlen -= step;
+        p += step;
+      }
+      blake2s_update(&S, p, mlen);
+      blake2s_final(&S, hash, BLAKE2S_OUTBYTES);
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts("ok");
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
--- a/crypto/blake2s.h
+++ b/crypto/blake2s.h
@ -0,0 +1,100 @@
+/*
+BLAKE2 reference source code package - reference C implementations
+
+Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+your option.  The terms of these licenses can be found at:
+
+- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+- OpenSSL license   : https://www.openssl.org/source/license.html
+- Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+More information about the BLAKE2 hash function can be found at
+https://blake2.net.
+*/
+#ifndef BLAKE2_H
+#define BLAKE2_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include "tunsafe_types.h"
+#if defined(_MSC_VER)
+#define BLAKE2_PACKED(x) __pragma(pack(push, 1)) x __pragma(pack(pop))
+#else
+#define BLAKE2_PACKED(x) x __attribute__((packed))
+#endif
+
+#if defined(__cplusplus)
+//extern "C" {
+#endif
+
+enum blake2s_constant {
+  BLAKE2S_BLOCKBYTES = 64,
+  BLAKE2S_OUTBYTES = 32,
+  BLAKE2S_KEYBYTES = 32,
+  BLAKE2S_SALTBYTES = 8,
+  BLAKE2S_PERSONALBYTES = 8
+};
+
+BLAKE2_PACKED(struct blake2s_param__ {
+  uint8_t  digest_length; /* 1 */
+  uint8_t  key_length;    /* 2 */
+  uint8_t  fanout;        /* 3 */
+  uint8_t  depth;         /* 4 */
+  uint32_t leaf_length;   /* 8 */
+  uint32_t node_offset;  /* 12 */
+  uint16_t xof_length;    /* 14 */
+  uint8_t  node_depth;    /* 15 */
+  uint8_t  inner_length;  /* 16 */
+                          /* uint8_t  reserved[0]; */
+  uint32_t  salt[BLAKE2S_SALTBYTES / 4]; /* 24 */
+  uint32_t  personal[BLAKE2S_PERSONALBYTES / 4];  /* 32 */
+});
+
+
+typedef struct blake2s_param__ blake2s_param;
+
+/* Padded structs result in a compile-time error */
+enum {
+  BLAKE2_DUMMY_1 = 1 / (sizeof(blake2s_param) == BLAKE2S_OUTBYTES ? 1 : 0),
+};
+
+
+typedef struct blake2s_state__ {
+  union {
+    uint32_t h[8];
+    blake2s_param param;
+  };
+  uint32_t t[2];
+  uint32_t f[2];
+  uint8_t  buf[BLAKE2S_BLOCKBYTES];
+  size_t   buflen;
+  size_t   outlen;
+  uint8_t  last_node;
+} blake2s_state;
+
+
+
+/* Streaming API */
+void blake2s_init(blake2s_state *S, size_t outlen);
+void blake2s_init_key(blake2s_state *S, size_t outlen, const void *key, size_t keylen);
+void blake2s_init_param(blake2s_state *S, const blake2s_param *P);
+void blake2s_update(blake2s_state *S, const void *in, size_t inlen);
+void blake2s_final(blake2s_state *S, void *out, size_t outlen);
+
+/* Simple API */
+void blake2s(void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen);
+
+void blake2s_hmac(uint8_t *out, size_t outlen, const uint8_t *in, size_t inlen, const uint8_t *key, size_t keylen);
+
+void blake2s_hkdf(uint8 *dst1, size_t dst1_size,
+                  uint8 *dst2, size_t dst2_size,
+                  uint8 *dst3, size_t dst3_size,
+                  const uint8 *data, size_t data_size,
+                  const uint8 *key, size_t key_size);
+
+#if defined(__cplusplus)
+//}
+#endif
+
+#endif
--- a/crypto/blake2s_sse.cpp
+++ b/crypto/blake2s_sse.cpp
@ -0,0 +1,399 @@
+/*
+   BLAKE2 reference source code package - optimized C implementations
+
+   Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#include "stdafx.h"
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "blake2s.h"
+#include "crypto_ops.h"
+
+#include <emmintrin.h>
+#if defined(HAVE_SSSE3)
+#include <tmmintrin.h>
+#endif
+#if defined(HAVE_SSE41)
+#include <smmintrin.h>
+#endif
+#if defined(HAVE_AVX)
+#include <immintrin.h>
+#endif
+#if defined(HAVE_XOP)
+#include <x86intrin.h>
+#endif
+
+#include "blake2s-round.h"
+
+#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
+#if   defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+#elif defined(__GNUC__)
+#define BLAKE2_INLINE __inline__
+#else
+#define BLAKE2_INLINE
+#endif
+#else
+#define BLAKE2_INLINE inline
+#endif
+
+static BLAKE2_INLINE uint32_t load32(const void *src) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  uint32_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = (const uint8_t *)src;
+  return ((uint32_t)(p[0]) << 0) |
+    ((uint32_t)(p[1]) << 8) |
+    ((uint32_t)(p[2]) << 16) |
+    ((uint32_t)(p[3]) << 24);
+#endif
+}
+
+static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
+#if defined(ARCH_CPU_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = (uint8_t *)dst;
+  p[0] = (uint8_t)(w >> 0);
+  p[1] = (uint8_t)(w >> 8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+#endif
+}
+
+
+
+static const uint32_t blake2s_IV[8] =
+{
+  0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
+  0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
+};
+
+/* Some helper functions */
+static void blake2s_set_lastnode( blake2s_state *S )
+{
+  S->f[1] = (uint32_t)-1;
+}
+
+static int blake2s_is_lastblock( const blake2s_state *S )
+{
+  return S->f[0] != 0;
+}
+
+static void blake2s_set_lastblock( blake2s_state *S )
+{
+  if( S->last_node ) blake2s_set_lastnode( S );
+
+  S->f[0] = (uint32_t)-1;
+}
+
+static void blake2s_increment_counter( blake2s_state *S, const uint32_t inc )
+{
+  uint64_t t = ( ( uint64_t )S->t[1] << 32 ) | S->t[0];
+  t += inc;
+  S->t[0] = ( uint32_t )( t >>  0 );
+  S->t[1] = ( uint32_t )( t >> 32 );
+}
+
+/* blake2s_init_param XORs the IV with the input parameter block */
+#if 0
+int blake2s_init_param( blake2s_state *S, const blake2s_param *P )
+{
+  size_t i;
+  /*blake2s_init0( S ); */
+  const uint8_t * v = ( const uint8_t * )( blake2s_IV );
+  const uint8_t * p = ( const uint8_t * )( P );
+  uint8_t * h = ( uint8_t * )( S->h );
+  /* IV XOR ParamBlock */
+  memset( S, 0, sizeof( blake2s_state ) );
+
+  for( i = 0; i < BLAKE2S_OUTBYTES; ++i ) h[i] = v[i] ^ p[i];
+
+  S->outlen = P->digest_length;
+  return 0; /* blake2s_init_key tests this result with "< 0" */
+}
+
+/* Some sort of default parameter block initialization, for sequential blake2s */
+int blake2s_init( blake2s_state *S, size_t outlen )
+{
+  blake2s_param P[1];
+  assert(outlen && outlen <= BLAKE2S_OUTBYTES);
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length    = 0;
+  P->fanout        = 1;
+  P->depth         = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth    = 0;
+  P->inner_length  = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt,     0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  blake2s_init_param( S, P );
+  return 0; /* blake2s() tests this result with "< 0" */
+}
+
+int blake2s_init_key( blake2s_state *S, size_t outlen, const void *key, size_t keylen )
+{
+  blake2s_param P[1];
+
+  /* Move interval verification here? */
+  if ( ( !outlen ) || ( outlen > BLAKE2S_OUTBYTES ) ) return -1;
+
+  if ( ( !key ) || ( !keylen ) || keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  P->digest_length = (uint8_t)outlen;
+  P->key_length    = (uint8_t)keylen;
+  P->fanout        = 1;
+  P->depth         = 1;
+  store32( &P->leaf_length, 0 );
+  store32( &P->node_offset, 0 );
+  store16( &P->xof_length, 0 );
+  P->node_depth    = 0;
+  P->inner_length  = 0;
+  /* memset(P->reserved, 0, sizeof(P->reserved) ); */
+  memset( P->salt,     0, sizeof( P->salt ) );
+  memset( P->personal, 0, sizeof( P->personal ) );
+
+  if( blake2s_init_param( S, P ) < 0 )
+    return -1;
+
+  {
+    uint8_t block[BLAKE2S_BLOCKBYTES];
+    memset( block, 0, BLAKE2S_BLOCKBYTES );
+    memcpy( block, key, keylen );
+    blake2s_update( S, block, BLAKE2S_BLOCKBYTES );
+    memzero_crypto( block, BLAKE2S_BLOCKBYTES ); /* Burn the key from stack */
+  }
+  return 0;
+}
+#endif
+
+
+void blake2s_compress_sse( blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES] )
+{
+  __m128i row1, row2, row3, row4;
+  __m128i buf1, buf2, buf3, buf4;
+#if defined(HAVE_SSE41)
+  __m128i t0, t1;
+#if !defined(HAVE_XOP)
+  __m128i t2;
+#endif
+#endif
+  __m128i ff0, ff1;
+#if defined(HAVE_SSSE3) && !defined(HAVE_XOP)
+  const __m128i r8 = _mm_set_epi8( 12, 15, 14, 13, 8, 11, 10, 9, 4, 7, 6, 5, 0, 3, 2, 1 );
+  const __m128i r16 = _mm_set_epi8( 13, 12, 15, 14, 9, 8, 11, 10, 5, 4, 7, 6, 1, 0, 3, 2 );
+#endif
+#if defined(HAVE_SSE41)
+  const __m128i m0 = LOADU( block +  00 );
+  const __m128i m1 = LOADU( block +  16 );
+  const __m128i m2 = LOADU( block +  32 );
+  const __m128i m3 = LOADU( block +  48 );
+#else
+  const uint32_t  m0 = load32(block +  0 * sizeof(uint32_t));
+  const uint32_t  m1 = load32(block +  1 * sizeof(uint32_t));
+  const uint32_t  m2 = load32(block +  2 * sizeof(uint32_t));
+  const uint32_t  m3 = load32(block +  3 * sizeof(uint32_t));
+  const uint32_t  m4 = load32(block +  4 * sizeof(uint32_t));
+  const uint32_t  m5 = load32(block +  5 * sizeof(uint32_t));
+  const uint32_t  m6 = load32(block +  6 * sizeof(uint32_t));
+  const uint32_t  m7 = load32(block +  7 * sizeof(uint32_t));
+  const uint32_t  m8 = load32(block +  8 * sizeof(uint32_t));
+  const uint32_t  m9 = load32(block +  9 * sizeof(uint32_t));
+  const uint32_t m10 = load32(block + 10 * sizeof(uint32_t));
+  const uint32_t m11 = load32(block + 11 * sizeof(uint32_t));
+  const uint32_t m12 = load32(block + 12 * sizeof(uint32_t));
+  const uint32_t m13 = load32(block + 13 * sizeof(uint32_t));
+  const uint32_t m14 = load32(block + 14 * sizeof(uint32_t));
+  const uint32_t m15 = load32(block + 15 * sizeof(uint32_t));
+#endif
+  row1 = ff0 = LOADU( &S->h[0] );
+  row2 = ff1 = LOADU( &S->h[4] );
+  row3 = _mm_loadu_si128( (__m128i const *)&blake2s_IV[0] );
+  row4 = _mm_xor_si128( _mm_loadu_si128( (__m128i const *)&blake2s_IV[4] ), LOADU( &S->t[0] ) );
+  ROUND( 0 );
+  ROUND( 1 );
+  ROUND( 2 );
+  ROUND( 3 );
+  ROUND( 4 );
+  ROUND( 5 );
+  ROUND( 6 );
+  ROUND( 7 );
+  ROUND( 8 );
+  ROUND( 9 );
+  STOREU( &S->h[0], _mm_xor_si128( ff0, _mm_xor_si128( row1, row3 ) ) );
+  STOREU( &S->h[4], _mm_xor_si128( ff1, _mm_xor_si128( row2, row4 ) ) );
+}
+
+#if 0
+int blake2s_update( blake2s_state *S, const void *pin, size_t inlen )
+{
+  const unsigned char * in = (const unsigned char *)pin;
+  if( inlen > 0 )
+  {
+    size_t left = S->buflen;
+    size_t fill = BLAKE2S_BLOCKBYTES - left;
+    if( inlen > fill )
+    {
+      S->buflen = 0;
+      memcpy( S->buf + left, in, fill ); /* Fill buffer */
+      blake2s_increment_counter( S, BLAKE2S_BLOCKBYTES );
+      blake2s_compress( S, S->buf ); /* Compress */
+      in += fill; inlen -= fill;
+      while(inlen > BLAKE2S_BLOCKBYTES) {
+        blake2s_increment_counter(S, BLAKE2S_BLOCKBYTES);
+        blake2s_compress( S, in );
+        in += BLAKE2S_BLOCKBYTES;
+        inlen -= BLAKE2S_BLOCKBYTES;
+      }
+    }
+    memcpy( S->buf + S->buflen, in, inlen );
+    S->buflen += inlen;
+  }
+  return 0;
+}
+
+int blake2s_final( blake2s_state *S, void *out, size_t outlen )
+{
+  uint8_t buffer[BLAKE2S_OUTBYTES] = {0};
+  size_t i;
+
+  if( out == NULL || outlen < S->outlen )
+    return -1;
+
+  if( blake2s_is_lastblock( S ) )
+    return -1;
+
+  blake2s_increment_counter( S, (uint32_t)S->buflen );
+  blake2s_set_lastblock( S );
+  memset( S->buf + S->buflen, 0, BLAKE2S_BLOCKBYTES - S->buflen ); /* Padding */
+  blake2s_compress( S, S->buf );
+
+  for( i = 0; i < 8; ++i ) /* Output full hash to temp buffer */
+    store32( buffer + sizeof( S->h[i] ) * i, S->h[i] );
+
+  memcpy( out, buffer, S->outlen );
+  memzero_crypto( buffer, sizeof(buffer) );
+  return 0;
+}
+
+/* inlen, at least, should be uint64_t. Others can be size_t. */
+int blake2s( void *out, size_t outlen, const void *in, size_t inlen, const void *key, size_t keylen )
+{
+  blake2s_state S[1];
+
+  /* Verify parameters */
+  if ( NULL == in && inlen > 0 ) return -1;
+
+  if ( NULL == out ) return -1;
+
+  if ( NULL == key && keylen > 0) return -1;
+
+  if( !outlen || outlen > BLAKE2S_OUTBYTES ) return -1;
+
+  if( keylen > BLAKE2S_KEYBYTES ) return -1;
+
+  if( keylen > 0 )
+  {
+    if( blake2s_init_key( S, outlen, key, keylen ) < 0 ) return -1;
+  }
+  else
+  {
+    if( blake2s_init( S, outlen ) < 0 ) return -1;
+  }
+
+  blake2s_update( S, ( const uint8_t * )in, inlen );
+  blake2s_final( S, out, outlen );
+  return 0;
+}
+#endif
+
+#if defined(SUPERCOP)
+int crypto_hash( unsigned char *out, unsigned char *in, unsigned long long inlen )
+{
+  return blake2s( out, BLAKE2S_OUTBYTES, in, inlen, NULL, 0 );
+}
+#endif
+
+#if defined(BLAKE2S_SELFTEST)
+#include <string.h>
+#include "blake2-kat.h"
+int main( void )
+{
+  uint8_t key[BLAKE2S_KEYBYTES];
+  uint8_t buf[BLAKE2_KAT_LENGTH];
+  size_t i, step;
+
+  for( i = 0; i < BLAKE2S_KEYBYTES; ++i )
+    key[i] = ( uint8_t )i;
+
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+    buf[i] = ( uint8_t )i;
+
+  /* Test simple API */
+  for( i = 0; i < BLAKE2_KAT_LENGTH; ++i )
+  {
+    uint8_t hash[BLAKE2S_OUTBYTES];
+    blake2s( hash, BLAKE2S_OUTBYTES, buf, i, key, BLAKE2S_KEYBYTES );
+
+    if( 0 != memcmp( hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES ) )
+    {
+      goto fail;
+    }
+  }
+
+  /* Test streaming API */
+  for(step = 1; step < BLAKE2S_BLOCKBYTES; ++step) {
+    for (i = 0; i < BLAKE2_KAT_LENGTH; ++i) {
+      uint8_t hash[BLAKE2S_OUTBYTES];
+      blake2s_state S;
+      uint8_t * p = buf;
+      size_t mlen = i;
+      int err = 0;
+
+      if( (err = blake2s_init_key(&S, BLAKE2S_OUTBYTES, key, BLAKE2S_KEYBYTES)) < 0 ) {
+        goto fail;
+      }
+
+      while (mlen >= step) {
+        if ( (err = blake2s_update(&S, p, step)) < 0 ) {
+          goto fail;
+        }
+        mlen -= step;
+        p += step;
+      }
+      if ( (err = blake2s_update(&S, p, mlen)) < 0) {
+        goto fail;
+      }
+      if ( (err = blake2s_final(&S, hash, BLAKE2S_OUTBYTES)) < 0) {
+        goto fail;
+      }
+
+      if (0 != memcmp(hash, blake2s_keyed_kat[i], BLAKE2S_OUTBYTES)) {
+        goto fail;
+      }
+    }
+  }
+
+  puts( "ok" );
+  return 0;
+fail:
+  puts("error");
+  return -1;
+}
+#endif
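
The byte-order helpers and the counter update in blake2s_sse.cpp can be exercised in isolation. The following is a minimal sketch, not TunSafe code: `load32_le`, `store32_le`, `counter64`, `increment_counter` and `demo_blake2s_helpers` are hypothetical names that mirror the portable (non-`ARCH_CPU_LITTLE_ENDIAN`) branch of `load32`/`store32` and the arithmetic in `blake2s_increment_counter`.

```cpp
#include <assert.h>
#include <stdint.h>

// Portable little-endian load: assembles the word byte by byte, so the
// result does not depend on the host byte order.
static inline uint32_t load32_le(const void *src) {
  const uint8_t *p = (const uint8_t *)src;
  return ((uint32_t)p[0] << 0) | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Portable little-endian store, the inverse of load32_le.
static inline void store32_le(void *dst, uint32_t w) {
  uint8_t *p = (uint8_t *)dst;
  p[0] = (uint8_t)(w >> 0);
  p[1] = (uint8_t)(w >> 8);
  p[2] = (uint8_t)(w >> 16);
  p[3] = (uint8_t)(w >> 24);
}

// Hypothetical stand-in for the t[2] byte counter inside blake2s_state.
struct counter64 {
  uint32_t t[2];
};

// Adds inc to the 64-bit counter held as two 32-bit halves, carrying
// from the low half into the high half.
static inline void increment_counter(counter64 *c, uint32_t inc) {
  uint64_t t = ((uint64_t)c->t[1] << 32) | c->t[0];
  t += inc;
  c->t[0] = (uint32_t)(t >> 0);
  c->t[1] = (uint32_t)(t >> 32);
}

// Small usage demo: round-trips a word and forces a carry.
static void demo_blake2s_helpers(void) {
  uint8_t b[4];
  store32_le(b, 0x11223344u);
  assert(b[0] == 0x44 && b[3] == 0x11);
  assert(load32_le(b) == 0x11223344u);

  counter64 c = {{0xFFFFFFC0u, 0u}};
  increment_counter(&c, 64);  // 0xFFFFFFC0 + 64 wraps the low half
  assert(c.t[0] == 0u && c.t[1] == 1u);
}
```

Calling `demo_blake2s_helpers()` from a test harness is enough to confirm that the byte-wise path agrees with the layout the SSE code expects on little-endian hosts.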
--- a/crypto/chacha20_x64.asm
+++ b/crypto/chacha20_x64.asm
--- a/crypto/chacha20_x64_gas.s
+++ b/crypto/chacha20_x64_gas.s
--- a/crypto/chacha20_x64_gas_macosx.s
+++ b/crypto/chacha20_x64_gas_macosx.s
--- a/crypto/chacha20poly1305.cpp
+++ b/crypto/chacha20poly1305.cpp
@ -0,0 +1,596 @@
+/* SPDX-License-Identifier: OpenSSL OR (BSD-3-Clause OR GPL-2.0)
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ * Copyright 2016 The OpenSSL Project Authors. All Rights Reserved.
+ */
+
+#include "stdafx.h"
+#include "crypto/chacha20poly1305.h"
+#include "tunsafe_types.h"
+#include "tunsafe_endian.h"
+#include "build_config.h"
+#include "tunsafe_cpu.h"
+#include "crypto_ops.h"
+#include <string.h>
+#include <assert.h>
+
+enum {
+	CHACHA20_IV_SIZE = 16,
+	CHACHA20_KEY_SIZE = 32,
+	CHACHA20_BLOCK_SIZE = 64,
+	POLY1305_BLOCK_SIZE = 16,
+	POLY1305_KEY_SIZE = 32,
+	POLY1305_MAC_SIZE = 16
+};
+
+
+#if defined(OS_MACOSX) || !WITH_AVX512_OPTIMIZATIONS
+#define CHACHA20_WITH_AVX512 0
+#else
+#define CHACHA20_WITH_AVX512 1
+#endif
+
+extern "C" {
+void _cdecl hchacha20_ssse3(uint8 *derived_key, const uint8 *nonce, const uint8 *key);
+void _cdecl chacha20_ssse3(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]);
+void _cdecl chacha20_avx2(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]);
+void _cdecl chacha20_avx512(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]);
+void _cdecl chacha20_avx512vl(uint8 *out, const uint8 *in, size_t len, const uint32 key[8], const uint32 counter[4]);
+void _cdecl poly1305_init_x86_64(void *ctx, const uint8 key[16]);
+void _cdecl poly1305_blocks_x86_64(void *ctx, const uint8 *inp, size_t len, uint32 padbit);
+void _cdecl poly1305_emit_x86_64(void *ctx, uint8 mac[16], const uint32 nonce[4]);
+void _cdecl poly1305_emit_avx(void *ctx, uint8 mac[16], const uint32 nonce[4]);
+void _cdecl poly1305_blocks_avx(void *ctx, const uint8 *inp, size_t len, uint32 padbit);
+void _cdecl poly1305_blocks_avx2(void *ctx, const uint8 *inp, size_t len, uint32 padbit);
+void _cdecl poly1305_blocks_avx512(void *ctx, const uint8 *inp, size_t len, uint32 padbit);
+}
+
+struct chacha20_ctx {
+	uint32 state[CHACHA20_BLOCK_SIZE / sizeof(uint32)];
+};
+
+void crypto_xor(uint8 *dst, const uint8 *src, size_t n) {
+  for (; n >= 4; n -= 4, dst += 4, src += 4)
+    *(uint32*)dst ^= *(const uint32*)src;
+  for (; n; n--)
+    *dst++ ^= *src++;
+}
+
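+/* Compares without data-dependent branches: returns 0 when the buffers are
+ * equal and a nonzero value otherwise. Unlike memcmp, the result carries no
+ * ordering information. */
+int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n) {
+  int rv = 0;
+  for (; n >= 4; n -= 4, a += 4, b += 4)
+    rv |= *(const uint32*)a ^ *(const uint32*)b;
+  for (; n; n--)
+    rv |= *a++ ^ *b++;
+  return rv;
+}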
+
+#define QUARTER_ROUND(x, a, b, c, d) ( \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 16), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 12), \
+	x[a] += x[b], \
+	x[d] = rol32((x[d] ^ x[a]), 8), \
+	x[c] += x[d], \
+	x[b] = rol32((x[b] ^ x[c]), 7) \
+)
+
+#define C(i, j) (i * 4 + j)
+
+#define DOUBLE_ROUND(x) ( \
+	/* Column Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 3), C(2, 3), C(3, 3)), \
+	/* Diagonal Round */ \
+	QUARTER_ROUND(x, C(0, 0), C(1, 1), C(2, 2), C(3, 3)), \
+	QUARTER_ROUND(x, C(0, 1), C(1, 2), C(2, 3), C(3, 0)), \
+	QUARTER_ROUND(x, C(0, 2), C(1, 3), C(2, 0), C(3, 1)), \
+	QUARTER_ROUND(x, C(0, 3), C(1, 0), C(2, 1), C(3, 2)) \
+)
+
+#define TWENTY_ROUNDS(x) ( \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x), \
+	DOUBLE_ROUND(x) \
+)
+
+SAFEBUFFERS static void chacha20_block_generic(struct chacha20_ctx *ctx, uint32 *stream)
+{
+	uint32 x[CHACHA20_BLOCK_SIZE / sizeof(uint32)];
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		x[i] = ctx->state[i];
+
+	TWENTY_ROUNDS(x);
+
+	for (i = 0; i < ARRAY_SIZE(x); ++i)
+		stream[i] = ToLE32(x[i] + ctx->state[i]);
+
+	++ctx->state[12];
+}
+
+SAFEBUFFERS static void hchacha20_generic(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN])
+{
+	uint32 *out = (uint32 *)derived_key;
+	uint32 x[] = {
+		0x61707865, 0x3320646e, 0x79622d32, 0x6b206574,
+    ReadLE32(key + 0), ReadLE32(key + 4), ReadLE32(key + 8), ReadLE32(key + 12),
+    ReadLE32(key + 16), ReadLE32(key + 20), ReadLE32(key + 24), ReadLE32(key + 28),
+    ReadLE32(nonce +  0), ReadLE32(nonce +  4), ReadLE32(nonce +  8), ReadLE32(nonce + 12)
+	};
+
+	TWENTY_ROUNDS(x);
+
+	out[0] = ToLE32(x[0]);
+	out[1] = ToLE32(x[1]);
+	out[2] = ToLE32(x[2]);
+	out[3] = ToLE32(x[3]);
+	out[4] = ToLE32(x[12]);
+	out[5] = ToLE32(x[13]);
+	out[6] = ToLE32(x[14]);
+	out[7] = ToLE32(x[15]);
+}
+
+static inline void hchacha20(uint8 derived_key[CHACHA20POLY1305_KEYLEN], const uint8 nonce[16], const uint8 key[CHACHA20POLY1305_KEYLEN])
+{
+#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+	if (X86_PCAP_SSSE3) {
+		hchacha20_ssse3(derived_key, nonce, key);
+		return;
+	}
+#endif  // defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+	hchacha20_generic(derived_key, nonce, key);
+}
+
+#define chacha20_initial_state(key, nonce) {{ \
+	0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, \
+	ReadLE32((key) + 0), ReadLE32((key) + 4), ReadLE32((key) + 8), ReadLE32((key) + 12), \
+	ReadLE32((key) + 16), ReadLE32((key) + 20), ReadLE32((key) + 24), ReadLE32((key) + 28), \
+	0, 0, ReadLE32((nonce) +  0), ReadLE32((nonce) + 4) \
+}}
+
+SAFEBUFFERS static void chacha20_crypt(struct chacha20_ctx *ctx, uint8 *dst, const uint8 *src, uint32 bytes)
+{
+	uint32 buf[CHACHA20_BLOCK_SIZE / sizeof(uint32)];
+
+  if (bytes == 0)
+    return;
+
+#if defined(ARCH_CPU_X86_64)
+#if CHACHA20_WITH_AVX512
+	if (X86_PCAP_AVX512F) {
+		chacha20_avx512(dst, src, bytes, &ctx->state[4], &ctx->state[12]);
+		ctx->state[12] += (bytes + 63) / 64;
+		return;
+	}
+	if (X86_PCAP_AVX512VL) {
+		chacha20_avx512vl(dst, src, bytes, &ctx->state[4], &ctx->state[12]);
+		ctx->state[12] += (bytes + 63) / 64;
+		return;
+	}
+#endif  // CHACHA20_WITH_AVX512
+  if (X86_PCAP_AVX2) {
+    chacha20_avx2(dst, src, bytes, &ctx->state[4], &ctx->state[12]);
+    ctx->state[12] += (bytes + 63) / 64;
+    return;
+  }
+  if (X86_PCAP_SSSE3) {
+    assert(bytes);
+    chacha20_ssse3(dst, src, bytes, &ctx->state[4], &ctx->state[12]);
+    ctx->state[12] += (bytes + 63) / 64;
+    return;
+  }
+#endif  // defined(ARCH_CPU_X86_64)
+
+	if (dst != src)
+		memcpy(dst, src, bytes);
+
+	while (bytes >= CHACHA20_BLOCK_SIZE) {
+		chacha20_block_generic(ctx, buf);
+		crypto_xor(dst, (uint8 *)buf, CHACHA20_BLOCK_SIZE);
+		bytes -= CHACHA20_BLOCK_SIZE;
+		dst += CHACHA20_BLOCK_SIZE;
+	}
+	if (bytes) {
+		chacha20_block_generic(ctx, buf);
+		crypto_xor(dst, (uint8 *)buf, bytes);
+	}
+}
+
+struct poly1305_ctx {
+	uint8 opaque[24 * sizeof(uint64)];
+	uint32 nonce[4];
+	uint8 data[POLY1305_BLOCK_SIZE];
+	size_t num;
+};
+
+#if !(defined(CONFIG_X86_64) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || (defined(CONFIG_MIPS) && defined(CONFIG_64BIT)))
+struct poly1305_internal {
+	uint32 h[5];
+	uint32 r[4];
+};
+
+static void poly1305_init_generic(void *ctx, const uint8 key[16]) {
+	struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+
+	/* h = 0 */
+	st->h[0] = 0;
+	st->h[1] = 0;
+	st->h[2] = 0;
+	st->h[3] = 0;
+	st->h[4] = 0;
+
+	/* r &= 0x0ffffffc0ffffffc0ffffffc0fffffff */
+	st->r[0] = ReadLE32(&key[ 0]) & 0x0fffffff;
+	st->r[1] = ReadLE32(&key[ 4]) & 0x0ffffffc;
+	st->r[2] = ReadLE32(&key[ 8]) & 0x0ffffffc;
+	st->r[3] = ReadLE32(&key[12]) & 0x0ffffffc;
+}
+
+static void poly1305_blocks_generic(void *ctx, const uint8 *inp, size_t len, uint32 padbit)
+{
+#define CONSTANT_TIME_CARRY(a,b) ((a ^ ((a ^ b) | ((a - b) ^ b))) >> (sizeof(a) * 8 - 1))
+	struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+	uint32 r0, r1, r2, r3;
+	uint32 s1, s2, s3;
+	uint32 h0, h1, h2, h3, h4, c;
+	uint64 d0, d1, d2, d3;
+
+	r0 = st->r[0];
+	r1 = st->r[1];
+	r2 = st->r[2];
+	r3 = st->r[3];
+
+	s1 = r1 + (r1 >> 2);
+	s2 = r2 + (r2 >> 2);
+	s3 = r3 + (r3 >> 2);
+
+	h0 = st->h[0];
+	h1 = st->h[1];
+	h2 = st->h[2];
+	h3 = st->h[3];
+	h4 = st->h[4];
+
+	while (len >= POLY1305_BLOCK_SIZE) {
+		/* h += m[i] */
+		h0 = (uint32)(d0 = (uint64)h0 + ReadLE32(inp + 0));
+		h1 = (uint32)(d1 = (uint64)h1 + (d0 >> 32) + ReadLE32(inp + 4));
+		h2 = (uint32)(d2 = (uint64)h2 + (d1 >> 32) + ReadLE32(inp + 8));
+		h3 = (uint32)(d3 = (uint64)h3 + (d2 >> 32) + ReadLE32(inp + 12));
+		h4 += (uint32)(d3 >> 32) + padbit;
+
+		/* h *= r "%" p, where "%" stands for "partial remainder" */
+		d0 = ((uint64)h0 * r0) +
+		     ((uint64)h1 * s3) +
+		     ((uint64)h2 * s2) +
+		     ((uint64)h3 * s1);
+		d1 = ((uint64)h0 * r1) +
+		     ((uint64)h1 * r0) +
+		     ((uint64)h2 * s3) +
+		     ((uint64)h3 * s2) +
+		     (h4 * s1);
+		d2 = ((uint64)h0 * r2) +
+		     ((uint64)h1 * r1) +
+		     ((uint64)h2 * r0) +
+		     ((uint64)h3 * s3) +
+		     (h4 * s2);
+		d3 = ((uint64)h0 * r3) +
+		     ((uint64)h1 * r2) +
+		     ((uint64)h2 * r1) +
+		     ((uint64)h3 * r0) +
+		     (h4 * s3);
+		h4 = (h4 * r0);
+
+		/* last reduction step: */
+		/* a) h4:h0 = h4<<128 + d3<<96 + d2<<64 + d1<<32 + d0 */
+		h0 = (uint32)d0;
+		h1 = (uint32)(d1 += d0 >> 32);
+		h2 = (uint32)(d2 += d1 >> 32);
+		h3 = (uint32)(d3 += d2 >> 32);
+		h4 += (uint32)(d3 >> 32);
+		/* b) (h4:h0 += (h4:h0>>130) * 5) %= 2^130 */
+		c = (h4 >> 2) + (h4 & ~3U);
+		h4 &= 3;
+		h0 += c;
+		h1 += (c = CONSTANT_TIME_CARRY(h0,c));
+		h2 += (c = CONSTANT_TIME_CARRY(h1,c));
+		h3 += (c = CONSTANT_TIME_CARRY(h2,c));
+		h4 += CONSTANT_TIME_CARRY(h3,c);
+		/*
+		 * Occasional overflows to 3rd bit of h4 are taken care of
+		 * "naturally". If after this point we end up at the top of
+		 * this loop, then the overflow bit will be accounted for
+		 * in next iteration. If we end up in poly1305_emit, then
+		 * comparison to modulus below will still count as "carry
+		 * into 131st bit", so that properly reduced value will be
+		 * picked in conditional move.
+		 */
+
+		inp += POLY1305_BLOCK_SIZE;
+		len -= POLY1305_BLOCK_SIZE;
+	}
+
+	st->h[0] = h0;
+	st->h[1] = h1;
+	st->h[2] = h2;
+	st->h[3] = h3;
+	st->h[4] = h4;
+#undef CONSTANT_TIME_CARRY
+}
+
+static void poly1305_emit_generic(void *ctx, uint8 mac[16], const uint32 nonce[4])
+{
+	struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+	uint32 *omac = (uint32 *)mac;
+	uint32 h0, h1, h2, h3, h4;
+	uint32 g0, g1, g2, g3, g4;
+	uint64 t;
+	uint32 mask;
+
+	h0 = st->h[0];
+	h1 = st->h[1];
+	h2 = st->h[2];
+	h3 = st->h[3];
+	h4 = st->h[4];
+
+	/* compare to modulus by computing h + -p */
+	g0 = (uint32)(t = (uint64)h0 + 5);
+	g1 = (uint32)(t = (uint64)h1 + (t >> 32));
+	g2 = (uint32)(t = (uint64)h2 + (t >> 32));
+	g3 = (uint32)(t = (uint64)h3 + (t >> 32));
+	g4 = h4 + (uint32)(t >> 32);
+
+	/* if there was carry into 131st bit, h3:h0 = g3:g0 */
+	mask = 0 - (g4 >> 2);
+	g0 &= mask;
+	g1 &= mask;
+	g2 &= mask;
+	g3 &= mask;
+	mask = ~mask;
+	h0 = (h0 & mask) | g0;
+	h1 = (h1 & mask) | g1;
+	h2 = (h2 & mask) | g2;
+	h3 = (h3 & mask) | g3;
+
+	/* mac = (h + nonce) % (2^128) */
+	h0 = (uint32)(t = (uint64)h0 + nonce[0]);
+	h1 = (uint32)(t = (uint64)h1 + (t >> 32) + nonce[1]);
+	h2 = (uint32)(t = (uint64)h2 + (t >> 32) + nonce[2]);
+	h3 = (uint32)(t = (uint64)h3 + (t >> 32) + nonce[3]);
+
+	omac[0] = ToLE32(h0);
+	omac[1] = ToLE32(h1);
+	omac[2] = ToLE32(h2);
+	omac[3] = ToLE32(h3);
+}
+#endif
+
+SAFEBUFFERS static void poly1305_init(struct poly1305_ctx *ctx, const uint8 key[POLY1305_KEY_SIZE])
+{
+	ctx->nonce[0] = ReadLE32(&key[16]);
+	ctx->nonce[1] = ReadLE32(&key[20]);
+	ctx->nonce[2] = ReadLE32(&key[24]);
+	ctx->nonce[3] = ReadLE32(&key[28]);
+
+#if defined(ARCH_CPU_X86_64)
+	poly1305_init_x86_64(ctx->opaque, key);
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	poly1305_init_arm(ctx->opaque, key);
+#elif defined(CONFIG_MIPS) && defined(CONFIG_64BIT)
+	poly1305_init_mips(ctx->opaque, key);
+#else
+	poly1305_init_generic(ctx->opaque, key);
+#endif
+	ctx->num = 0;
+}
+
+static inline void poly1305_blocks(void *ctx, const uint8 *inp, size_t len, uint32 padbit)
+{
+#if defined(ARCH_CPU_X86_64)
+#if CHACHA20_WITH_AVX512
+	if(X86_PCAP_AVX512F)
+		poly1305_blocks_avx512(ctx, inp, len, padbit);
+	else 
+#endif  // CHACHA20_WITH_AVX512
+  if (X86_PCAP_AVX2)
+    poly1305_blocks_avx2(ctx, inp, len, padbit);
+  else if (X86_PCAP_AVX)
+    poly1305_blocks_avx(ctx, inp, len, padbit);
+  else
+		poly1305_blocks_x86_64(ctx, inp, len, padbit);
+#else  // defined(ARCH_CPU_X86_64)
+  poly1305_blocks_generic(ctx, inp, len, padbit);
+#endif  // defined(ARCH_CPU_X86_64)
+}
+
+static inline void poly1305_emit(void *ctx, uint8 mac[16], const uint32 nonce[4])
+{
+#if defined(ARCH_CPU_X86_64)
+  if (X86_PCAP_AVX)
+    poly1305_emit_avx(ctx, mac, nonce);
+  else
+    poly1305_emit_x86_64(ctx, mac, nonce);
+#else  // defined(ARCH_CPU_X86_64)
+	poly1305_emit_generic(ctx, mac, nonce);
+#endif  // defined(ARCH_CPU_X86_64)
+}
+
+SAFEBUFFERS static void poly1305_update(struct poly1305_ctx *ctx, const uint8 *inp, size_t len)
+{
+	const size_t num = ctx->num;
+	size_t rem;
+
+	if (num) {
+		rem = POLY1305_BLOCK_SIZE - num;
+		if (len >= rem) {
+			memcpy(ctx->data + num, inp, rem);
+			poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1);
+			inp += rem;
+			len -= rem;
+		} else {
+			/* Still not enough data to process a block. */
+			memcpy(ctx->data + num, inp, len);
+			ctx->num = num + len;
+			return;
+		}
+	}
+
+	rem = len % POLY1305_BLOCK_SIZE;
+	len -= rem;
+
+	if (len >= POLY1305_BLOCK_SIZE) {
+		poly1305_blocks(ctx->opaque, inp, len, 1);
+		inp += len;
+	}
+
+	if (rem)
+		memcpy(ctx->data, inp, rem);
+
+	ctx->num = rem;
+}
+
+SAFEBUFFERS static void poly1305_finish(struct poly1305_ctx *ctx, uint8 mac[16])
+{
+	size_t num = ctx->num;
+
+	if (num) {
+		ctx->data[num++] = 1;   /* pad bit */
+		while (num < POLY1305_BLOCK_SIZE)
+			ctx->data[num++] = 0;
+		poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 0);
+	}
+
+	poly1305_emit(ctx->opaque, mac, ctx->nonce);
+
+	/* zero out the state */
+	memzero_crypto(ctx, sizeof(*ctx));
+}
+
+static const uint8 pad0[16] = { 0 };
+
+SAFEBUFFERS static FORCEINLINE void poly1305_getmac(const uint8 *ad, size_t ad_len, const uint8 *src, size_t src_len, const uint8 key[POLY1305_KEY_SIZE], uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) {
+  uint64 len[2];
+  struct poly1305_ctx poly1305_state;
+
+  poly1305_init(&poly1305_state, key);
+  poly1305_update(&poly1305_state, ad, ad_len);
+  poly1305_update(&poly1305_state, pad0, (0 - ad_len) & 0xf);
+  poly1305_update(&poly1305_state, src, src_len);
+  poly1305_update(&poly1305_state, pad0, (0 - src_len) & 0xf);
+  len[0] = ToLE64(ad_len);
+  len[1] = ToLE64(src_len);
+  poly1305_update(&poly1305_state, (uint8 *)&len, sizeof(len));
+  poly1305_finish(&poly1305_state, mac);
+}
+
+struct ChaChaState {
+  struct chacha20_ctx chacha20_state;
+  uint8 block0[CHACHA20_BLOCK_SIZE];
+};
+
+static inline void InitializeChaChaState(ChaChaState *st, const uint8 key[CHACHA20POLY1305_KEYLEN], uint64 nonce) {
+  uint64 le_nonce = ToLE64(nonce);
+  WriteLE64((uint8*)st, 0x3320646e61707865);
+  WriteLE64((uint8*)st + 8, 0x6b20657479622d32);
+  Write64((uint8*)st + 16, Read64(key + 0));
+  Write64((uint8*)st + 24, Read64(key + 8));
+  Write64((uint8*)st + 32, Read64(key + 16));
+  Write64((uint8*)st + 40, Read64(key + 24));
+  Write64((uint8*)st + 48, 0);
+  Write64((uint8*)st + 56, Read64((uint8*)&le_nonce));
+
+  Write64((uint8*)st + 64 + 0 * 8, 0);
+  Write64((uint8*)st + 64 + 1 * 8, 0);
+  Write64((uint8*)st + 64 + 2 * 8, 0);
+  Write64((uint8*)st + 64 + 3 * 8, 0);
+  Write64((uint8*)st + 64 + 4 * 8, 0);
+  Write64((uint8*)st + 64 + 5 * 8, 0);
+  Write64((uint8*)st + 64 + 6 * 8, 0);
+  Write64((uint8*)st + 64 + 7 * 8, 0);
+}
+
+SAFEBUFFERS void poly1305_get_mac(const uint8 *src, size_t src_len,
+                     const uint8 *ad, const size_t ad_len,
+                     const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN],
+                     uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) {
+  ChaChaState st;
+
+  InitializeChaChaState(&st, key, nonce);
+  chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0));
+  poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac);
+  memzero_crypto(&st, sizeof(st));
+}
+
+SAFEBUFFERS void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+					      const uint8 *ad, const size_t ad_len,
+					      const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) {
+  ChaChaState st;
+
+  InitializeChaChaState(&st, key, nonce);
+	chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0));
+  chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len);
+  poly1305_getmac(ad, ad_len, dst, src_len, st.block0, dst + src_len);
+  memzero_crypto(&st, sizeof(st));
+}
+
+SAFEBUFFERS void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,
+                                      const uint8 *ad, const size_t ad_len,
+                                      const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN],
+                                      uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]) {
+  ChaChaState st;
+
+  InitializeChaChaState(&st, key, nonce);
+  chacha20_crypt(&st.chacha20_state, st.block0, st.block0, sizeof(st.block0));
+  poly1305_getmac(ad, ad_len, src, src_len, st.block0, mac);
+  chacha20_crypt(&st.chacha20_state, dst, src, (uint32)src_len);
+  memzero_crypto(&st, sizeof(st));
+}
+
+SAFEBUFFERS bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                              const uint8 *ad, const size_t ad_len,
+                              const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]) {
+  uint8 mac[POLY1305_MAC_SIZE];
+
+  if (src_len < CHACHA20POLY1305_AUTHTAGLEN)
+    return false;
+  chacha20poly1305_decrypt_get_mac(dst, src, src_len - CHACHA20POLY1305_AUTHTAGLEN, ad, ad_len, nonce, key, mac);
+  return memcmp_crypto(mac, src + src_len - CHACHA20POLY1305_AUTHTAGLEN, CHACHA20POLY1305_AUTHTAGLEN) == 0;
+}
+
+void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+			       const uint8 *ad, const size_t ad_len,
+			       const uint8 nonce[XCHACHA20POLY1305_NONCELEN],
+			       const uint8 key[CHACHA20POLY1305_KEYLEN])
+{
+  __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN];
+
+	hchacha20(derived_key, nonce, key);
+	chacha20poly1305_encrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key);
+	memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN);
+}
+
+bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+			       const uint8 *ad, const size_t ad_len,
+			       const uint8 nonce[XCHACHA20POLY1305_NONCELEN],
+			       const uint8 key[CHACHA20POLY1305_KEYLEN]) {
+  bool ret;
+  __aligned(16) uint8 derived_key[CHACHA20POLY1305_KEYLEN];
+
+	hchacha20(derived_key, nonce, key);
+	ret = chacha20poly1305_decrypt(dst, src, src_len, ad, ad_len, ReadLE64(nonce + 16), derived_key);
+	memzero_crypto(derived_key, CHACHA20POLY1305_KEYLEN);
+
+	return ret;
+}
+
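
The `QUARTER_ROUND` macro in chacha20poly1305.cpp can be checked against the published quarter-round test vector. This is a minimal standalone sketch under stated assumptions: `rotl32`, `quarter_round` and `quarter_round_matches_rfc` are hypothetical names, `rotl32` stands in for the `rol32` helper that the source takes from elsewhere, and the input and expected words are the ones given in RFC 8439, section 2.1.1.

```cpp
#include <assert.h>
#include <stdint.h>

// Rotate a 32-bit word left by n bits; n must be in 1..31.
static inline uint32_t rotl32(uint32_t v, unsigned n) {
  assert(n > 0 && n < 32);
  return (v << n) | (v >> (32 - n));
}

// One ChaCha quarter round on four words of a 16-word state, using the
// same operation order as the QUARTER_ROUND macro.
static inline void quarter_round(uint32_t x[16], int a, int b, int c, int d) {
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

// Returns true when the quarter round reproduces the RFC 8439 vector.
static bool quarter_round_matches_rfc(void) {
  uint32_t x[16] = {0x11111111u, 0x01020304u, 0x9b8d6f43u, 0x01234567u};
  quarter_round(x, 0, 1, 2, 3);
  return x[0] == 0xea2a92f4u && x[1] == 0xcb1cf8ceu &&
         x[2] == 0x4581472eu && x[3] == 0x5881c4bbu;
}
```

A function form is used here instead of a comma-expression macro so the index arguments are evaluated once; the arithmetic is intended to be identical.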
--- a/crypto/chacha20poly1305.h
+++ b/crypto/chacha20poly1305.h
@ -0,0 +1,39 @@
+#pragma once
+#include "tunsafe_types.h"
+
+
+enum {
+  XCHACHA20POLY1305_NONCELEN = 24,
+  CHACHA20POLY1305_KEYLEN = 32,
+  CHACHA20POLY1305_AUTHTAGLEN = 16
+};
+
+
+void chacha20poly1305_decrypt_get_mac(uint8 *dst, const uint8 *src, const size_t src_len,
+                                      const uint8 *ad, const size_t ad_len,
+                                      const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN],
+                                      uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]);
+
+bool chacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                              const uint8 *ad, const size_t ad_len,
+                              const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]);
+
+void chacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                              const uint8 *ad, const size_t ad_len,
+                              const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN]);
+
+
+void xchacha20poly1305_encrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                               const uint8 *ad, const size_t ad_len,
+                               const uint8 nonce[XCHACHA20POLY1305_NONCELEN],
+                               const uint8 key[CHACHA20POLY1305_KEYLEN]);
+
+bool xchacha20poly1305_decrypt(uint8 *dst, const uint8 *src, const size_t src_len,
+                               const uint8 *ad, const size_t ad_len,
+                               const uint8 nonce[XCHACHA20POLY1305_NONCELEN],
+                               const uint8 key[CHACHA20POLY1305_KEYLEN]);
+
+void poly1305_get_mac(const uint8 *src, size_t src_len,
+                     const uint8 *ad, const size_t ad_len,
+                     const uint64 nonce, const uint8 key[CHACHA20POLY1305_KEYLEN],
+                     uint8 mac[CHACHA20POLY1305_AUTHTAGLEN]);
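
Callers of the functions declared in chacha20poly1305.h need to size their buffers around the trailing tag: `chacha20poly1305_encrypt` writes the MAC at `dst + src_len`, and `chacha20poly1305_decrypt` rejects inputs shorter than the tag and treats `src_len` as including it. The sketch below restates that arithmetic; `kTagLen`, `sealed_len`, `opened_len` and `pad16` are hypothetical helper names, not part of the TunSafe API, and `pad16` mirrors the `(0 - len) & 0xf` padding expression used in `poly1305_getmac`.

```cpp
#include <assert.h>
#include <stddef.h>

// Mirrors CHACHA20POLY1305_AUTHTAGLEN (assumed to stay at 16 bytes).
enum { kTagLen = 16 };

// Bytes the encrypt destination must hold: ciphertext plus trailing tag.
static inline size_t sealed_len(size_t plaintext_len) {
  return plaintext_len + kTagLen;
}

// Recovers the plaintext length from a sealed buffer length. Returns false
// when the input is too short to contain a tag, the same condition under
// which chacha20poly1305_decrypt returns false.
static inline bool opened_len(size_t sealed, size_t *plaintext_len) {
  assert(plaintext_len != NULL);
  if (sealed < kTagLen)
    return false;
  *plaintext_len = sealed - kTagLen;
  return true;
}

// Number of zero bytes needed to round len up to a 16-byte boundary.
static inline size_t pad16(size_t len) {
  return ((size_t)0 - len) & 0xf;
}
```

With these helpers, a caller would allocate `sealed_len(n)` bytes for the encrypt output and pass the full sealed length as `src_len` when decrypting.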
--- a/crypto/curve25519-donna.cpp
+++ b/crypto/curve25519-donna.cpp
@ -0,0 +1,737 @@
+/* Copyright 2008, Google Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *     * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * curve25519-donna: Curve25519 elliptic curve, public key function
+ *
+ * http://code.google.com/p/curve25519-donna/
+ *
+ * Adam Langley <agl@imperialviolet.org>
+ *
+ * Derived from public domain C code by Daniel J. Bernstein <djb@cr.yp.to>
+ *
+ * More information about curve25519 can be found here
+ *   http://cr.yp.to/ecdh.html
+ *
+ * djb's sample implementation of curve25519 is written in a special assembly
+ * language called qhasm and uses the floating point registers.
+ *
+ * This is, almost, a clean room reimplementation from the curve25519 paper. It
+ * uses many of the tricks described therein. Only the crecip function is taken
+ * from the sample implementation.
+ */
+
+#include <string.h>
+#include <stdint.h>
+
+#ifdef _MSC_VER
+#define inline __inline
+#endif
+
+typedef uint8_t u8;
+typedef int32_t s32;
+typedef int64_t limb;
+
+/* Field element representation:
+ *
+ * Field elements are written as an array of signed, 64-bit limbs, least
+ * significant first. The value of the field element is:
+ *   x[0] + 2^26·x[1] + 2^51·x[2] + 2^77·x[3] + 2^102·x[4] + ...
+ *
+ * i.e. the limbs are 26, 25, 26, 25, ... bits wide.
+ */
+
+/* Sum two numbers: output += in */
+static void fsum(limb *output, const limb *in) {
+  unsigned i;
+  for (i = 0; i < 10; i += 2) {
+    output[0+i] = (output[0+i] + in[0+i]);
+    output[1+i] = (output[1+i] + in[1+i]);
+  }
+}
+
+/* Find the difference of two numbers: output = in - output
+ * (note the order of the arguments!)
+ */
+static void fdifference(limb *output, const limb *in) {
+  unsigned i;
+  for (i = 0; i < 10; ++i) {
+    output[i] = (in[i] - output[i]);
+  }
+}
+
+/* Multiply a number by a scalar: output = in * scalar */
+static void fscalar_product(limb *output, const limb *in, const limb scalar) {
+  unsigned i;
+  for (i = 0; i < 10; ++i) {
+    output[i] = in[i] * scalar;
+  }
+}
+
+/* Multiply two numbers: output = in2 * in
+ *
+ * output must be distinct from both inputs. The inputs are in reduced
+ * coefficient form; the output is not.
+ */
+static void fproduct(limb *output, const limb *in2, const limb *in) {
+  output[0] =       ((limb) ((s32) in2[0])) * ((s32) in[0]);
+  output[1] =       ((limb) ((s32) in2[0])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[0]);
+  output[2] =  2 *  ((limb) ((s32) in2[1])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[0]);
+  output[3] =       ((limb) ((s32) in2[1])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[0]);
+  output[4] =       ((limb) ((s32) in2[2])) * ((s32) in[2]) +
+               2 * (((limb) ((s32) in2[1])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[1])) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[0]);
+  output[5] =       ((limb) ((s32) in2[2])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[0]);
+  output[6] =  2 * (((limb) ((s32) in2[3])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[1])) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[0]);
+  output[7] =       ((limb) ((s32) in2[3])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[0]);
+  output[8] =       ((limb) ((s32) in2[4])) * ((s32) in[4]) +
+               2 * (((limb) ((s32) in2[3])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[1])) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[0]);
+  output[9] =       ((limb) ((s32) in2[4])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[2]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[1]) +
+                    ((limb) ((s32) in2[0])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[0]);
+  output[10] = 2 * (((limb) ((s32) in2[5])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[1])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[1])) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[2]);
+  output[11] =      ((limb) ((s32) in2[5])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[4]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[3]) +
+                    ((limb) ((s32) in2[2])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[2]);
+  output[12] =      ((limb) ((s32) in2[6])) * ((s32) in[6]) +
+               2 * (((limb) ((s32) in2[5])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[3])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[3])) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[4]);
+  output[13] =      ((limb) ((s32) in2[6])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[7])) * ((s32) in[6]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[5]) +
+                    ((limb) ((s32) in2[4])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[4]);
+  output[14] = 2 * (((limb) ((s32) in2[7])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[5])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[5])) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[6]);
+  output[15] =      ((limb) ((s32) in2[7])) * ((s32) in[8]) +
+                    ((limb) ((s32) in2[8])) * ((s32) in[7]) +
+                    ((limb) ((s32) in2[6])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[6]);
+  output[16] =      ((limb) ((s32) in2[8])) * ((s32) in[8]) +
+               2 * (((limb) ((s32) in2[7])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[7]));
+  output[17] =      ((limb) ((s32) in2[8])) * ((s32) in[9]) +
+                    ((limb) ((s32) in2[9])) * ((s32) in[8]);
+  output[18] = 2 *  ((limb) ((s32) in2[9])) * ((s32) in[9]);
+}
+
+/* Reduce a long form to a short form by taking the input mod 2^255 - 19. */
+static void freduce_degree(limb *output) {
+  /* Each of these shifts and adds ends up multiplying the value by 19. */
+  output[8] += output[18] << 4;
+  output[8] += output[18] << 1;
+  output[8] += output[18];
+  output[7] += output[17] << 4;
+  output[7] += output[17] << 1;
+  output[7] += output[17];
+  output[6] += output[16] << 4;
+  output[6] += output[16] << 1;
+  output[6] += output[16];
+  output[5] += output[15] << 4;
+  output[5] += output[15] << 1;
+  output[5] += output[15];
+  output[4] += output[14] << 4;
+  output[4] += output[14] << 1;
+  output[4] += output[14];
+  output[3] += output[13] << 4;
+  output[3] += output[13] << 1;
+  output[3] += output[13];
+  output[2] += output[12] << 4;
+  output[2] += output[12] << 1;
+  output[2] += output[12];
+  output[1] += output[11] << 4;
+  output[1] += output[11] << 1;
+  output[1] += output[11];
+  output[0] += output[10] << 4;
+  output[0] += output[10] << 1;
+  output[0] += output[10];
+}
+
+#if (-1 & 3) != 3
+#error "This code only works on a two's complement system"
+#endif
+
+/* return v / 2^26 (rounded toward zero), using only shifts and adds. */
+static inline limb
+div_by_2_26(const limb v)
+{
+  /* High word of v; no shift needed*/
+  const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32);
+  /* Set to all 1s if v was negative; else set to 0s. */
+  const int32_t sign = ((int32_t) highword) >> 31;
+  /* Set to 0x3ffffff if v was negative; else set to 0. */
+  const int32_t roundoff = ((uint32_t) sign) >> 6;
+  /* Should return v / (1<<26) */
+  return (v + roundoff) >> 26;
+}
+
+/* return v / 2^25 (rounded toward zero), using only shifts and adds. */
+static inline limb
+div_by_2_25(const limb v)
+{
+  /* High word of v; no shift needed*/
+  const uint32_t highword = (uint32_t) (((uint64_t) v) >> 32);
+  /* Set to all 1s if v was negative; else set to 0s. */
+  const int32_t sign = ((int32_t) highword) >> 31;
+  /* Set to 0x1ffffff if v was negative; else set to 0. */
+  const int32_t roundoff = ((uint32_t) sign) >> 7;
+  /* Should return v / (1<<25) */
+  return (v + roundoff) >> 25;
+}
+
+static inline s32
+div_s32_by_2_25(const s32 v)
+{
+   const s32 roundoff = ((uint32_t)(v >> 31)) >> 7;
+   return (v + roundoff) >> 25;
+}
+
+/* Reduce all coefficients of the short form input so that |x| < 2^26.
+ *
+ * On entry: |output[i]| < 2^62
+ */
+static void freduce_coefficients(limb *output) {
+  unsigned i;
+
+  output[10] = 0;
+
+  for (i = 0; i < 10; i += 2) {
+    limb over = div_by_2_26(output[i]);
+    output[i] -= over << 26;
+    output[i+1] += over;
+
+    over = div_by_2_25(output[i+1]);
+    output[i+1] -= over << 25;
+    output[i+2] += over;
+  }
+  /* Now |output[10]| < 2 ^ 38 and all other coefficients are reduced. */
+  output[0] += output[10] << 4;
+  output[0] += output[10] << 1;
+  output[0] += output[10];
+
+  output[10] = 0;
+
+  /* Now output[1..9] are reduced, and |output[0]| < 2^26 + 19 * 2^38
+   * So |over| will be no more than 77825  */
+  {
+    limb over = div_by_2_26(output[0]);
+    output[0] -= over << 26;
+    output[1] += over;
+  }
+
+  /* Now output[0,2..9] are reduced, and |output[1]| < 2^25 + 77825
+   * So |over| will be no more than 1. */
+  {
+    /* output[1] fits in 32 bits, so we can use div_s32_by_2_25 here. */
+    s32 over32 = div_s32_by_2_25((s32) output[1]);
+    output[1] -= over32 << 25;
+    output[2] += over32;
+  }
+
+  /* Finally, output[0,1,3..9] are reduced, and output[2] is "nearly reduced":
+   * we have |output[2]| <= 2^26.  This is good enough for all of our math,
+   * but it will require an extra freduce_coefficients before fcontract. */
+}
+
+/* A helpful wrapper around fproduct: output = in * in2.
+ *
+ * output must be distinct from both inputs. The output is in reduced-degree,
+ * reduced-coefficient form.
+ */
+static void
+fmul(limb *output, const limb *in, const limb *in2) {
+  limb t[19];
+  fproduct(t, in, in2);
+  freduce_degree(t);
+  freduce_coefficients(t);
+  memcpy(output, t, sizeof(limb) * 10);
+}
+
+static void fsquare_inner(limb *output, const limb *in) {
+  output[0] =       ((limb) ((s32) in[0])) * ((s32) in[0]);
+  output[1] =  2 *  ((limb) ((s32) in[0])) * ((s32) in[1]);
+  output[2] =  2 * (((limb) ((s32) in[1])) * ((s32) in[1]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[2]));
+  output[3] =  2 * (((limb) ((s32) in[1])) * ((s32) in[2]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[3]));
+  output[4] =       ((limb) ((s32) in[2])) * ((s32) in[2]) +
+               4 *  ((limb) ((s32) in[1])) * ((s32) in[3]) +
+               2 *  ((limb) ((s32) in[0])) * ((s32) in[4]);
+  output[5] =  2 * (((limb) ((s32) in[2])) * ((s32) in[3]) +
+                    ((limb) ((s32) in[1])) * ((s32) in[4]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[5]));
+  output[6] =  2 * (((limb) ((s32) in[3])) * ((s32) in[3]) +
+                    ((limb) ((s32) in[2])) * ((s32) in[4]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[6]) +
+               2 *  ((limb) ((s32) in[1])) * ((s32) in[5]));
+  output[7] =  2 * (((limb) ((s32) in[3])) * ((s32) in[4]) +
+                    ((limb) ((s32) in[2])) * ((s32) in[5]) +
+                    ((limb) ((s32) in[1])) * ((s32) in[6]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[7]));
+  output[8] =       ((limb) ((s32) in[4])) * ((s32) in[4]) +
+               2 * (((limb) ((s32) in[2])) * ((s32) in[6]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[8]) +
+               2 * (((limb) ((s32) in[1])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[3])) * ((s32) in[5])));
+  output[9] =  2 * (((limb) ((s32) in[4])) * ((s32) in[5]) +
+                    ((limb) ((s32) in[3])) * ((s32) in[6]) +
+                    ((limb) ((s32) in[2])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[1])) * ((s32) in[8]) +
+                    ((limb) ((s32) in[0])) * ((s32) in[9]));
+  output[10] = 2 * (((limb) ((s32) in[5])) * ((s32) in[5]) +
+                    ((limb) ((s32) in[4])) * ((s32) in[6]) +
+                    ((limb) ((s32) in[2])) * ((s32) in[8]) +
+               2 * (((limb) ((s32) in[3])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[1])) * ((s32) in[9])));
+  output[11] = 2 * (((limb) ((s32) in[5])) * ((s32) in[6]) +
+                    ((limb) ((s32) in[4])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[3])) * ((s32) in[8]) +
+                    ((limb) ((s32) in[2])) * ((s32) in[9]));
+  output[12] =      ((limb) ((s32) in[6])) * ((s32) in[6]) +
+               2 * (((limb) ((s32) in[4])) * ((s32) in[8]) +
+               2 * (((limb) ((s32) in[5])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[3])) * ((s32) in[9])));
+  output[13] = 2 * (((limb) ((s32) in[6])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[5])) * ((s32) in[8]) +
+                    ((limb) ((s32) in[4])) * ((s32) in[9]));
+  output[14] = 2 * (((limb) ((s32) in[7])) * ((s32) in[7]) +
+                    ((limb) ((s32) in[6])) * ((s32) in[8]) +
+               2 *  ((limb) ((s32) in[5])) * ((s32) in[9]));
+  output[15] = 2 * (((limb) ((s32) in[7])) * ((s32) in[8]) +
+                    ((limb) ((s32) in[6])) * ((s32) in[9]));
+  output[16] =      ((limb) ((s32) in[8])) * ((s32) in[8]) +
+               4 *  ((limb) ((s32) in[7])) * ((s32) in[9]);
+  output[17] = 2 *  ((limb) ((s32) in[8])) * ((s32) in[9]);
+  output[18] = 2 *  ((limb) ((s32) in[9])) * ((s32) in[9]);
+}
+
+static void
+fsquare(limb *output, const limb *in) {
+  limb t[19];
+  fsquare_inner(t, in);
+  freduce_degree(t);
+  freduce_coefficients(t);
+  memcpy(output, t, sizeof(limb) * 10);
+}
+
+/* Take a little-endian, 32-byte number and expand it into polynomial form */
+static void
+fexpand(limb *output, const u8 *input) {
+#define F(n,start,shift,mask) \
+  output[n] = ((((limb) input[start + 0]) | \
+                ((limb) input[start + 1]) << 8 | \
+                ((limb) input[start + 2]) << 16 | \
+                ((limb) input[start + 3]) << 24) >> shift) & mask;
+  F(0, 0, 0, 0x3ffffff);
+  F(1, 3, 2, 0x1ffffff);
+  F(2, 6, 3, 0x3ffffff);
+  F(3, 9, 5, 0x1ffffff);
+  F(4, 12, 6, 0x3ffffff);
+  F(5, 16, 0, 0x1ffffff);
+  F(6, 19, 1, 0x3ffffff);
+  F(7, 22, 3, 0x1ffffff);
+  F(8, 25, 4, 0x3ffffff);
+  F(9, 28, 6, 0x3ffffff);
+#undef F
+}
+
+#if (-32 >> 1) != -16
+#error "This code only works when >> does sign-extension on negative numbers"
+#endif
+
+/* Take a fully reduced polynomial form number and contract it into a
+ * little-endian, 32-byte array
+ */
+static void
+fcontract(u8 *output, limb *input) {
+  int i;
+  int j;
+
+  for (j = 0; j < 2; ++j) {
+    for (i = 0; i < 9; ++i) {
+      if ((i & 1) == 1) {
+        /* This calculation is a time-invariant way to make input[i] positive
+           by borrowing from the next-larger limb.
+        */
+        const s32 mask = (s32)(input[i]) >> 31;
+        const s32 carry = -(((s32)(input[i]) & mask) >> 25);
+        input[i] = (s32)(input[i]) + (carry << 25);
+        input[i+1] = (s32)(input[i+1]) - carry;
+      } else {
+        const s32 mask = (s32)(input[i]) >> 31;
+        const s32 carry = -(((s32)(input[i]) & mask) >> 26);
+        input[i] = (s32)(input[i]) + (carry << 26);
+        input[i+1] = (s32)(input[i+1]) - carry;
+      }
+    }
+    {
+      const s32 mask = (s32)(input[9]) >> 31;
+      const s32 carry = -(((s32)(input[9]) & mask) >> 25);
+      input[9] = (s32)(input[9]) + (carry << 25);
+      input[0] = (s32)(input[0]) - (carry * 19);
+    }
+  }
+
+  /* The first borrow-propagation pass above ended with every limb
+     except (possibly) input[0] non-negative.
+
+     Since each input limb except input[0] is decreased by at most 1
+     by a borrow-propagation pass, the second borrow-propagation pass
+     could only have wrapped around to decrease input[0] again if the
+     first pass left input[0] negative *and* input[1] through input[9]
+     were all zero.  In that case, input[1] is now 2^25 - 1, and this
+     last borrow-propagation step will leave input[1] non-negative.
+  */
+  {
+    const s32 mask = (s32)(input[0]) >> 31;
+    const s32 carry = -(((s32)(input[0]) & mask) >> 26);
+    input[0] = (s32)(input[0]) + (carry << 26);
+    input[1] = (s32)(input[1]) - carry;
+  }
+
+  /* Both passes through the above loop, plus the last 0-to-1 step, are
+     necessary: if input[9] is -1 and input[0] through input[8] are 0,
+     negative values will remain in the array until the end.
+   */
+
+  input[1] <<= 2;
+  input[2] <<= 3;
+  input[3] <<= 5;
+  input[4] <<= 6;
+  input[6] <<= 1;
+  input[7] <<= 3;
+  input[8] <<= 4;
+  input[9] <<= 6;
+#define F(i, s) \
+  output[s+0] |=  input[i] & 0xff; \
+  output[s+1]  = (input[i] >> 8) & 0xff; \
+  output[s+2]  = (input[i] >> 16) & 0xff; \
+  output[s+3]  = (input[i] >> 24) & 0xff;
+  output[0] = 0;
+  output[16] = 0;
+  F(0,0);
+  F(1,3);
+  F(2,6);
+  F(3,9);
+  F(4,12);
+  F(5,16);
+  F(6,19);
+  F(7,22);
+  F(8,25);
+  F(9,28);
+#undef F
+}
+
+/* Input: Q, Q', Q-Q'
+ * Output: 2Q, Q+Q'
+ *
+ *   x2 z2: long form
+ *   x3 z3: long form
+ *   x z: short form, destroyed
+ *   xprime zprime: short form, destroyed
+ *   qmqp: short form, preserved
+ */
+static void fmonty(limb *x2, limb *z2,  /* output 2Q */
+                   limb *x3, limb *z3,  /* output Q + Q' */
+                   limb *x, limb *z,    /* input Q */
+                   limb *xprime, limb *zprime,  /* input Q' */
+                   const limb *qmqp /* input Q - Q' */) {
+  limb origx[10], origxprime[10], zzz[19], xx[19], zz[19], xxprime[19],
+        zzprime[19], zzzprime[19], xxxprime[19];
+
+  memcpy(origx, x, 10 * sizeof(limb));
+  fsum(x, z);
+  fdifference(z, origx);  // does z = x - z
+
+  memcpy(origxprime, xprime, sizeof(limb) * 10);
+  fsum(xprime, zprime);
+  fdifference(zprime, origxprime);
+  fproduct(xxprime, xprime, z);
+  fproduct(zzprime, x, zprime);
+  freduce_degree(xxprime);
+  freduce_coefficients(xxprime);
+  freduce_degree(zzprime);
+  freduce_coefficients(zzprime);
+  memcpy(origxprime, xxprime, sizeof(limb) * 10);
+  fsum(xxprime, zzprime);
+  fdifference(zzprime, origxprime);
+  fsquare(xxxprime, xxprime);
+  fsquare(zzzprime, zzprime);
+  fproduct(zzprime, zzzprime, qmqp);
+  freduce_degree(zzprime);
+  freduce_coefficients(zzprime);
+  memcpy(x3, xxxprime, sizeof(limb) * 10);
+  memcpy(z3, zzprime, sizeof(limb) * 10);
+
+  fsquare(xx, x);
+  fsquare(zz, z);
+  fproduct(x2, xx, zz);
+  freduce_degree(x2);
+  freduce_coefficients(x2);
+  fdifference(zz, xx);  // does zz = xx - zz
+  memset(zzz + 10, 0, sizeof(limb) * 9);
+  fscalar_product(zzz, zz, 121665);
+  /* No need to call freduce_degree here:
+     fscalar_product doesn't increase the degree of its input. */
+  freduce_coefficients(zzz);
+  fsum(zzz, xx);
+  fproduct(z2, zz, zzz);
+  freduce_degree(z2);
+  freduce_coefficients(z2);
+}
+
+/* Conditionally swap two reduced-form limb arrays if 'iswap' is 1, but leave
+ * them unchanged if 'iswap' is 0.  Runs in data-invariant time to avoid
+ * side-channel attacks.
+ *
+ * NOTE that this function requires that 'iswap' be 1 or 0; other values give
+ * wrong results.  Also, the two limb arrays must be in reduced-coefficient,
+ * reduced-degree form: the values in a[10..18] or b[10..18] aren't swapped,
+ * and all values in a[0..9],b[0..9] must have magnitude less than
+ * INT32_MAX.
+ */
+static void
+swap_conditional(limb a[19], limb b[19], limb iswap) {
+  unsigned i;
+  const s32 swap = (s32) -iswap;
+
+  for (i = 0; i < 10; ++i) {
+    const s32 x = swap & ( ((s32)a[i]) ^ ((s32)b[i]) );
+    a[i] = ((s32)a[i]) ^ x;
+    b[i] = ((s32)b[i]) ^ x;
+  }
+}
+
+/* Calculates nQ where Q is the x-coordinate of a point on the curve
+ *
+ *   resultx/resultz: the x coordinate of the resulting curve point (short form)
+ *   n: a little endian, 32-byte number
+ *   q: a point of the curve (short form)
+ */
+static void
+cmult(limb *resultx, limb *resultz, const u8 *n, const limb *q) {
+  limb a[19] = {0}, b[19] = {1}, c[19] = {1}, d[19] = {0};
+  limb *nqpqx = a, *nqpqz = b, *nqx = c, *nqz = d, *t;
+  limb e[19] = {0}, f[19] = {1}, g[19] = {0}, h[19] = {1};
+  limb *nqpqx2 = e, *nqpqz2 = f, *nqx2 = g, *nqz2 = h;
+
+  unsigned i, j;
+
+  memcpy(nqpqx, q, sizeof(limb) * 10);
+
+  for (i = 0; i < 32; ++i) {
+    u8 byte = n[31 - i];
+    for (j = 0; j < 8; ++j) {
+      const limb bit = byte >> 7;
+
+      swap_conditional(nqx, nqpqx, bit);
+      swap_conditional(nqz, nqpqz, bit);
+      fmonty(nqx2, nqz2,
+             nqpqx2, nqpqz2,
+             nqx, nqz,
+             nqpqx, nqpqz,
+             q);
+      swap_conditional(nqx2, nqpqx2, bit);
+      swap_conditional(nqz2, nqpqz2, bit);
+
+      t = nqx;
+      nqx = nqx2;
+      nqx2 = t;
+      t = nqz;
+      nqz = nqz2;
+      nqz2 = t;
+      t = nqpqx;
+      nqpqx = nqpqx2;
+      nqpqx2 = t;
+      t = nqpqz;
+      nqpqz = nqpqz2;
+      nqpqz2 = t;
+
+      byte <<= 1;
+    }
+  }
+
+  memcpy(resultx, nqx, sizeof(limb) * 10);
+  memcpy(resultz, nqz, sizeof(limb) * 10);
+}
+
+// -----------------------------------------------------------------------------
+// Shamelessly copied from djb's code
+// -----------------------------------------------------------------------------
+static void
+crecip(limb *out, const limb *z) {
+  limb z2[10];
+  limb z9[10];
+  limb z11[10];
+  limb z2_5_0[10];
+  limb z2_10_0[10];
+  limb z2_20_0[10];
+  limb z2_50_0[10];
+  limb z2_100_0[10];
+  limb t0[10];
+  limb t1[10];
+  int i;
+
+  /* 2 */ fsquare(z2,z);
+  /* 4 */ fsquare(t1,z2);
+  /* 8 */ fsquare(t0,t1);
+  /* 9 */ fmul(z9,t0,z);
+  /* 11 */ fmul(z11,z9,z2);
+  /* 22 */ fsquare(t0,z11);
+  /* 2^5 - 2^0 = 31 */ fmul(z2_5_0,t0,z9);
+
+  /* 2^6 - 2^1 */ fsquare(t0,z2_5_0);
+  /* 2^7 - 2^2 */ fsquare(t1,t0);
+  /* 2^8 - 2^3 */ fsquare(t0,t1);
+  /* 2^9 - 2^4 */ fsquare(t1,t0);
+  /* 2^10 - 2^5 */ fsquare(t0,t1);
+  /* 2^10 - 2^0 */ fmul(z2_10_0,t0,z2_5_0);
+
+  /* 2^11 - 2^1 */ fsquare(t0,z2_10_0);
+  /* 2^12 - 2^2 */ fsquare(t1,t0);
+  /* 2^20 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t0,t1); fsquare(t1,t0); }
+  /* 2^20 - 2^0 */ fmul(z2_20_0,t1,z2_10_0);
+
+  /* 2^21 - 2^1 */ fsquare(t0,z2_20_0);
+  /* 2^22 - 2^2 */ fsquare(t1,t0);
+  /* 2^40 - 2^20 */ for (i = 2;i < 20;i += 2) { fsquare(t0,t1); fsquare(t1,t0); }
+  /* 2^40 - 2^0 */ fmul(t0,t1,z2_20_0);
+
+  /* 2^41 - 2^1 */ fsquare(t1,t0);
+  /* 2^42 - 2^2 */ fsquare(t0,t1);
+  /* 2^50 - 2^10 */ for (i = 2;i < 10;i += 2) { fsquare(t1,t0); fsquare(t0,t1); }
+  /* 2^50 - 2^0 */ fmul(z2_50_0,t0,z2_10_0);
+
+  /* 2^51 - 2^1 */ fsquare(t0,z2_50_0);
+  /* 2^52 - 2^2 */ fsquare(t1,t0);
+  /* 2^100 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); }
+  /* 2^100 - 2^0 */ fmul(z2_100_0,t1,z2_50_0);
+
+  /* 2^101 - 2^1 */ fsquare(t1,z2_100_0);
+  /* 2^102 - 2^2 */ fsquare(t0,t1);
+  /* 2^200 - 2^100 */ for (i = 2;i < 100;i += 2) { fsquare(t1,t0); fsquare(t0,t1); }
+  /* 2^200 - 2^0 */ fmul(t1,t0,z2_100_0);
+
+  /* 2^201 - 2^1 */ fsquare(t0,t1);
+  /* 2^202 - 2^2 */ fsquare(t1,t0);
+  /* 2^250 - 2^50 */ for (i = 2;i < 50;i += 2) { fsquare(t0,t1); fsquare(t1,t0); }
+  /* 2^250 - 2^0 */ fmul(t0,t1,z2_50_0);
+
+  /* 2^251 - 2^1 */ fsquare(t1,t0);
+  /* 2^252 - 2^2 */ fsquare(t0,t1);
+  /* 2^253 - 2^3 */ fsquare(t1,t0);
+  /* 2^254 - 2^4 */ fsquare(t0,t1);
+  /* 2^255 - 2^5 */ fsquare(t1,t0);
+  /* 2^255 - 21 */ fmul(out,t1,z11);
+}
+
+void curve25519_normalize(u8 *e) {
+  e[0] &= 248;
+  e[31] &= 127;
+  e[31] |= 64;
+}
+
+void curve25519_donna_ref(uint8_t *mypublic, const uint8_t *secret, const uint8_t *basepoint) {
+  limb bp[10], x[10], z[11], zmone[10];
+  uint8_t e[32];
+  int i;
+
+  for (i = 0; i < 32; ++i) e[i] = secret[i];
+  e[0] &= 248;
+  e[31] &= 127;
+  e[31] |= 64;
+
+  fexpand(bp, basepoint);
+  cmult(x, z, e, bp);
+  crecip(zmone, z);
+  fmul(z, x, zmone);
+  freduce_coefficients(z);
+  fcontract(mypublic, z);
+}
+
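Note on the limb layout used in crypto/curve25519-donna.cpp: a field element is split into ten limbs of alternating 26 and 25 bits, so limb `i` starts at bit `ceil(25.5 * i)`. The sketch below recomputes those offsets and pulls one limb out of a 32-byte little-endian number bit by bit. The names `limb_offset`, `limb_width` and `extract_limb` are hypothetical, and the loop is written for clarity rather than speed. One difference to be aware of: the `fexpand` in this commit masks limb 9 with `0x3ffffff`, which also keeps bit 255, while this sketch uses the nominal 25-bit width and drops it.

```cpp
#include <cassert>
#include <cstdint>

// Bit position at which limb i (0..9) starts: ceil(25.5 * i).
// Yields 0, 26, 51, 77, 102, 128, 153, 179, 204, 230.
static unsigned limb_offset(unsigned i) {
  return (51 * i + 1) / 2;
}

// Nominal width of limb i in bits: 26 for even i, 25 for odd i.
static unsigned limb_width(unsigned i) {
  return (i & 1) ? 25 : 26;
}

// Extract limb i from a 32-byte little-endian number, one bit at a time.
// Assumption: nominal widths are used for every limb, including limb 9.
static int64_t extract_limb(const uint8_t in[32], unsigned i) {
  int64_t v = 0;
  const unsigned off = limb_offset(i);
  for (unsigned b = 0; b < limb_width(i); ++b) {
    const unsigned bit = off + b;
    v |= (int64_t)((in[bit >> 3] >> (bit & 7)) & 1) << b;
  }
  return v;
}
```

The offsets agree with the `(start, shift)` pairs passed to the `F` macro in `fexpand`: for example limb 3 uses start byte 9 and shift 5, and 9 * 8 + 5 = 77.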
--- a/crypto/curve25519-donna.h
+++ b/crypto/curve25519-donna.h
@ -0,0 +1,17 @@
+#ifndef TUNSAFE_CRYPTO_CURVE25519_DONNA_H_
+#define TUNSAFE_CRYPTO_CURVE25519_DONNA_H_
+
+#include "tunsafe_types.h"
+
+void curve25519_donna_ref(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint);
+extern "C" void curve25519_donna_x64(uint8 *mypublic, const uint8 *secret, const uint8 *basepoint);
+
+#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+#define curve25519_donna curve25519_donna_x64
+#else
+#define curve25519_donna curve25519_donna_ref
+#endif
+
+void curve25519_normalize(uint8 *e);
+
+#endif  // TUNSAFE_CRYPTO_CURVE25519_DONNA_H_
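Note on `curve25519_normalize`: it applies the usual Curve25519 scalar clamping, which is the same three masking steps that `curve25519_donna_ref` performs on its copy of the secret. The sketch below restates those steps in a standalone form so the effect can be checked without linking TunSafe. `clamp_scalar` and `is_clamped` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in mirroring the three masking steps of curve25519_normalize.
static void clamp_scalar(uint8_t e[32]) {
  e[0] &= 248;   // clear bits 0..2, so the scalar is a multiple of 8
  e[31] &= 127;  // clear bit 255
  e[31] |= 64;   // set bit 254
}

// Returns true if e already has the clamped shape.
static bool is_clamped(const uint8_t e[32]) {
  return (e[0] & 7) == 0 && (e[31] & 0x80) == 0 && (e[31] & 0x40) != 0;
}
```

Clamping is idempotent, so applying it to an already clamped key leaves the key unchanged.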
--- a/crypto/curve25519_x64_nasm.asm
+++ b/crypto/curve25519_x64_nasm.asm
--- a/crypto/make_all_asm_files.sh
+++ b/crypto/make_all_asm_files.sh
@ -0,0 +1,28 @@
+#!/bin/sh
+
+set -e
+
+# macOS
+perl make_chacha20_x64.pl macosx > chacha20_x64_gas_macosx.s
+perl make_poly1305_x64.pl macosx > poly1305_x64_gas_macosx.s
+
+cd aesgcm
+
+perl aesni-gcm-x86_64.pl macosx > aesni_gcm_x64_gas_macosx.s
+perl aesni-x86_64.pl macosx > aesni_x64_gas_macosx.s
+perl ghash-x86_64.pl macosx > ghash_x64_gas_macosx.s
+
+cd ..
+
+
+# Linux, FreeBSD
+perl make_chacha20_x64.pl gas > chacha20_x64_gas.s
+perl make_poly1305_x64.pl gas > poly1305_x64_gas.s
+
+cd aesgcm
+
+perl aesni-gcm-x86_64.pl gas > aesni_gcm_x64_gas.s
+perl aesni-x86_64.pl gas > aesni_x64_gas.s
+perl ghash-x86_64.pl gas > ghash_x64_gas.s
+
+cd ..
--- a/crypto/make_chacha20_x64.pl
+++ b/crypto/make_chacha20_x64.pl
--- a/crypto/make_poly1305_x64.pl
+++ b/crypto/make_poly1305_x64.pl
--- a/crypto/make_poly1305_x86.pl
+++ b/crypto/make_poly1305_x86.pl
--- a/crypto/nasm.props
+++ b/crypto/nasm.props
@ -0,0 +1,18 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <PropertyGroup
+    Condition="'$(NASMBeforeTargets)' == '' and '$(NASMAfterTargets)' == '' and '$(ConfigurationType)' != 'Makefile'">
+    <NASMBeforeTargets>Midl</NASMBeforeTargets>
+    <NASMAfterTargets>CustomBuild</NASMAfterTargets>
+  </PropertyGroup>
+  <ItemDefinitionGroup>
+    <NASM>
+      <OutputFormat>$(IntDir)%(FileName).obj</OutputFormat>
+      <PackAlignmentBoundary>0</PackAlignmentBoundary>
+      <CommandLineTemplate Condition="'$(Platform)' == 'Win32'">c:\dev\nasm\nasm.exe -f win32 [AllOptions] [AdditionalOptions] %(FullPath)</CommandLineTemplate>
+      <CommandLineTemplate Condition="'$(Platform)' == 'X64'">c:\dev\nasm\nasm.exe -f win64 [AllOptions]  [AdditionalOptions] %(FullPath)</CommandLineTemplate>
+      <CommandLineTemplate Condition="'$(Platform)' != 'Win32' and '$(Platform)' != 'X64'">echo NASM not supported on this platform</CommandLineTemplate>
+      <ExecutionDescription>Assembling [Inputs]...</ExecutionDescription>
+    </NASM>
+  </ItemDefinitionGroup>
+</Project>
--- a/crypto/nasm.targets
+++ b/crypto/nasm.targets
@ -0,0 +1,82 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <PropertyPageSchema
+      Include="$(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml" />
+    <AvailableItemName Include="NASM">
+      <Targets>_NASM</Targets>
+    </AvailableItemName>
+  </ItemGroup>
+  <PropertyGroup>
+    <ComputeLinkInputsTargets>
+      $(ComputeLinkInputsTargets);
+      ComputeNASMOutput;
+    </ComputeLinkInputsTargets>
+    <ComputeLibInputsTargets>
+      $(ComputeLibInputsTargets);
+      ComputeNASMOutput;
+    </ComputeLibInputsTargets>
+  </PropertyGroup>
+  <UsingTask
+    TaskName="NASM"
+    TaskFactory="XamlTaskFactory"
+    AssemblyName="Microsoft.Build.Tasks.v4.0">
+    <Task>$(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml</Task>
+  </UsingTask>
+  <Target
+    Name="_NASM"
+    BeforeTargets="$(NASMBeforeTargets)"
+    AfterTargets="$(NASMAfterTargets)"
+    Condition="'@(NASM)' != ''"
+    Outputs="%(NASM.OutputFormat)"
+    Inputs="%(NASM.Identity);%(NASM.AdditionalDependencies);$(MSBuildProjectFile)"
+    DependsOnTargets="_SelectedFiles">
+    <ItemGroup Condition="'@(SelectedFiles)' != ''">
+      <NASM Remove="@(NASM)" Condition="'%(Identity)' != '@(SelectedFiles)'" />
+    </ItemGroup>
+    <ItemGroup>
+      <NASM_tlog Include="%(NASM.OutputFormat)" Condition="'%(NASM.OutputFormat)' != '' and '%(NASM.ExcludedFromBuild)' != 'true'">
+        <Source>@(NASM, '|')</Source>
+      </NASM_tlog>
+    </ItemGroup>
+    <Message
+      Importance="High"
+      Text="%(NASM.ExecutionDescription)" />
+    <WriteLinesToFile
+      Condition="'@(NASM_tlog)' != '' and '%(NASM_tlog.ExcludedFromBuild)' != 'true'"
+      File="$(IntDir)$(ProjectName).write.1.tlog"
+      Lines="^%(NASM_tlog.Source);@(NASM_tlog-&gt;'%(Fullpath)')"/>
+    <NASM
+      Condition="'@(NASM)' != '' and '%(NASM.ExcludedFromBuild)' != 'true'"
+      Inputs="%(NASM.Inputs)"
+      OutputFormat="%(NASM.OutputFormat)"
+      AssembledCodeListingFile="%(NASM.AssembledCodeListingFile)"
+      GenerateDebugInformation="%(NASM.GenerateDebugInformation)"
+      ErrorReporting="%(NASM.ErrorReporting)"
+      IncludePaths="%(NASM.IncludePaths)"
+      PreprocessorDefinitions="%(NASM.PreprocessorDefinitions)"
+      UndefinePreprocessorDefinitions="%(NASM.UndefinePreprocessorDefinitions)"
+      ErrorReportingFormat="%(NASM.ErrorReportingFormat)"
+      TreatWarningsAsErrors="%(NASM.TreatWarningsAsErrors)"
+      floatunderflow="%(NASM.floatunderflow)"
+      macrodefaults="%(NASM.macrodefaults)"
+      user="%(NASM.user)"
+      floatoverflow="%(NASM.floatoverflow)"
+      floatdenorm="%(NASM.floatdenorm)"
+      numberoverflow="%(NASM.numberoverflow)"
+      macroselfref="%(NASM.macroselfref)"
+      floattoolong="%(NASM.floattoolong)"
+      orphanlabels="%(NASM.orphanlabels)"
+      CommandLineTemplate="%(NASM.CommandLineTemplate)"
+      AdditionalOptions="%(NASM.AdditionalOptions)"
+ />
+  </Target>
+  <Target
+    Name="ComputeNASMOutput"
+    Condition="'@(NASM)' != ''">
+    <ItemGroup>
+      <Link Include="@(NASM->Metadata('OutputFormat')->Distinct()->ClearMetadata())" Condition="'%(NASM.ExcludedFromBuild)' != 'true'"/>
+      <Lib Include="@(NASM->Metadata('OutputFormat')->Distinct()->ClearMetadata())" Condition="'%(NASM.ExcludedFromBuild)' != 'true'"/>
+    </ItemGroup>
+  </Target>
+</Project>
--- a/crypto/nasm.xml
+++ b/crypto/nasm.xml
@ -0,0 +1,308 @@
+<?xml version="1.0" encoding="utf-8"?>
+<ProjectSchemaDefinitions xmlns="http://schemas.microsoft.com/build/2009/properties" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:sys="clr-namespace:System;assembly=mscorlib">
+  <Rule
+    Name="NASM"
+    PageTemplate="tool"
+    DisplayName="Netwide Assembler"
+    Order="200">
+    <Rule.DataSource>
+      <DataSource
+        Persistence="ProjectFile"
+        ItemType="NASM" />
+    </Rule.DataSource>
+    <Rule.Categories>
+      <Category
+        Name="General">
+        <Category.DisplayName>
+          <sys:String>General</sys:String>
+        </Category.DisplayName>
+      </Category>
+	  <Category
+        Name="Preprocessor">
+        <Category.DisplayName>
+          <sys:String>Preprocessing Options</sys:String>
+        </Category.DisplayName>
+      </Category>
+	  <Category
+        Name="Assembler Options">
+        <Category.DisplayName>
+          <sys:String>Assembler Options</sys:String>
+        </Category.DisplayName>
+      </Category>
+	  <Category
+        Name="Advanced">
+        <Category.DisplayName>
+          <sys:String>Advanced </sys:String>
+        </Category.DisplayName>
+      </Category>	  
+      <Category
+        Name="Command Line"
+        Subtype="CommandLine">
+        <Category.DisplayName>
+          <sys:String>Command Line</sys:String>
+        </Category.DisplayName>
+      </Category>
+    </Rule.Categories>
+    <StringProperty
+      Name="Inputs"
+      Category="Command Line"
+      IsRequired="true">
+      <StringProperty.DataSource>
+        <DataSource
+          Persistence="ProjectFile"
+          ItemType="NASM"
+          SourceType="Item" />
+      </StringProperty.DataSource>
+    </StringProperty>
+    
+  <StringProperty
+	  Name="OutputFormat"	  
+      Category="Assembler Options"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Output File Name"
+      Description="Specifies the output file name.     (-o [value])"
+	  Switch="-o [value]"
+	/>  
+
+
+	<StringListProperty
+	Name="AssembledCodeListingFile"
+	Category="Assembler Options"
+	DisplayName="Assembled Code Listing File"	
+	Description="Generates an assembled code listing file.     (-l [file])"
+	HelpUrl="http://www.nasm.us/doc/"
+	Switch="-l &quot;[value]&quot;"
+	/>
+	
+	<BoolProperty
+	Name="GenerateDebugInformation"
+	Category="Assembler Options"
+	DisplayName="Generate Debug Information"
+	Description="Generates Debug Information.     (-g)"
+	HelpUrl="http://www.nasm.us/doc/"
+	Switch="-g"	
+	/>
+	  
+  <StringListProperty
+      Name="ErrorReporting"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Redirect Error Messages to File"
+      Description="Redirects error messages to the specified file.     (-Z [file])"
+      Switch="-Z &quot;[value]&quot;"        
+    />
+	
+	<StringListProperty
+	Name="IncludePaths"
+	Category="General"
+	DisplayName="Include Paths"
+	Description="Sets path for include file.     (-I[path])"
+	HelpUrl="http://www.nasm.us/doc/"
+	Switch="-I[value]"
+	
+	/>
+	
+	<StringListProperty
+	Name="PreprocessorDefinitions"
+    Category="Preprocessor"
+    HelpUrl="http://www.nasm.us/doc/"
+    DisplayName="Preprocessor Definitions"
+    Description="Defines a text macro with the given name.     (-D[symbol])"
+	Switch="-D[value]"
+
+	/>
+	
+  <StringListProperty
+	Name="UndefinePreprocessorDefinitions"
+	Category="Preprocessor"
+	HelpUrl="http://www.nasm.us/doc/"
+	DisplayName="Undefine Preprocessor Definitions"
+	Description="Undefines a text macro with the given name.     (-U[symbol])"	
+	Switch="-U[value]"
+	/>
+	
+	<EnumProperty
+      Name="ErrorReportingFormat"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Error Reporting Format"
+      Description="Selects the error reporting format, i.e. GNU or VC.">
+      <EnumValue
+        Name="0"
+        DisplayName="-Xgnu	GNU format: Default format"
+        Switch="-Xgnu" />
+      <EnumValue
+        Name="1"
+        DisplayName="-Xvc	Style used by Microsoft Visual C++"
+        Switch="-Xvc" />      
+    </EnumProperty>
+	
+	<BoolProperty
+	Name="TreatWarningsAsErrors"
+	Category="Assembler Options"
+	DisplayName="Treat Warnings As Errors"
+	Description="Returns an error code if warnings are generated.     (-Werror)"
+	HelpUrl="http://www.nasm.us/doc/"
+	Switch="-Werror"
+	/>
+
+	<BoolProperty
+      Name="floatunderflow"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="float-underflow"
+      Description="floating point underflow (default off)"
+      Switch="-w+float-underflow" />
+
+  <BoolProperty
+      Name="macrodefaults"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable macro-defaults"
+      Description="macros with more default than optional parameters (default on)"
+      Switch="-w-macro-defaults" />
+
+  <BoolProperty
+      Name="user"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable user"
+      Description="%warning directives (default on)"
+      Switch="-w-user" />
+
+  <BoolProperty
+      Name="floatoverflow"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable float-overflow"
+      Description="floating point overflow (default on)"
+      Switch="-w-float-overflow" />
+
+  <BoolProperty
+      Name="floatdenorm"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="float-denorm"
+      Description="floating point denormal (default off)"
+      Switch="-w+float-denorm" />
+
+  <BoolProperty
+      Name="numberoverflow"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable number-overflow"
+      Description="numeric constant does not fit (default on)"
+      Switch="-w-number-overflow" />
+
+  <BoolProperty
+      Name="macroselfref"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="macro-selfref"
+      Description="cyclic macro references (default off)"
+      Switch="-w+macro-selfref" />
+
+  <BoolProperty
+      Name="floattoolong"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable float-toolong"
+      Description="too many digits in floating-point number (default on)"
+      Switch="-w-float-toolong" />
+
+  <BoolProperty
+      Name="orphanlabels"
+      Category="Advanced"
+      HelpUrl="http://www.nasm.us/doc/"
+      DisplayName="Disable orphan-labels"
+      Description="labels alone on lines without trailing `:' (default on)"
+      Switch="-w-orphan-labels" />
+
+  <StringProperty
+      Name="CommandLineTemplate"
+      DisplayName="Command Line"
+      Visible="False"
+      IncludeInCommandLine="False" />
+
+  <DynamicEnumProperty
+        Name="NASMBeforeTargets"
+        Category="General"
+        EnumProvider="Targets"
+        IncludeInCommandLine="False">
+      <DynamicEnumProperty.DisplayName>
+        <sys:String>Execute Before</sys:String>
+      </DynamicEnumProperty.DisplayName>
+      <DynamicEnumProperty.Description>
+        <sys:String>Specifies the targets for the build customization to run before.</sys:String>
+      </DynamicEnumProperty.Description>
+      <DynamicEnumProperty.ProviderSettings>
+        <NameValuePair
+          Name="Exclude"
+          Value="^NASMBeforeTargets|^Compute" />
+      </DynamicEnumProperty.ProviderSettings>
+      <DynamicEnumProperty.DataSource>
+        <DataSource
+          Persistence="ProjectFile"
+          ItemType=""
+          HasConfigurationCondition="true" />
+      </DynamicEnumProperty.DataSource>
+    </DynamicEnumProperty>
+  <DynamicEnumProperty
+      Name="NASMAfterTargets"
+      Category="General"
+      EnumProvider="Targets"
+      IncludeInCommandLine="False">
+      <DynamicEnumProperty.DisplayName>
+        <sys:String>Execute After</sys:String>
+      </DynamicEnumProperty.DisplayName>
+      <DynamicEnumProperty.Description>
+        <sys:String>Specifies the targets for the build customization to run after.</sys:String>
+      </DynamicEnumProperty.Description>
+      <DynamicEnumProperty.ProviderSettings>
+        <NameValuePair
+          Name="Exclude"
+          Value="^NASMAfterTargets|^Compute" />
+      </DynamicEnumProperty.ProviderSettings>
+      <DynamicEnumProperty.DataSource>
+        <DataSource
+          Persistence="ProjectFile"
+          ItemType=""
+          HasConfigurationCondition="true" />
+      </DynamicEnumProperty.DataSource>
+    </DynamicEnumProperty>
+  <StringProperty
+      Name="ExecutionDescription"
+      DisplayName="Execution Description"
+      IncludeInCommandLine="False"
+      Visible="False" />
+
+  <StringListProperty
+      Name="AdditionalDependencies"
+      DisplayName="Additional Dependencies"
+      IncludeInCommandLine="False"
+      Visible="False" />
+  
+  <StringProperty
+      Subtype="AdditionalOptions"
+      Name="AdditionalOptions"
+      Category="Command Line">
+      <StringProperty.DisplayName>
+        <sys:String>Additional Options</sys:String>
+      </StringProperty.DisplayName>
+      <StringProperty.Description>
+        <sys:String>Additional Options</sys:String>
+      </StringProperty.Description>
+    </StringProperty>
+  
+  </Rule>
+  <ItemType
+    Name="NASM"
+    DisplayName="Netwide Assembler" />
+  <FileExtension
+    Name="*.asm"
+    ContentType="NASM" />
+  <ContentType
+    Name="NASM"
+    DisplayName="Netwide Assembler"
+    ItemType="NASM" />
+</ProjectSchemaDefinitions>
--- a/crypto/poly1305_x64_gas.s
+++ b/crypto/poly1305_x64_gas.s
--- a/crypto/poly1305_x64_gas_macosx.s
+++ b/crypto/poly1305_x64_gas_macosx.s
--- a/crypto/poly1305_x64_nasm.asm
+++ b/crypto/poly1305_x64_nasm.asm
--- a/crypto/siphash.cpp
+++ b/crypto/siphash.cpp
@ -0,0 +1,193 @@
+/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.
+ *
+ * SipHash: a fast short-input PRF
+ * https://131002.net/siphash/
+ *
+ * This implementation is specifically for SipHash2-4 for a secure PRF
+ * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for
+ * hashtables.
+ */
+#include "stdafx.h"
+
+#include "crypto/siphash.h"
+#include "tunsafe_endian.h"
+
+#define SIPROUND \
+  do { \
+  v0 += v1; v1 = rol64(v1, 13); v1 ^= v0; v0 = rol64(v0, 32); \
+  v2 += v3; v3 = rol64(v3, 16); v3 ^= v2; \
+  v0 += v3; v3 = rol64(v3, 21); v3 ^= v0; \
+  v2 += v1; v1 = rol64(v1, 17); v1 ^= v2; v2 = rol64(v2, 32); \
+  } while (0)
+
+#define PREAMBLE(len) \
+  uint64 v0 = 0x736f6d6570736575ULL; \
+  uint64 v1 = 0x646f72616e646f6dULL; \
+  uint64 v2 = 0x6c7967656e657261ULL; \
+  uint64 v3 = 0x7465646279746573ULL; \
+  uint64 b = ((uint64)(len)) << 56; \
+  v3 ^= key->key[1]; \
+  v2 ^= key->key[0]; \
+  v1 ^= key->key[1]; \
+  v0 ^= key->key[0];
+
+#define POSTAMBLE \
+  v3 ^= b; \
+  SIPROUND; \
+  SIPROUND; \
+  v0 ^= b; \
+  v2 ^= 0xff; \
+  SIPROUND; \
+  SIPROUND; \
+  SIPROUND; \
+  SIPROUND; \
+  return (v0 ^ v1) ^ (v2 ^ v3);
+
+uint64 siphash(const void *data, size_t len, const siphash_key_t *key) {
+  // end points just past the last complete 8-byte word of the input.
+  const uint8 *end = (const uint8*)data + len - (len % sizeof(uint64));
+  const uint8 left = len & (sizeof(uint64) - 1);
+  uint64 m;
+  PREAMBLE(len)
+  for (; data != end; data = (const uint8*)data + sizeof(uint64)) {
+    m = ReadLE64(data);
+    v3 ^= m;
+    SIPROUND;
+    SIPROUND;
+    v0 ^= m;
+  }
+  // Fold the remaining 0..7 tail bytes into b; data == end at this point.
+  switch (left) {
+  case 7: b |= ((uint64)end[6]) << 48;  // fall through
+  case 6: b |= ((uint64)end[5]) << 40;  // fall through
+  case 5: b |= ((uint64)end[4]) << 32;  // fall through
+  case 4: b |= ReadLE32(data); break;
+  case 3: b |= ((uint64)end[2]) << 16;  // fall through
+  case 2: b |= ReadLE16(data); break;
+  case 1: b |= end[0];
+  }
+  POSTAMBLE
+}
+
+/**
+ * siphash_1u64 - compute 64-bit siphash PRF value of a uint64
+ * @first: first uint64
+ * @key: the siphash key
+ */
+uint64 siphash_1u64(const uint64 first, const siphash_key_t *key)
+{
+  PREAMBLE(8)
+  v3 ^= first;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= first;
+  POSTAMBLE
+}
+
+/**
+ * siphash_2u64 - compute 64-bit siphash PRF value of 2 uint64
+ * @first: first uint64
+ * @second: second uint64
+ * @key: the siphash key
+ */
+uint64 siphash_2u64(const uint64 first, const uint64 second, const siphash_key_t *key)
+{
+  PREAMBLE(16)
+  v3 ^= first;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= first;
+  v3 ^= second;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= second;
+  POSTAMBLE
+}
+
+/**
+ * siphash_3u64 - compute 64-bit siphash PRF value of 3 uint64
+ * @first: first uint64
+ * @second: second uint64
+ * @third: third uint64
+ * @key: the siphash key
+ */
+uint64 siphash_3u64(const uint64 first, const uint64 second, const uint64 third,
+     const siphash_key_t *key)
+{
+  PREAMBLE(24)
+  v3 ^= first;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= first;
+  v3 ^= second;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= second;
+  v3 ^= third;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= third;
+  POSTAMBLE
+}
+
+/**
+ * siphash_4u64 - compute 64-bit siphash PRF value of 4 uint64
+ * @first: first uint64
+ * @second: second uint64
+ * @third: third uint64
+ * @forth: fourth uint64
+ * @key: the siphash key
+ */
+uint64 siphash_4u64(const uint64 first, const uint64 second, const uint64 third,
+     const uint64 forth, const siphash_key_t *key)
+{
+  PREAMBLE(32)
+  v3 ^= first;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= first;
+  v3 ^= second;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= second;
+  v3 ^= third;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= third;
+  v3 ^= forth;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= forth;
+  POSTAMBLE
+}
+
+uint64 siphash_1u32(const uint32 first, const siphash_key_t *key)
+{
+  PREAMBLE(4)
+  b |= first;
+  POSTAMBLE
+}
+
+uint64 siphash_3u32(const uint32 first, const uint32 second, const uint32 third,
+     const siphash_key_t *key)
+{
+  uint64 combined = (uint64)second << 32 | first;
+  PREAMBLE(12)
+  v3 ^= combined;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= combined;
+  b |= third;
+  POSTAMBLE
+}
+
+uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key) {
+  PREAMBLE(12)
+  v3 ^= combined;
+  SIPROUND;
+  SIPROUND;
+  v0 ^= combined;
+  b |= third;
+  POSTAMBLE
+}
+
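The macros in crypto/siphash.cpp can be hard to follow at a glance. Below is a minimal stand-alone sketch of the same SipHash-2-4 structure: two rounds per 8-byte word, the length in the top byte of the final word, then four finalization rounds. The names `SketchKey`, `siphash24_sketch` and `hash_reference_input` are hypothetical and do not appear in TunSafe. Reading words byte by byte assumes `ReadLE64` performs a little-endian load.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative stand-alone SipHash-2-4; not part of crypto/siphash.cpp.
struct SketchKey { std::uint64_t k0, k1; };

static inline std::uint64_t rotl64(std::uint64_t x, int r) {
  return (x << r) | (x >> (64 - r));
}

// One SipRound, same operations as the SIPROUND macro.
static inline void sip_round(std::uint64_t v[4]) {
  v[0] += v[1]; v[1] = rotl64(v[1], 13); v[1] ^= v[0]; v[0] = rotl64(v[0], 32);
  v[2] += v[3]; v[3] = rotl64(v[3], 16); v[3] ^= v[2];
  v[0] += v[3]; v[3] = rotl64(v[3], 21); v[3] ^= v[0];
  v[2] += v[1]; v[1] = rotl64(v[1], 17); v[1] ^= v[2]; v[2] = rotl64(v[2], 32);
}

std::uint64_t siphash24_sketch(const std::uint8_t *p, std::size_t len, SketchKey key) {
  std::uint64_t v[4] = {
    0x736f6d6570736575ULL ^ key.k0, 0x646f72616e646f6dULL ^ key.k1,
    0x6c7967656e657261ULL ^ key.k0, 0x7465646279746573ULL ^ key.k1 };
  const std::size_t full = len - (len % 8);
  // Compression: each complete little-endian 8-byte word gets two rounds.
  for (std::size_t i = 0; i < full; i += 8) {
    std::uint64_t m = 0;
    for (std::size_t j = 0; j < 8; j++) m |= (std::uint64_t)p[i + j] << (8 * j);
    v[3] ^= m; sip_round(v); sip_round(v); v[0] ^= m;
  }
  // Final word: leftover bytes in the low positions, length in the top byte.
  std::uint64_t b = (std::uint64_t)len << 56;
  for (std::size_t j = 0; j < len % 8; j++) b |= (std::uint64_t)p[full + j] << (8 * j);
  v[3] ^= b; sip_round(v); sip_round(v); v[0] ^= b;
  // Finalization: four rounds.
  v[2] ^= 0xff;
  for (int i = 0; i < 4; i++) sip_round(v);
  return v[0] ^ v[1] ^ v[2] ^ v[3];
}

// Hashes the bytes 00 01 02 ... (len of them, capped at 64) under the key
// 00 01 ... 0f, i.e. the inputs used by the SipHash reference test vectors.
std::uint64_t hash_reference_input(std::size_t len) {
  std::uint8_t msg[64];
  for (std::size_t i = 0; i < sizeof(msg); i++) msg[i] = (std::uint8_t)i;
  const SketchKey key = {0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL};
  return siphash24_sketch(msg, len < sizeof(msg) ? len : sizeof(msg), key);
}
```

For the 15-byte input from the SipHash paper's test vector, `hash_reference_input(15)` should return `0xa129ca6149be45e5`. That makes the sketch a convenient cross-check for any specialized variant such as `siphash_1u64`.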
--- a/crypto/siphash.h
+++ b/crypto/siphash.h
@ -0,0 +1,53 @@
+/* Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.
+ *
+ * SipHash: a fast short-input PRF
+ * https://131002.net/siphash/
+ *
+ * This implementation is specifically for SipHash2-4 for a secure PRF
+ * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for
+ * hashtables.
+ */
+
+#ifndef TUNSAFE_CRYPTO_SIPHASH_H_
+#define TUNSAFE_CRYPTO_SIPHASH_H_
+
+#include "tunsafe_types.h"
+
+typedef struct {
+	uint64 key[2];
+} siphash_key_t;
+
+uint64 siphash_1u64(const uint64 a, const siphash_key_t *key);
+uint64 siphash_2u64(const uint64 a, const uint64 b, const siphash_key_t *key);
+uint64 siphash_3u64(const uint64 a, const uint64 b, const uint64 c,
+		 const siphash_key_t *key);
+uint64 siphash_4u64(const uint64 a, const uint64 b, const uint64 c, const uint64 d,
+		 const siphash_key_t *key);
+uint64 siphash_1u32(const uint32 a, const siphash_key_t *key);
+uint64 siphash_3u32(const uint32 a, const uint32 b, const uint32 c,
+		 const siphash_key_t *key);
+
+static inline uint64 siphash_2u32(const uint32 a, const uint32 b,
+			       const siphash_key_t *key)
+{
+	return siphash_1u64((uint64)b << 32 | a, key);
+}
+static inline uint64 siphash_4u32(const uint32 a, const uint32 b, const uint32 c,
+			       const uint32 d, const siphash_key_t *key)
+{
+	return siphash_2u64((uint64)b << 32 | a, (uint64)d << 32 | c, key);
+}
+
+uint64 siphash_u64_u32(const uint64 combined, const uint32 third, const siphash_key_t *key);
+
+/**
+ * siphash - compute 64-bit siphash PRF value
+ * @data: buffer to hash
+ * @len: length of @data in bytes
+ * @key: the siphash key
+ */
+uint64 siphash(const void *data, size_t len, const siphash_key_t *key);
+
+#endif  // TUNSAFE_CRYPTO_SIPHASH_H_
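The inline helpers `siphash_2u32` and `siphash_4u32` pack pairs of 32-bit values into 64-bit words before hashing. The sketch below isolates that packing expression to show which value ends up in which half. `pack_u32_pair` and `le_byte` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the expression used by siphash_2u32:
// the first value lands in the low 32 bits, the second in the high 32 bits.
std::uint64_t pack_u32_pair(std::uint32_t a, std::uint32_t b) {
  return (std::uint64_t)b << 32 | a;
}

// Returns byte i (0 = least significant) of x, i.e. the byte that a
// little-endian store of x would place at offset i.
std::uint8_t le_byte(std::uint64_t x, unsigned i) {
  return (std::uint8_t)(x >> (8 * i));
}
```

With this layout, the packed word has the same bytes as `a` followed by `b`, each stored little-endian. Assuming `siphash()` reads its input little-endian, `siphash_2u32(a, b, key)` should therefore agree with hashing those 8 bytes through the generic `siphash()` entry point.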
--- a/crypto/x86_64-xlate.pl
+++ b/crypto/x86_64-xlate.pl
--- a/crypto_ops.h
+++ b/crypto_ops.h
@ -0,0 +1,41 @@
+// SPDX-License-Identifier: AGPL-1.0-only
+// Copyright (C) 2018 Ludvig Strigeus <info@tunsafe.com>. All Rights Reserved.
+#ifndef TUNSAFE_CRYPTO_OPS_H_
+#define TUNSAFE_CRYPTO_OPS_H_
+
+#include "build_config.h"
+#include "tunsafe_types.h"
+
+#include <string.h>
+#if defined(COMPILER_MSVC)
+#include <intrin.h>
+#endif  // defined(COMPILER_MSVC)
+
+#if defined(ARCH_CPU_X86_64) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 7) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosq((uint64*)dst, 0, n >> 3);
+  }
+}
+
+#elif defined(ARCH_CPU_X86) && defined(COMPILER_MSVC)
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  if (n & 3) {
+    __stosb((unsigned char*)dst, 0, n);
+  } else {
+    __stosd((unsigned long*)dst, 0, n >> 2);
+  }
+}
+#else
+FORCEINLINE static void memzero_crypto(void *dst, size_t n) {
+  memset(dst, 0, n);
+  // Compiler barrier so the memset above is not optimized away as a dead store.
+  __asm__ __volatile__("": :"r"(dst) :"memory");
+}
+#endif
+
+int memcmp_crypto(const uint8 *a, const uint8 *b, size_t n);
+
+
+#endif  // TUNSAFE_CRYPTO_OPS_H_
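The header declares `memcmp_crypto` but its definition is not part of this file. As a rough illustration of what a comparison for secret data usually looks like, here is a minimal constant-time sketch. The name `ct_compare_sketch`, the `const void *` parameters and the 0/1 return convention are assumptions for this example, not TunSafe's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch; not the project's memcmp_crypto. Takes const void*
// for convenience, whereas the declaration above takes const uint8*.
// Returns 0 when the buffers match and 1 otherwise (assumed convention).
int ct_compare_sketch(const void *a, const void *b, std::size_t n) {
  const volatile std::uint8_t *pa = static_cast<const volatile std::uint8_t *>(a);
  const volatile std::uint8_t *pb = static_cast<const volatile std::uint8_t *>(b);
  std::uint8_t diff = 0;
  // Accumulate differences over every byte instead of returning at the
  // first mismatch, so the loop count does not depend on the contents.
  for (std::size_t i = 0; i < n; i++)
    diff = static_cast<std::uint8_t>(diff | (pa[i] ^ pb[i]));
  return diff != 0;
}
```

Unlike `memcmp`, this reports only whether the buffers differ, not their ordering. Equality is normally all that is needed when checking MACs or keys.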
--- a/icons/green-bg-icon.ico
+++ b/icons/green-bg-icon.ico
--- a/icons/green-bg-icon.png
+++ b/icons/green-bg-icon.png
--- a/icons/green-icon.ico
+++ b/icons/green-icon.ico
--- a/icons/green-icon.png
+++ b/icons/green-icon.png
--- a/icons/neutral-icon.ico
+++ b/icons/neutral-icon.ico
--- a/icons/neutral-icon.png
+++ b/icons/neutral-icon.png
--- a/icons/red-icon.ico
+++ b/icons/red-icon.ico
--- a/icons/red-icon.png
+++ b/icons/red-icon.png
--- a/installer/.gitignore
+++ b/installer/.gitignore
@ -0,0 +1,4 @@
+/tunsafe*.exe
+/x64/
+/x86/
+*.pyc
--- a/installer/ChangeLog.txt
+++ b/installer/ChangeLog.txt
@ -0,0 +1,48 @@
+2018-06-20 - TunSafe v1.3-rc3
+
+Changes:
+1.Add option to block Internet traffic outside of TunSafe. Either
+  based on firewall rules, or by adding a null route, or both.
+  The firewall rule blocks all traffic except traffic from TunSafe,
+  loopback traffic, and DHCP traffic on the default NIC.
+  The route rule adds two /1 routes to 0.0.0.0.
+2.Convert LF to CRLF when importing config files
+3.Update some logging messages
+4.Delete the old routing rule pointing at the VPN server IP when
+  disconnecting
+5.Delete any conflicting old routing rule pointing at the VPN server
+  when connecting.
+6.Fix tray popup menu not disappearing when clicking outside of it.
+7.Show config file names also in tray popup menu.
+8.Make the menu item bold if connection is selected in popup menu.
+9.Don't show the .conf filename extension in the UI.
+10.Show also config file name when hovering on tray icon.
+11.Click on the connected server to toggle connection
+12.Fix bug where internet blocking checkbox was not removed.
+13.Change so bold is used for selected server, and checkbox
+   is used when connected.
+14.Use WS_EX_COMPOSITED to reduce flicker
+15.Now possible to enter a filename on command line to connect to.
+16.Support /minimize and /minimize_on_connect command line opts.
+17.Support PreUp,PostUp,PreDown,PostDown options on [Interface]
+   Note: For security reasons you need to first enable them,
+   so either Shift-Click on Options and select Allow Pre/Post Commands
+   or specify the /allow_pre_post command line option.
+
+2018-04-29 - TunSafe v1.2
+
+Changes:
+1.Use /24 instead of failing when a /32 Address is used
+2.Use /120 instead of failing when a /128 Address is used
+3.Add routes for all entries in AllowedIPs
+
+2018-04-29 - TunSafe v1.1
+
+Changes:
+1.Retry on failed DNS lookup. Helps when resuming from sleep.
+2.Display a better message if the TAP adapter can't be found.
+3.Retry connect when getting ERROR_FILE_NOT_FOUND.
+
+2018-03-06 - TunSafe v1.0
+
+First public release.
--- a/installer/LICENSE.TXT
+++ b/installer/LICENSE.TXT
@ -0,0 +1,240 @@
+TunSafe © 2018 Ludvig Strigeus
+==============================
+
+BY USING THE SOFTWARE, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT
+THEM, DO NOT USE THE SOFTWARE.
+
+This software is provided "as is", without warranty of any kind,
+express or implied, including but not limited to the warranties of
+merchantability, fitness for a particular purpose and noninfringement.
+In no event shall the authors or copyright holders be liable for any
+claim, damages or other liability, whether in an action of contract,
+tort or otherwise, arising from, out of or in connection with the
+Software or the use or other dealings in the Software.
+
+We may not provide support services for this software in the future.
+
+You may install and use any number of copies of the software on your
+devices.
+
+Please be aware that, similar to other networking tools that capture
+network packets, the information processed by TunSafe or your VPN
+provider may include personally identifiable or other sensitive 
+information (such as usernames, passwords, addresses of web sites
+accessed). By using this software, you acknowledge that you are aware of
+this and take sole responsibility for any personally identifiable or
+other sensitive information provided to TunSafe or your VPN provider 
+through your use of the software.
+
+The software is licensed, not sold. This agreement only gives you some
+rights to use the software. Unless applicable law gives you more rights
+despite this limitation, you may use the software only as expressly
+permitted in this agreement. In doing so, you must comply with any
+technical limitations in the software that only allow you to use it in
+certain ways. You may not
+
+  * work around any technical limitations in the software;
+
+  * reverse engineer, decompile or disassemble the software, except
+    and only to the extent that applicable law expressly permits,
+    despite this limitation;
+
+  * publish the software for others to copy;
+
+  * sell, rent, lease or lend the software;
+
+  * transfer the software or this agreement to any third party; or
+
+  * use the software for commercial software hosting services.
+
+All exceptions require prior written consent from info@tunsafe.com.
+
+You can recover from us and our suppliers only direct damages up to
+U.S. $0.10. You cannot recover any other damages, including consequential,
+lost profits, special, indirect or incidental damages.
+
+This limitation applies to
+ * anything related to the software, services, content (including code)
+   on third party Internet sites, or third party programs; and
+ * claims for breach of contract, breach of warranty, guarantee or
+   condition, strict liability, negligence, or other tort to the extent
+   permitted by applicable law.
+
+It also applies even if we knew or should have known about the possibility
+of the damages. 
+
+This agreement describes certain legal rights. You may have other rights
+under the laws of your country. You may also have rights with respect to the
+party from whom you acquired the software. This agreement does not change
+your rights under the laws of your country if the laws of your country do
+not permit it to do so.
+
+This agreement is the entire agreement and is governed by the laws of Sweden.
+
+Several pieces of Open Source software were used in this product.
+Here are their licenses.
+
+BLAKE2 License
+--------------
+
+Copyright 2012, Samuel Neves <sneves@dei.uc.pt>.  You may use this under the
+terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+your option.  The terms of these licenses can be found at:
+
+- CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+- OpenSSL license   : https://www.openssl.org/source/license.html
+- Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+
+More information about the BLAKE2 hash function can be found at
+https://blake2.net.
+
+
+Curve25519-Donna License
+------------------------
+
+Copyright 2008, Google Inc.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+    * Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above
+copyright notice, this list of conditions and the following disclaimer
+in the documentation and/or other materials provided with the
+distribution.
+    * Neither the name of Google Inc. nor the names of its
+contributors may be used to endorse or promote products derived from
+this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+OpenSSL License
+---------------
+
+====================================================================
+Copyright (c) 1998-2018 The OpenSSL Project.  All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+1. Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer. 
+
+2. Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in
+   the documentation and/or other materials provided with the
+   distribution.
+
+3. All advertising materials mentioning features or use of this
+   software must display the following acknowledgment:
+   "This product includes software developed by the OpenSSL Project
+   for use in the OpenSSL Toolkit. (http://www.openssl.org/)"
+
+4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to
+   endorse or promote products derived from this software without
+   prior written permission. For written permission, please contact
+   openssl-core@openssl.org.
+
+5. Products derived from this software may not be called "OpenSSL"
+   nor may "OpenSSL" appear in their names without prior written
+   permission of the OpenSSL Project.
+
+6. Redistributions of any form whatsoever must retain the following
+   acknowledgment:
+   "This product includes software developed by the OpenSSL Project
+   for use in the OpenSSL Toolkit (http://www.openssl.org/)"
+
+THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY
+EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE OpenSSL PROJECT OR
+ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+OF THE POSSIBILITY OF SUCH DAMAGE.
+====================================================================
+
+This product includes cryptographic software written by Eric Young
+(eay@cryptsoft.com).  This product includes software written by Tim
+Hudson (tjh@cryptsoft.com).
+
+
+
+Original SSLeay License
+-----------------------
+
+Copyright (C) 1995-1998 Eric Young (eay@cryptsoft.com)
+All rights reserved.
+
+This package is an SSL implementation written
+by Eric Young (eay@cryptsoft.com).
+The implementation was written so as to conform with Netscapes SSL.
+
+This library is free for commercial and non-commercial use as long as
+the following conditions are aheared to.  The following conditions
+apply to all code found in this distribution, be it the RC4, RSA,
+lhash, DES, etc., code; not just the SSL code.  The SSL documentation
+included with this distribution is covered by the same copyright terms
+except that the holder is Tim Hudson (tjh@cryptsoft.com).
+
+Copyright remains Eric Young's, and as such any Copyright notices in
+the code are not to be removed.
+If this package is used in a product, Eric Young should be given attribution
+as the author of the parts of the library used.
+This can be in the form of a textual message at program startup or
+in documentation (online or textual) provided with the package.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+1. Redistributions of source code must retain the copyright
+   notice, this list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+3. All advertising materials mentioning features or use of this software
+   must display the following acknowledgement:
+   "This product includes cryptographic software written by
+    Eric Young (eay@cryptsoft.com)"
+   The word 'cryptographic' can be left out if the rouines from the library
+   being used are not cryptographic related :-).
+4. If you include any Windows specific code (or a derivative thereof) from 
+   the apps directory (application code) you must include an acknowledgement:
+   "This product includes software written by Tim Hudson (tjh@cryptsoft.com)"
+
+THIS SOFTWARE IS PROVIDED BY ERIC YOUNG ``AS IS'' AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
+
+The licence and distribution terms for any publically available version or
+derivative of this code cannot be changed.  i.e. this code cannot simply be
+copied and put under another distribution licence
+[including the GNU Public Licence.]
+
+
--- a/installer/TunSafe.conf
+++ b/installer/TunSafe.conf
@ -0,0 +1,46 @@
+# This is a sample config file for TunSafe. It uses the same syntax as
+# WireGuard's wg-quick tool
+
+[Interface]
+
+# The private key of this computer. This is a secret key, don't give it out.
+# To convert it to a public key you can go to 'Generate Key Pair' in TunSafe.
+PrivateKey = gIIBl0OHb3wZjYGqZtgzRml3wec0e5vqXtSvCTfa42w=
+
+# Whether we want to bind a port to allow others to initiate connections to us.
+# Please ensure this port is mapped in your router.
+# ListenPort = 51820
+
+# Switch DNS server while connected
+# DNS = 8.8.8.8 
+
+# The addresses to bind to. Either IPv4 or IPv6. /31 and /32 are not supported.
+Address = 192.168.2.2/24
+
+# Whether to block all Internet access that doesn't go through TunSafe.
+# Note that the Internet stays blocked even after TunSafe is restarted.
+# Possible values (comma separated):
+#  route - Blocks all traffic using null route entries
+#  firewall - Blocks all traffic except TunSafe through the Windows firewall
+#  on - Uses the default block mechanism
+#  off - Turns off blocking
+# BlockInternet = route, firewall
+
+[Peer]
+# The public key of the peer. Do not use the private key here. Use the 'Generate Key Pair'
+# function in TunSafe to convert a private key to a public key.
+PublicKey = hIA3ikjlSOAo0qqrI+rXaS3ZH04Yx7Q2YQ4m2Syz+XE=
+
+# It's also possible to use a preshared key for extra security
+# PresharedKey  =  SNz4BYc61amtDhzxNCxgYgdV9rPU+WiC8woX47Xf/2Y=
+
+# The IP range that we may send packets to for this peer.
+AllowedIPs = 192.168.2.0/24
+
+# Address of the server
+Endpoint = 192.168.1.4:8040
+
+# Send periodic keepalives to ensure connection stays up behind NAT.
+PersistentKeepalive = 25
+
+
--- a/installer/icon.ico
+++ b/installer/icon.ico
--- a/installer/signplugin.dll
+++ b/installer/signplugin.dll
--- a/installer/signplugin/.gitignore
+++ b/installer/signplugin/.gitignore
@ -0,0 +1,3 @@
+/Debug/
+/Release/
+/.vs/
--- a/installer/signplugin/chkstk.obj
+++ b/installer/signplugin/chkstk.obj
--- a/installer/signplugin/ed25519.py
+++ b/installer/signplugin/ed25519.py
@ -0,0 +1,104 @@
+import hashlib
+
+b = 256
+q = 2**255 - 19
+l = 2**252 + 27742317777372353535851937790883648493
+
+def H(m):
+  return hashlib.sha512(m).digest()
+
+def expmod(b,e,m):
+  if e == 0: return 1
+  t = expmod(b,e/2,m)**2 % m
+  if e & 1: t = (t*b) % m
+  return t
+
+def inv(x):
+  return expmod(x,q-2,q)
+
+d = -121665 * inv(121666)
+I = expmod(2,(q-1)/4,q)
+
+def xrecover(y):
+  xx = (y*y-1) * inv(d*y*y+1)
+  x = expmod(xx,(q+3)/8,q)
+  if (x*x - xx) % q != 0: x = (x*I) % q
+  if x % 2 != 0: x = q-x
+  return x
+
+By = 4 * inv(5)
+Bx = xrecover(By)
+B = [Bx % q,By % q]
+
+def edwards(P,Q):
+  x1 = P[0]
+  y1 = P[1]
+  x2 = Q[0]
+  y2 = Q[1]
+  x3 = (x1*y2+x2*y1) * inv(1+d*x1*x2*y1*y2)
+  y3 = (y1*y2+x1*x2) * inv(1-d*x1*x2*y1*y2)
+  return [x3 % q,y3 % q]
+
+def scalarmult(P,e):
+  if e == 0: return [0,1]
+  Q = scalarmult(P,e/2)
+  Q = edwards(Q,Q)
+  if e & 1: Q = edwards(Q,P)
+  return Q
+
+def encodeint(y):
+  bits = [(y >> i) & 1 for i in range(b)]
+  return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)])
+
+def encodepoint(P):
+  x = P[0]
+  y = P[1]
+  bits = [(y >> i) & 1 for i in range(b - 1)] + [x & 1]
+  return ''.join([chr(sum([bits[i * 8 + j] << j for j in range(8)])) for i in range(b/8)])
+
+def bit(h,i):
+  return (ord(h[i/8]) >> (i%8)) & 1
+
+def publickey(sk):
+  h = H(sk)
+  a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2))
+  A = scalarmult(B,a)
+  return encodepoint(A)
+
+def Hint(m):
+  h = H(m)
+  return sum(2**i * bit(h,i) for i in range(2*b))
+
+def signature(m,sk,pk):
+  h = H(sk)
+  a = 2**(b-2) + sum(2**i * bit(h,i) for i in range(3,b-2))
+  r = Hint(''.join([h[i] for i in range(b/8,b/4)]) + m)
+  R = scalarmult(B,r)
+  S = (r + Hint(encodepoint(R) + pk + m) * a) % l
+  return encodepoint(R) + encodeint(S)
+
+def isoncurve(P):
+  x = P[0]
+  y = P[1]
+  return (-x*x + y*y - 1 - d*x*x*y*y) % q == 0
+
+def decodeint(s):
+  return sum(2**i * bit(s,i) for i in range(0,b))
+
+def decodepoint(s):
+  y = sum(2**i * bit(s,i) for i in range(0,b-1))
+  x = xrecover(y)
+  if x & 1 != bit(s,b-1): x = q-x
+  P = [x,y]
+  if not isoncurve(P): raise Exception("decoding point that is not on curve")
+  return P
+
+def checkvalid(s,m,pk):
+  if len(s) != b/4: raise Exception("signature length is wrong")
+  if len(pk) != b/8: raise Exception("public-key length is wrong")
+  R = decodepoint(s[0:b/8])
+  A = decodepoint(pk)
+  S = decodeint(s[b/8:b/4])
+  h = Hint(encodepoint(R) + pk + m)
+  if scalarmult(B,S) != edwards(R,scalarmult(A,h)):
+    raise Exception("signature does not pass verification")
--- a/installer/signplugin/ed_signtool.py
+++ b/installer/signplugin/ed_signtool.py
@ -0,0 +1,22 @@
+import hashlib
+
+def H(m):
+  return hashlib.sha512(m).digest()
+
+import ed25519
+import os
+
+sk = "".join(chr(c) for c in [4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235])
+pk = "".join(chr(c) for c in [79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99])
+
+hash = H(file('../tap/TunSafe-TAP-9.21.2.exe', 'rb').read()) 
+print hash.encode('hex'), repr(hash)
+
+#sk = os.urandom(32)
+#pk = ed25519.publickey(sk)
+#print 'sk', [ord(c) for c in sk]
+#print 'pk', [ord(c) for c in pk]
+
+#m = 'test'
+s = ed25519.signature(hash,sk,pk)
+file('../tap/TunSafe-TAP-9.21.2.exe.sig', 'wb').write(s.encode('hex'))
--- a/installer/signplugin/main.cpp
+++ b/installer/signplugin/main.cpp
@ -0,0 +1,121 @@
+#include <Windows.h>
+extern "C" {
+#include "tiny/edsign.h"
+#include "nsis/pluginapi.h"
+#include "tiny/sha512.h"
+}
+
+// To work with Unicode version of NSIS, please use TCHAR-type
+// functions for accessing the variables and the stack.
+
+unsigned char buffer[4096];
+
+// sk[4, 213, 116, 80, 117, 4, 70, 166, 244, 214, 234, 159, 197, 101, 182, 177, 106, 180, 68, 125, 51, 32, 159, 77, 27, 151, 233, 91, 109, 184, 147, 235]
+// pk[79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99]
+static const unsigned char pk[32] = {79, 236, 107, 197, 85, 239, 235, 109, 123, 181, 230, 115, 206, 112, 218, 80, 174, 167, 119, 187, 113, 153, 17, 115, 77, 100, 154, 84, 181, 194, 254, 99};
+
+int CheckFile(char *file) {
+  sha512_state ctx;
+  int ret;
+  HANDLE h;
+  unsigned char out[64];
+  unsigned char signature[64];
+
+  h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+  if (h == INVALID_HANDLE_VALUE)
+    return 1;
+  DWORD n;
+  sha512_init(&ctx);
+
+  size_t total_size = 0;
+  size_t p = 0;
+  while (ReadFile(h, buffer, sizeof(buffer), &n, NULL) && n) {
+    total_size += n;
+    p = 0;
+    while (p + 128 <= n) {
+      sha512_block(&ctx, buffer + p);
+      p += 128;
+    }
+    if (p != n)  // short read: the trailing n - p bytes go to sha512_final below
+      break;
+  }
+  sha512_final(&ctx, buffer + p, total_size);
+  sha512_get(&ctx, out, 0, 64);
+  CloseHandle(h);
+  /*
+  for (size_t i = 0; i < 64; i++) {
+    buffer[i * 2 + 0] = "0123456789abcdef"[out[i] >> 4];
+    buffer[i * 2 + 1] = "0123456789abcdef"[out[i] & 0xF];
+  }
+  buffer[128] = 0;
+  MessageBoxA(0, (char*)buffer, "sha", 0);
+  */
+  char *x = file;
+  while (*x)x++;
+  memcpy(x, ".sig", 5);  // append ".sig" plus its NUL; the caller over-allocates the buffer by 10 chars
+
+  h = CreateFileA(file, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+  if (h == INVALID_HANDLE_VALUE)
+    return 2;
+  n = 0;
+  ReadFile(h, buffer, sizeof(buffer), &n, NULL);
+  CloseHandle(h);
+  if (n < 128)
+    return 3;
+
+  memset(signature, 0, sizeof(signature));
+  
+  for (int i = 0; i < 128; i++) {
+    unsigned char c = buffer[i];
+    if (c >= '0' && c <= '9')
+      c -= '0';
+    else if ((c |= 32), c >= 'a' && c <= 'f')
+      c -= 'a' - 10;
+    else
+      return 4;
+    signature[i >> 1] = (signature[i >> 1] << 4) + c;
+  }
+
+  /* create a random seed, and a keypair out of that seed */
+  //ed25519_create_seed(seed);
+  //ed25519_create_keypair(public_key, private_key, seed);
+
+  /* create signature on the message with the keypair */
+  //ed25519_sign(signature, message, message_len, public_key, private_key);
+
+  /* verify the signature */
+  return edsign_verify(signature, pk, out, sizeof(out)) ? 0 : 5;
+}
+
+extern "C" void __declspec(dllexport) myFunction(HWND hwndParent, int string_size,
+                                      LPTSTR variables, stack_t **stacktop,
+                                      extra_parameters *extra, ...) {
+  EXDLL_INIT();
+
+  int rv = 10;
+
+  // note if you want parameters from the stack, pop them off in order.
+  // i.e. if you are called via exdll::myFunction file.dat read.txt
+  // calling popstring() the first time would give you file.dat,
+  // and the second time would give you read.txt. 
+  // you should empty the stack of your parameters, and ONLY your
+  // parameters.
+
+  // do your stuff here
+  {
+    LPTSTR msgbuf = (LPTSTR)GlobalAlloc(GPTR, (string_size + 1 + 10) * sizeof(*msgbuf));
+    if (msgbuf) {
+      if (!popstring(msgbuf)) {
+        rv = CheckFile(msgbuf);
+      }
+      GlobalFree(msgbuf);
+    }
+  }
+
+  pushint(rv);
+}
+
+
+BOOL WINAPI DllMain(HINSTANCE hInst, ULONG ul_reason_for_call, LPVOID lpReserved) {
+  return TRUE;
+}
--- a/installer/signplugin/nsis/api.h
+++ b/installer/signplugin/nsis/api.h
@ -0,0 +1,85 @@
+/*
+ * apih
+ * 
+ * This file is a part of NSIS.
+ * 
+ * Copyright (C) 1999-2018 Nullsoft and Contributors
+ * 
+ * Licensed under the zlib/libpng license (the "License");
+ * you may not use this file except in compliance with the License.
+ * 
+ * Licence details can be found in the file COPYING.
+ * 
+ * This software is provided 'as-is', without any express or implied
+ * warranty.
+ */
+
+#ifndef _NSIS_EXEHEAD_API_H_
+#define _NSIS_EXEHEAD_API_H_
+
+// Starting with NSIS 2.42, you can check the version of the plugin API in exec_flags->plugin_api_version
+// The format is 0xXXXXYYYY where X is the major version and Y is the minor version (MAKELONG(y,x))
+// When doing version checks, always remember to use >=, ex: if (pX->exec_flags->plugin_api_version >= NSISPIAPIVER_1_0) {}
+
+#define NSISPIAPIVER_1_0 0x00010000
+#define NSISPIAPIVER_CURR NSISPIAPIVER_1_0
+
+// NSIS Plug-In Callback Messages
+enum NSPIM 
+{
+  NSPIM_UNLOAD,    // This is the last message a plugin gets, do final cleanup
+  NSPIM_GUIUNLOAD, // Called after .onGUIEnd
+};
+
+// Prototype for callbacks registered with extra_parameters->RegisterPluginCallback()
+// Return NULL for unknown messages
+// Should always be __cdecl for future expansion possibilities
+typedef UINT_PTR (*NSISPLUGINCALLBACK)(enum NSPIM);
+
+// extra_parameters data structure containing other interesting stuff
+// besides the stack, variables and HWND passed on to plug-ins.
+typedef struct
+{
+  int autoclose;          // SetAutoClose
+  int all_user_var;       // SetShellVarContext: User context = 0, Machine context = 1
+  int exec_error;         // IfErrors
+  int abort;              // IfAbort
+  int exec_reboot;        // IfRebootFlag (NSIS_SUPPORT_REBOOT)
+  int reboot_called;      // NSIS_SUPPORT_REBOOT
+  int XXX_cur_insttype;   // Deprecated
+  int plugin_api_version; // Plug-in ABI. See NSISPIAPIVER_CURR (Note: used to be XXX_insttype_changed)
+  int silent;             // IfSilent (NSIS_CONFIG_SILENT_SUPPORT)
+  int instdir_error;      // GetInstDirError
+  int rtl;                // 1 if $LANGUAGE is a RTL language
+  int errlvl;             // SetErrorLevel
+  int alter_reg_view;     // SetRegView: Default View = 0, Alternative View = (sizeof(void*) > 4 ? KEY_WOW64_32KEY : KEY_WOW64_64KEY)
+  int status_update;      // SetDetailsPrint
+} exec_flags_t;
+
+#ifndef NSISCALL
+#  define NSISCALL __stdcall
+#endif
+#if !defined(_WIN32) && !defined(LPTSTR)
+#  define LPTSTR TCHAR*
+#endif
+
+typedef struct {
+  exec_flags_t *exec_flags;
+  int (NSISCALL *ExecuteCodeSegment)(int, HWND);
+  void (NSISCALL *validate_filename)(LPTSTR);
+  int (NSISCALL *RegisterPluginCallback)(HMODULE, NSISPLUGINCALLBACK); // returns 0 on success, 1 if already registered and < 0 on errors
+} extra_parameters;
+
+// Definitions for page showing plug-ins
+// See Ui.c to understand better how they're used
+
+// sent to the outer window to tell it to go to the next inner window
+#define WM_NOTIFY_OUTER_NEXT (WM_USER+0x8)
+
+// custom pages should send this message to let NSIS know they're ready
+#define WM_NOTIFY_CUSTOM_READY (WM_USER+0xd)
+
+// sent as wParam with WM_NOTIFY_OUTER_NEXT when user cancels - heed its warning
+#define NOTIFY_BYE_BYE 'x'
+
+#endif /* _NSIS_EXEHEAD_API_H_ */
--- a/installer/signplugin/nsis/nsis_tchar.h
+++ b/installer/signplugin/nsis/nsis_tchar.h
@ -0,0 +1,229 @@
+/*
+ * nsis_tchar.h
+ * 
+ * This file is a part of NSIS.
+ * 
+ * Copyright (C) 1999-2018 Nullsoft and Contributors
+ * 
+ * This software is provided 'as-is', without any express or implied
+ * warranty.
+ *
+ * For Unicode support by Jim Park -- 08/30/2007
+ */
+
+// Jim Park: Only those we use are listed here.
+
+#pragma once
+
+#ifdef _UNICODE
+
+#ifndef _T
+#define __T(x)   L ## x
+#define _T(x)    __T(x)
+#define _TEXT(x) __T(x)
+#endif
+
+#ifndef _TCHAR_DEFINED
+#define _TCHAR_DEFINED
+#if !defined(_NATIVE_WCHAR_T_DEFINED) && !defined(_WCHAR_T_DEFINED)
+typedef unsigned short TCHAR;
+#else
+typedef wchar_t TCHAR;
+#endif
+#endif
+
+
+// program
+#define _tenviron   _wenviron
+#define __targv     __wargv
+
+// printfs
+#define _ftprintf   fwprintf
+#define _sntprintf  _snwprintf
+#if (defined(_MSC_VER) && (_MSC_VER<=1310||_MSC_FULL_VER<=140040310)) || defined(__MINGW32__)
+#	define _stprintf   swprintf
+#else
+#	define _stprintf   _swprintf
+#endif
+#define _tprintf    wprintf
+#define _vftprintf  vfwprintf
+#define _vsntprintf _vsnwprintf
+#if defined(_MSC_VER) && (_MSC_VER<=1310)
+#	define _vstprintf  vswprintf
+#else
+#	define _vstprintf  _vswprintf
+#endif
+
+// scanfs
+#define _tscanf     wscanf
+#define _stscanf    swscanf
+
+// string manipulations
+#define _tcscat     wcscat
+#define _tcschr     wcschr
+#define _tcsclen    wcslen
+#define _tcscpy     wcscpy
+#define _tcsdup     _wcsdup
+#define _tcslen     wcslen
+#define _tcsnccpy   wcsncpy
+#define _tcsncpy    wcsncpy
+#define _tcsrchr    wcsrchr
+#define _tcsstr     wcsstr
+#define _tcstok     wcstok
+
+// string comparisons
+#define _tcscmp     wcscmp
+#define _tcsicmp    _wcsicmp
+#define _tcsncicmp  _wcsnicmp
+#define _tcsncmp    wcsncmp
+#define _tcsnicmp   _wcsnicmp
+
+// upper / lower
+#define _tcslwr     _wcslwr
+#define _tcsupr     _wcsupr
+#define _totlower   towlower
+#define _totupper   towupper
+
+// conversions to numbers
+#define _tcstoi64   _wcstoi64
+#define _tcstol     wcstol
+#define _tcstoul    wcstoul
+#define _tstof      _wtof
+#define _tstoi      _wtoi
+#define _tstoi64    _wtoi64
+#define _ttoi       _wtoi
+#define _ttoi64     _wtoi64
+#define _ttol       _wtol
+
+// conversion from numbers to strings
+#define _itot       _itow
+#define _ltot       _ltow
+#define _i64tot     _i64tow
+#define _ui64tot    _ui64tow
+
+// file manipulations
+#define _tfopen     _wfopen
+#define _topen      _wopen
+#define _tremove    _wremove
+#define _tunlink    _wunlink
+
+// reading and writing to i/o
+#define _fgettc     fgetwc
+#define _fgetts     fgetws
+#define _fputts     fputws
+#define _gettchar   getwchar
+
+// directory
+#define _tchdir     _wchdir
+
+// environment
+#define _tgetenv    _wgetenv
+#define _tsystem    _wsystem
+
+// time
+#define _tcsftime   wcsftime
+
+#else // ANSI
+
+#ifndef _T
+#define _T(x)    x
+#define _TEXT(x) x
+#endif
+
+#ifndef _TCHAR_DEFINED
+#define _TCHAR_DEFINED
+typedef char TCHAR;
+#endif
+
+// program
+#define _tenviron   environ
+#define __targv     __argv
+
+// printfs
+#define _ftprintf   fprintf
+#define _sntprintf  _snprintf
+#define _stprintf   sprintf
+#define _tprintf    printf
+#define _vftprintf  vfprintf
+#define _vsntprintf _vsnprintf
+#define _vstprintf  vsprintf
+
+// scanfs
+#define _tscanf     scanf
+#define _stscanf    sscanf
+
+// string manipulations
+#define _tcscat     strcat
+#define _tcschr     strchr
+#define _tcsclen    strlen
+#define _tcscnlen   strnlen
+#define _tcscpy     strcpy
+#define _tcsdup     _strdup
+#define _tcslen     strlen
+#define _tcsnccpy   strncpy
+#define _tcsrchr    strrchr
+#define _tcsstr     strstr
+#define _tcstok     strtok
+
+// string comparisons
+#define _tcscmp     strcmp
+#define _tcsicmp    _stricmp
+#define _tcsncmp    strncmp
+#define _tcsncicmp  _strnicmp
+#define _tcsnicmp   _strnicmp
+
+// upper / lower
+#define _tcslwr     _strlwr
+#define _tcsupr     _strupr
+
+#define _totupper   toupper
+#define _totlower   tolower
+
+// conversions to numbers
+#define _tcstol     strtol
+#define _tcstoul    strtoul
+#define _tstof      atof
+#define _tstoi      atoi
+#define _tstoi64    _atoi64
+#define _tstoi64    _atoi64
+#define _ttoi       atoi
+#define _ttoi64     _atoi64
+#define _ttol       atol
+
+// conversion from numbers to strings
+#define _i64tot     _i64toa
+#define _itot       _itoa
+#define _ltot       _ltoa
+#define _ui64tot    _ui64toa
+
+// file manipulations
+#define _tfopen     fopen
+#define _topen      _open
+#define _tremove    remove
+#define _tunlink    _unlink
+
+// reading and writing to i/o
+#define _fgettc     fgetc
+#define _fgetts     fgets
+#define _fputts     fputs
+#define _gettchar   getchar
+
+// directory
+#define _tchdir     _chdir
+
+// environment
+#define _tgetenv    getenv
+#define _tsystem    system
+
+// time
+#define _tcsftime   strftime
+
+#endif
+
+// is functions (the same in Unicode / ANSI)
+#define _istgraph   isgraph
+#define _istascii   __isascii
+
+#define __TFILE__ _T(__FILE__)
+#define __TDATE__ _T(__DATE__)
+#define __TTIME__ _T(__TIME__)
--- a/installer/signplugin/nsis/pluginapi-x86-ansi.lib
+++ b/installer/signplugin/nsis/pluginapi-x86-ansi.lib
--- a/installer/signplugin/nsis/pluginapi-x86-unicode.lib
+++ b/installer/signplugin/nsis/pluginapi-x86-unicode.lib
--- a/installer/signplugin/nsis/pluginapi.h
+++ b/installer/signplugin/nsis/pluginapi.h
@ -0,0 +1,108 @@
+#ifndef ___NSIS_PLUGIN__H___
+#define ___NSIS_PLUGIN__H___
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "api.h"
+#include "nsis_tchar.h" // BUGBUG: Why cannot our plugins use the compilers tchar.h?
+
+#ifndef NSISCALL
+#  define NSISCALL WINAPI
+#endif
+
+#define EXDLL_INIT()           {  \
+        g_stringsize=string_size; \
+        g_stacktop=stacktop;      \
+        g_variables=variables; }
+
+typedef struct _stack_t {
+  struct _stack_t *next;
+#ifdef UNICODE
+  WCHAR text[1]; // this should be the length of g_stringsize when allocating
+#else
+  char text[1];
+#endif
+} stack_t;
+
+enum
+{
+INST_0,         // $0
+INST_1,         // $1
+INST_2,         // $2
+INST_3,         // $3
+INST_4,         // $4
+INST_5,         // $5
+INST_6,         // $6
+INST_7,         // $7
+INST_8,         // $8
+INST_9,         // $9
+INST_R0,        // $R0
+INST_R1,        // $R1
+INST_R2,        // $R2
+INST_R3,        // $R3
+INST_R4,        // $R4
+INST_R5,        // $R5
+INST_R6,        // $R6
+INST_R7,        // $R7
+INST_R8,        // $R8
+INST_R9,        // $R9
+INST_CMDLINE,   // $CMDLINE
+INST_INSTDIR,   // $INSTDIR
+INST_OUTDIR,    // $OUTDIR
+INST_EXEDIR,    // $EXEDIR
+INST_LANG,      // $LANGUAGE
+__INST_LAST
+};
+
+extern unsigned int g_stringsize;
+extern stack_t **g_stacktop;
+extern LPTSTR g_variables;
+
+void NSISCALL pushstring(LPCTSTR str);
+void NSISCALL pushintptr(INT_PTR value);
+#define pushint(v) pushintptr((INT_PTR)(v))
+int NSISCALL popstring(LPTSTR str); // 0 on success, 1 on empty stack
+int NSISCALL popstringn(LPTSTR str, int maxlen); // with length limit, pass 0 for g_stringsize
+INT_PTR NSISCALL popintptr();
+#define popint() ( (int) popintptr() )
+int NSISCALL popint_or(); // with support for or'ing (2|4|8)
+INT_PTR NSISCALL nsishelper_str_to_ptr(LPCTSTR s);
+#define myatoi(s) ( (int) nsishelper_str_to_ptr(s) ) // converts a string to an integer
+unsigned int NSISCALL myatou(LPCTSTR s); // converts a string to an unsigned integer, decimal only
+int NSISCALL myatoi_or(LPCTSTR s); // with support for or'ing (2|4|8)
+LPTSTR NSISCALL getuservariable(const int varnum);
+void NSISCALL setuservariable(const int varnum, LPCTSTR var);
+
+#ifdef UNICODE
+#define PopStringW(x) popstring(x)
+#define PushStringW(x) pushstring(x)
+#define SetUserVariableW(x,y) setuservariable(x,y)
+
+int  NSISCALL PopStringA(LPSTR ansiStr);
+void NSISCALL PushStringA(LPCSTR ansiStr);
+void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr);
+void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr);
+void NSISCALL SetUserVariableA(const int varnum, LPCSTR ansiStr);
+
+#else
+// ANSI defs
+
+#define PopStringA(x) popstring(x)
+#define PushStringA(x) pushstring(x)
+#define SetUserVariableA(x,y) setuservariable(x,y)
+
+int  NSISCALL PopStringW(LPWSTR wideStr);
+void NSISCALL PushStringW(LPWSTR wideStr);
+void NSISCALL GetUserVariableW(const int varnum, LPWSTR wideStr);
+void NSISCALL GetUserVariableA(const int varnum, LPSTR ansiStr);
+void NSISCALL SetUserVariableW(const int varnum, LPCWSTR wideStr);
+
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif//!___NSIS_PLUGIN__H___
--- a/installer/signplugin/signplugin.sln
+++ b/installer/signplugin/signplugin.sln
@ -0,0 +1,28 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio 15
+VisualStudioVersion = 15.0.26403.7
+MinimumVisualStudioVersion = 10.0.40219.1
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "signplugin", "signplugin.vcxproj", "{C6E4A1D7-ECBC-466E-9183-30727EF81533}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|x64 = Debug|x64
+		Debug|x86 = Debug|x86
+		Release|x64 = Release|x64
+		Release|x86 = Release|x86
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.ActiveCfg = Debug|x64
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x64.Build.0 = Debug|x64
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.ActiveCfg = Debug|Win32
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Debug|x86.Build.0 = Debug|Win32
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.ActiveCfg = Release|x64
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x64.Build.0 = Release|x64
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.ActiveCfg = Release|Win32
+		{C6E4A1D7-ECBC-466E-9183-30727EF81533}.Release|x86.Build.0 = Release|Win32
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+EndGlobal
--- a/installer/signplugin/signplugin.vcxproj
+++ b/installer/signplugin/signplugin.vcxproj
@ -0,0 +1,166 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <VCProjectVersion>15.0</VCProjectVersion>
+    <ProjectGuid>{C6E4A1D7-ECBC-466E-9183-30727EF81533}</ProjectGuid>
+    <Keyword>Win32Proj</Keyword>
+    <WindowsTargetPlatformVersion>10.0.15063.0</WindowsTargetPlatformVersion>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <ConfigurationType>DynamicLibrary</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <ConfigurationType>DynamicLibrary</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <WholeProgramOptimization>false</WholeProgramOptimization>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+  </ImportGroup>
+  <ImportGroup Label="Shared">
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <LinkIncremental>true</LinkIncremental>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <LinkIncremental>true</LinkIncremental>
+    <GenerateManifest>false</GenerateManifest>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <PreprocessorDefinitions>WIN32;_DEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <RuntimeLibrary>MultiThreadedDebugDLL</RuntimeLibrary>
+      <WarningLevel>Level3</WarningLevel>
+      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>
+      <Optimization>Disabled</Optimization>
+    </ClCompile>
+    <Link>
+      <TargetMachine>MachineX86</TargetMachine>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+      <SubSystem>Windows</SubSystem>
+      <EntryPointSymbol>
+      </EntryPointSymbol>
+      <IgnoreAllDefaultLibraries>false</IgnoreAllDefaultLibraries>
+      <ImageHasSafeExceptionHandlers>false</ImageHasSafeExceptionHandlers>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <PreprocessorDefinitions>WIN32;NDEBUG;_WINDOWS;_USRDLL;SIGNPLUGIN_EXPORTS;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <WarningLevel>Level3</WarningLevel>
+      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>
+      <ExceptionHandling>false</ExceptionHandling>
+      <BufferSecurityCheck>false</BufferSecurityCheck>
+      <Optimization>MinSpace</Optimization>
+      <OmitFramePointers>true</OmitFramePointers>
+      <FunctionLevelLinking>true</FunctionLevelLinking>
+    </ClCompile>
+    <Link>
+      <TargetMachine>MachineX86</TargetMachine>
+      <GenerateDebugInformation>false</GenerateDebugInformation>
+      <SubSystem>Windows</SubSystem>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <IgnoreAllDefaultLibraries>true</IgnoreAllDefaultLibraries>
+      <EntryPointSymbol>DllMain</EntryPointSymbol>
+      <ImageHasSafeExceptionHandlers>false</ImageHasSafeExceptionHandlers>
+      <LinkTimeCodeGeneration>UseLinkTimeCodeGeneration</LinkTimeCodeGeneration>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemGroup>
+    <ClCompile Include="main.cpp" />
+    <ClCompile Include="tiny\c25519.c" />
+    <ClCompile Include="tiny\ed25519.c" />
+    <ClCompile Include="tiny\edsign.c" />
+    <ClCompile Include="tiny\f25519.c" />
+    <ClCompile Include="tiny\fprime.c" />
+    <ClCompile Include="tiny\morph25519.c" />
+    <ClCompile Include="tiny\sha512.c" />
+    <ClCompile Include="win32_crt_math.cpp" />
+    <ClCompile Include="win32_crt_memory.cpp" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="curve25519-donna-32bit.h" />
+    <ClInclude Include="curve25519-donna-64bit.h" />
+    <ClInclude Include="curve25519-donna-helpers.h" />
+    <ClInclude Include="curve25519-donna-sse2.h" />
+    <ClInclude Include="ed25519-donna-32bit-tables.h" />
+    <ClInclude Include="ed25519-donna-64bit-tables.h" />
+    <ClInclude Include="ed25519-donna-batchverify.h" />
+    <ClInclude Include="ed25519-donna-impl-base.h" />
+    <ClInclude Include="ed25519-donna-impl-sse2.h" />
+    <ClInclude Include="ed25519-donna-portable-identify.h" />
+    <ClInclude Include="ed25519-donna-portable.h" />
+    <ClInclude Include="ed25519-donna.h" />
+    <ClInclude Include="ed25519-hash-custom.h" />
+    <ClInclude Include="ed25519-hash.h" />
+    <ClInclude Include="ed25519-randombytes.h" />
+    <ClInclude Include="ed25519.h" />
+    <ClInclude Include="modm-donna-32bit.h" />
+    <ClInclude Include="modm-donna-64bit.h" />
+    <ClInclude Include="tiny\c25519.h" />
+    <ClInclude Include="tiny\ed25519.h" />
+    <ClInclude Include="tiny\edsign.h" />
+    <ClInclude Include="tiny\f25519.h" />
+    <ClInclude Include="tiny\fprime.h" />
+    <ClInclude Include="tiny\morph25519.h" />
+    <ClInclude Include="tiny\sha512.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <Object Include="chkstk.obj">
+      <ExcludedFromBuild Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">true</ExcludedFromBuild>
+    </Object>
+  </ItemGroup>
+  <ItemGroup>
+    <Library Include="nsis\pluginapi-x86-unicode.lib" />
+  </ItemGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+  </ImportGroup>
+</Project>
--- a/installer/signplugin/signplugin.vcxproj.filters
+++ b/installer/signplugin/signplugin.vcxproj.filters
@ -0,0 +1,132 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <Filter Include="Source Files">
+      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
+      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
+    </Filter>
+    <Filter Include="Header Files">
+      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
+      <Extensions>h;hh;hpp;hxx;hm;inl;inc;xsd</Extensions>
+    </Filter>
+    <Filter Include="Resource Files">
+      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
+      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav</Extensions>
+    </Filter>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="main.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="win32_crt_math.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="win32_crt_memory.cpp">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\c25519.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\ed25519.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\edsign.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\f25519.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\fprime.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\morph25519.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="tiny\sha512.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="curve25519-donna-32bit.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="curve25519-donna-64bit.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="curve25519-donna-helpers.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="curve25519-donna-sse2.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-32bit-tables.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-64bit-tables.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-batchverify.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-impl-base.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-impl-sse2.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-portable-identify.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna-portable.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-donna.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-hash-custom.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-hash.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519-randombytes.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="ed25519.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="modm-donna-32bit.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="modm-donna-64bit.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\c25519.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\ed25519.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\edsign.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\f25519.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\fprime.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\morph25519.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+    <ClInclude Include="tiny\sha512.h">
+      <Filter>Source Files</Filter>
+    </ClInclude>
+  </ItemGroup>
+  <ItemGroup>
+    <Object Include="chkstk.obj" />
+  </ItemGroup>
+  <ItemGroup>
+    <Library Include="nsis\pluginapi-x86-unicode.lib" />
+  </ItemGroup>
+</Project>
--- a/installer/signplugin/signplugin.vcxproj.user
+++ b/installer/signplugin/signplugin.vcxproj.user
@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <PropertyGroup />
+</Project>
--- a/installer/signplugin/tiny/c25519.c
+++ b/installer/signplugin/tiny/c25519.c
@ -0,0 +1,124 @@
+/* Curve25519 (Montgomery form)
+ * Daniel Beer <dlbeer@gmail.com>, 18 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#include "c25519.h"
+
+const uint8_t c25519_base_x[F25519_SIZE] = {9};
+
+/* Double an X-coordinate */
+static void xc_double(uint8_t *x3, uint8_t *z3,
+		      const uint8_t *x1, const uint8_t *z1)
+{
+	/* Explicit formulas database: dbl-1987-m
+	 *
+	 * source 1987 Montgomery "Speeding the Pollard and elliptic
+	 *   curve methods of factorization", page 261, fourth display
+	 * compute X3 = (X1^2-Z1^2)^2
+	 * compute Z3 = 4 X1 Z1 (X1^2 + a X1 Z1 + Z1^2)
+	 */
+	uint8_t x1sq[F25519_SIZE];
+	uint8_t z1sq[F25519_SIZE];
+	uint8_t x1z1[F25519_SIZE];
+	uint8_t a[F25519_SIZE];
+
+	f25519_mul__distinct(x1sq, x1, x1);
+	f25519_mul__distinct(z1sq, z1, z1);
+	f25519_mul__distinct(x1z1, x1, z1);
+
+	f25519_sub(a, x1sq, z1sq);
+	f25519_mul__distinct(x3, a, a);
+
+	f25519_mul_c(a, x1z1, 486662);
+	f25519_add(a, x1sq, a);
+	f25519_add(a, z1sq, a);
+	f25519_mul__distinct(x1sq, x1z1, a);
+	f25519_mul_c(z3, x1sq, 4);
+}
+
+/* Differential addition */
+static void xc_diffadd(uint8_t *x5, uint8_t *z5,
+		       const uint8_t *x1, const uint8_t *z1,
+		       const uint8_t *x2, const uint8_t *z2,
+		       const uint8_t *x3, const uint8_t *z3)
+{
+	/* Explicit formulas database: dadd-1987-m-3
+	 *
+	 * source 1987 Montgomery "Speeding the Pollard and elliptic curve
+	 *   methods of factorization", page 261, fifth display, plus
+	 *   common-subexpression elimination
+	 * compute A = X2+Z2
+	 * compute B = X2-Z2
+	 * compute C = X3+Z3
+	 * compute D = X3-Z3
+	 * compute DA = D A
+	 * compute CB = C B
+	 * compute X5 = Z1(DA+CB)^2
+	 * compute Z5 = X1(DA-CB)^2
+	 */
+	uint8_t da[F25519_SIZE];
+	uint8_t cb[F25519_SIZE];
+	uint8_t a[F25519_SIZE];
+	uint8_t b[F25519_SIZE];
+
+	f25519_add(a, x2, z2);
+	f25519_sub(b, x3, z3); /* D */
+	f25519_mul__distinct(da, a, b);
+
+	f25519_sub(b, x2, z2);
+	f25519_add(a, x3, z3); /* C */
+	f25519_mul__distinct(cb, a, b);
+
+	f25519_add(a, da, cb);
+	f25519_mul__distinct(b, a, a);
+	f25519_mul__distinct(x5, z1, b);
+
+	f25519_sub(a, da, cb);
+	f25519_mul__distinct(b, a, a);
+	f25519_mul__distinct(z5, x1, b);
+}
+
+void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e)
+{
+	/* Current point: P_m */
+	uint8_t xm[F25519_SIZE];
+	uint8_t zm[F25519_SIZE] = {1};
+
+	/* Predecessor: P_(m-1) */
+	uint8_t xm1[F25519_SIZE] = {1};
+	uint8_t zm1[F25519_SIZE] = {0};
+
+	int i;
+
+	/* Note: bit 254 is assumed to be 1 */
+	f25519_copy(xm, q);
+
+	for (i = 253; i >= 0; i--) {
+		const int bit = (e[i >> 3] >> (i & 7)) & 1;
+		uint8_t xms[F25519_SIZE];
+		uint8_t zms[F25519_SIZE];
+
+		/* From P_m and P_(m-1), compute P_(2m) and P_(2m-1) */
+		xc_diffadd(xm1, zm1, q, f25519_one, xm, zm, xm1, zm1);
+		xc_double(xm, zm, xm, zm);
+
+		/* Compute P_(2m+1) */
+		xc_diffadd(xms, zms, xm1, zm1, xm, zm, q, f25519_one);
+
+		/* Select:
+		 *   bit = 1 --> (P_(2m+1), P_(2m))
+		 *   bit = 0 --> (P_(2m), P_(2m-1))
+		 */
+		f25519_select(xm1, xm1, xm, bit);
+		f25519_select(zm1, zm1, zm, bit);
+		f25519_select(xm, xm, xms, bit);
+		f25519_select(zm, zm, zms, bit);
+	}
+
+	/* Freeze out of projective coordinates */
+	f25519_inv__distinct(zm1, zm);
+	f25519_mul__distinct(result, zm1, xm);
+	f25519_normalize(result);
+}
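The ladder in c25519_smult() keeps the pair (P_m, P_(m-1)) and, for each exponent bit, derives P_(2m-1), P_(2m) and P_(2m+1) before selecting the next pair without branching. A minimal sketch of that control flow follows, assuming a toy group (integers modulo n under addition) in place of the curve. The names toy_select and toy_ladder are hypothetical and not part of this commit, and the % operator is not constant-time, so this only illustrates the structure:

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select: returns `zero` when bit == 0, `one` when bit == 1. */
static uint32_t toy_select(uint32_t zero, uint32_t one, uint32_t bit)
{
	const uint32_t mask = 0u - bit; /* 0x00000000 or 0xffffffff */

	return zero ^ (mask & (one ^ zero));
}

/* Toy ladder: returns (e * q) mod n, reading e as an nbits-bit number.
 * As in c25519_smult(), the top bit (nbits - 1) is assumed to be 1.
 */
static uint32_t toy_ladder(uint32_t q, uint32_t e, unsigned nbits, uint32_t n)
{
	uint32_t qr;
	uint32_t pm;      /* current point: P_m, starting at m = 1 */
	uint32_t pm1 = 0; /* predecessor: P_(m-1) = P_0 */
	int i;

	assert(n > 0 && n <= 0x7fffffffu); /* keeps the sums below 2^32 */
	assert(nbits >= 1 && nbits <= 32 && ((e >> (nbits - 1)) & 1) == 1);

	qr = q % n;
	pm = qr;

	for (i = (int)nbits - 2; i >= 0; i--) {
		const uint32_t bit = (e >> i) & 1;
		const uint32_t p2m1 = (pm + pm1) % n;  /* P_(2m-1) */
		const uint32_t p2m = (2 * pm) % n;     /* P_(2m) */
		const uint32_t p2mp1 = (p2m + qr) % n; /* P_(2m+1) */

		/* bit = 1 --> (P_(2m+1), P_(2m)); bit = 0 --> (P_(2m), P_(2m-1)) */
		pm1 = toy_select(p2m1, p2m, bit);
		pm = toy_select(p2m, p2mp1, bit);
	}

	return pm;
}
```

For instance, toy_ladder(7, 13, 4, 1000) walks m = 1, 3, 6, 13 and returns 91. All three candidate values are computed on every iteration regardless of the bit, which is the property the real ladder relies on.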
--- a/installer/signplugin/tiny/c25519.h
+++ b/installer/signplugin/tiny/c25519.h
@ -0,0 +1,48 @@
+/* Curve25519 (Montgomery form)
+ * Daniel Beer <dlbeer@gmail.com>, 18 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef C25519_H_
+#define C25519_H_
+
+#include <stdint.h>
+#include "f25519.h"
+
+/* Curve25519 has the equation over F(p = 2^255-19):
+ *
+ *    y^2 = x^3 + 486662x^2 + x
+ *
+ * 486662 = 4A+2, where A = 121665. This is a Montgomery curve.
+ *
+ * For more information, see:
+ *
+ *    Bernstein, D.J. (2006) "Curve25519: New Diffie-Hellman speed
+ *    records". Document ID: 4230efdfa673480fc079449d90f322c0.
+ */
+
+/* This is the size of a Curve25519 exponent (private key) */
+#define C25519_EXPONENT_SIZE  32
+
+/* Having generated 32 random bytes, you should call this function to
+ * finalize the generated key.
+ */
+static inline void c25519_prepare(uint8_t *key)
+{
+	key[0] &= 0xf8;
+	key[31] &= 0x7f;
+	key[31] |= 0x40;
+}
+
+/* X-coordinate of the base point */
+extern const uint8_t c25519_base_x[F25519_SIZE];
+
+/* X-coordinate scalar multiply: given the X-coordinate of q, return the
+ * X-coordinate of e*q.
+ *
+ * result and q are field elements. e is an exponent.
+ */
+void c25519_smult(uint8_t *result, const uint8_t *q, const uint8_t *e);
+
+#endif
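The three masks in c25519_prepare() clear the low three bits (so the scalar is a multiple of 8), clear bit 255, and set bit 254, which is the bit the ladder in c25519_smult() assumes to be 1. A small sketch of the effect on a 32-byte key follows; toy_clamp and toy_is_clamped are hypothetical helpers written for illustration, not functions from this commit:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the key already has the clamped shape, 0 otherwise. */
static int toy_is_clamped(const uint8_t *key)
{
	return (key[0] & 0x07) == 0 && (key[31] & 0xc0) == 0x40;
}

/* Same masks as c25519_prepare(), applied to a 32-byte key in place. */
static void toy_clamp(uint8_t *key)
{
	key[0] &= 0xf8;  /* clear bits 0..2: multiple of 8 */
	key[31] &= 0x7f; /* clear bit 255 */
	key[31] |= 0x40; /* set bit 254 */

	assert(toy_is_clamped(key));
}
```

Clamping is idempotent: applying toy_clamp() to an already clamped key leaves it unchanged.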
--- a/installer/signplugin/tiny/ed25519.c
+++ b/installer/signplugin/tiny/ed25519.c
@ -0,0 +1,320 @@
+/* Edwards curve operations
+ * Daniel Beer <dlbeer@gmail.com>, 9 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#include "ed25519.h"
+
+/* Base point is (numbers wrapped):
+ *
+ *     x = 151122213495354007725011514095885315114
+ *         54012693041857206046113283949847762202
+ *     y = 463168356949264781694283940034751631413
+ *         07993866256225615783033603165251855960
+ *
+ * y is derived by transforming the original Montgomery base (u=9). x
+ * is the corresponding positive coordinate for the new curve equation.
+ * t is x*y.
+ */
+const struct ed25519_pt ed25519_base = {
+	.x = {
+		0x1a, 0xd5, 0x25, 0x8f, 0x60, 0x2d, 0x56, 0xc9,
+		0xb2, 0xa7, 0x25, 0x95, 0x60, 0xc7, 0x2c, 0x69,
+		0x5c, 0xdc, 0xd6, 0xfd, 0x31, 0xe2, 0xa4, 0xc0,
+		0xfe, 0x53, 0x6e, 0xcd, 0xd3, 0x36, 0x69, 0x21
+	},
+	.y = {
+		0x58, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
+		0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
+		0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
+		0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66
+	},
+	.t = {
+		0xa3, 0xdd, 0xb7, 0xa5, 0xb3, 0x8a, 0xde, 0x6d,
+		0xf5, 0x52, 0x51, 0x77, 0x80, 0x9f, 0xf0, 0x20,
+		0x7d, 0xe3, 0xab, 0x64, 0x8e, 0x4e, 0xea, 0x66,
+		0x65, 0x76, 0x8b, 0xd7, 0x0f, 0x5f, 0x87, 0x67
+	},
+	.z = {1, 0}
+};
+
+const struct ed25519_pt ed25519_neutral = {
+	.x = {0},
+	.y = {1, 0},
+	.t = {0},
+	.z = {1, 0}
+};
+
+/* Conversion to and from projective coordinates */
+void ed25519_project(struct ed25519_pt *p,
+		     const uint8_t *x, const uint8_t *y)
+{
+	f25519_copy(p->x, x);
+	f25519_copy(p->y, y);
+	f25519_load(p->z, 1);
+	f25519_mul__distinct(p->t, x, y);
+}
+
+void ed25519_unproject(uint8_t *x, uint8_t *y,
+		       const struct ed25519_pt *p)
+{
+	uint8_t z1[F25519_SIZE];
+
+	f25519_inv__distinct(z1, p->z);
+	f25519_mul__distinct(x, p->x, z1);
+	f25519_mul__distinct(y, p->y, z1);
+
+	f25519_normalize(x);
+	f25519_normalize(y);
+}
+
+/* Compress/uncompress points. We compress points by storing the y
+ * coordinate and the parity of the x coordinate.
+ *
+ * Rearranging the curve equation, we obtain explicit formulae for the
+ * coordinates:
+ *
+ *     x = sqrt((y^2-1) / (1+dy^2))
+ *     y = sqrt((x^2+1) / (1-dx^2))
+ *
+ * Where d = (-121665/121666), or:
+ *
+ *     d = 370957059346694393431380835087545651895
+ *         42113879843219016388785533085940283555
+ */
+
+static const uint8_t ed25519_d[F25519_SIZE] = {
+	0xa3, 0x78, 0x59, 0x13, 0xca, 0x4d, 0xeb, 0x75,
+	0xab, 0xd8, 0x41, 0x41, 0x4d, 0x0a, 0x70, 0x00,
+	0x98, 0xe8, 0x79, 0x77, 0x79, 0x40, 0xc7, 0x8c,
+	0x73, 0xfe, 0x6f, 0x2b, 0xee, 0x6c, 0x03, 0x52
+};
+
+void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y)
+{
+	uint8_t tmp[F25519_SIZE];
+	uint8_t parity;
+
+	f25519_copy(tmp, x);
+	f25519_normalize(tmp);
+	parity = (tmp[0] & 1) << 7;
+
+	f25519_copy(c, y);
+	f25519_normalize(c);
+	c[31] |= parity;
+}
+
+uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *comp)
+{
+	const int parity = comp[31] >> 7;
+	uint8_t a[F25519_SIZE];
+	uint8_t b[F25519_SIZE];
+	uint8_t c[F25519_SIZE];
+
+	/* Unpack y */
+	f25519_copy(y, comp);
+	y[31] &= 127;
+
+	/* Compute c = y^2 */
+	f25519_mul__distinct(c, y, y);
+
+	/* Compute b = (1+dy^2)^-1 */
+	f25519_mul__distinct(b, c, ed25519_d);
+	f25519_add(a, b, f25519_one);
+	f25519_inv__distinct(b, a);
+
+	/* Compute a = y^2-1 */
+	f25519_sub(a, c, f25519_one);
+
+	/* Compute c = a*b = (y^2-1)/(1+dy^2) */
+	f25519_mul__distinct(c, a, b);
+
+	/* Compute a, b = +/-sqrt(c), if c is square */
+	f25519_sqrt(a, c);
+	f25519_neg(b, a);
+
+	/* Select one of them, based on the compressed parity bit */
+	f25519_select(x, a, b, (a[0] ^ parity) & 1);
+
+	/* Verify that x^2 = c */
+	f25519_mul__distinct(a, x, x);
+	f25519_normalize(a);
+	f25519_normalize(c);
+
+	return f25519_eq(a, c);
+}
+
+/* k = 2d */
+static const uint8_t ed25519_k[F25519_SIZE] = {
+	0x59, 0xf1, 0xb2, 0x26, 0x94, 0x9b, 0xd6, 0xeb,
+	0x56, 0xb1, 0x83, 0x82, 0x9a, 0x14, 0xe0, 0x00,
+	0x30, 0xd1, 0xf3, 0xee, 0xf2, 0x80, 0x8e, 0x19,
+	0xe7, 0xfc, 0xdf, 0x56, 0xdc, 0xd9, 0x06, 0x24
+};
+
+void ed25519_add(struct ed25519_pt *r,
+		 const struct ed25519_pt *p1, const struct ed25519_pt *p2)
+{
+	/* Explicit formulas database: add-2008-hwcd-3
+	 *
+	 * source 2008 Hisil--Wong--Carter--Dawson,
+	 *     http://eprint.iacr.org/2008/522, Section 3.1
+	 * appliesto extended-1
+	 * parameter k
+	 * assume k = 2 d
+	 * compute A = (Y1-X1)(Y2-X2)
+	 * compute B = (Y1+X1)(Y2+X2)
+	 * compute C = T1 k T2
+	 * compute D = Z1 2 Z2
+	 * compute E = B - A
+	 * compute F = D - C
+	 * compute G = D + C
+	 * compute H = B + A
+	 * compute X3 = E F
+	 * compute Y3 = G H
+	 * compute T3 = E H
+	 * compute Z3 = F G
+	 */
+	uint8_t a[F25519_SIZE];
+	uint8_t b[F25519_SIZE];
+	uint8_t c[F25519_SIZE];
+	uint8_t d[F25519_SIZE];
+	uint8_t e[F25519_SIZE];
+	uint8_t f[F25519_SIZE];
+	uint8_t g[F25519_SIZE];
+	uint8_t h[F25519_SIZE];
+
+	/* A = (Y1-X1)(Y2-X2) */
+	f25519_sub(c, p1->y, p1->x);
+	f25519_sub(d, p2->y, p2->x);
+	f25519_mul__distinct(a, c, d);
+
+	/* B = (Y1+X1)(Y2+X2) */
+	f25519_add(c, p1->y, p1->x);
+	f25519_add(d, p2->y, p2->x);
+	f25519_mul__distinct(b, c, d);
+
+	/* C = T1 k T2 */
+	f25519_mul__distinct(d, p1->t, p2->t);
+	f25519_mul__distinct(c, d, ed25519_k);
+
+	/* D = Z1 2 Z2 */
+	f25519_mul__distinct(d, p1->z, p2->z);
+	f25519_add(d, d, d);
+
+	/* E = B - A */
+	f25519_sub(e, b, a);
+
+	/* F = D - C */
+	f25519_sub(f, d, c);
+
+	/* G = D + C */
+	f25519_add(g, d, c);
+
+	/* H = B + A */
+	f25519_add(h, b, a);
+
+	/* X3 = E F */
+	f25519_mul__distinct(r->x, e, f);
+
+	/* Y3 = G H */
+	f25519_mul__distinct(r->y, g, h);
+
+	/* T3 = E H */
+	f25519_mul__distinct(r->t, e, h);
+
+	/* Z3 = F G */
+	f25519_mul__distinct(r->z, f, g);
+}
+
+void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *p)
+{
+	/* Explicit formulas database: dbl-2008-hwcd
+	 *
+	 * source 2008 Hisil--Wong--Carter--Dawson,
+	 *     http://eprint.iacr.org/2008/522, Section 3.3
+	 * compute A = X1^2
+	 * compute B = Y1^2
+	 * compute C = 2 Z1^2
+	 * compute D = a A
+	 * compute E = (X1+Y1)^2-A-B
+	 * compute G = D + B
+	 * compute F = G - C
+	 * compute H = D - B
+	 * compute X3 = E F
+	 * compute Y3 = G H
+	 * compute T3 = E H
+	 * compute Z3 = F G
+	 */
+	uint8_t a[F25519_SIZE];
+	uint8_t b[F25519_SIZE];
+	uint8_t c[F25519_SIZE];
+	uint8_t e[F25519_SIZE];
+	uint8_t f[F25519_SIZE];
+	uint8_t g[F25519_SIZE];
+	uint8_t h[F25519_SIZE];
+
+	/* A = X1^2 */
+	f25519_mul__distinct(a, p->x, p->x);
+
+	/* B = Y1^2 */
+	f25519_mul__distinct(b, p->y, p->y);
+
+	/* C = 2 Z1^2 */
+	f25519_mul__distinct(c, p->z, p->z);
+	f25519_add(c, c, c);
+
+	/* D = a A (alter sign) */
+	/* E = (X1+Y1)^2-A-B */
+	f25519_add(f, p->x, p->y);
+	f25519_mul__distinct(e, f, f);
+	f25519_sub(e, e, a);
+	f25519_sub(e, e, b);
+
+	/* G = D + B */
+	f25519_sub(g, b, a);
+
+	/* F = G - C */
+	f25519_sub(f, g, c);
+
+	/* H = D - B */
+	f25519_neg(h, b);
+	f25519_sub(h, h, a);
+
+	/* X3 = E F */
+	f25519_mul__distinct(r->x, e, f);
+
+	/* Y3 = G H */
+	f25519_mul__distinct(r->y, g, h);
+
+	/* T3 = E H */
+	f25519_mul__distinct(r->t, e, h);
+
+	/* Z3 = F G */
+	f25519_mul__distinct(r->z, f, g);
+}
+
+void ed25519_smult(struct ed25519_pt *r_out, const struct ed25519_pt *p,
+		   const uint8_t *e)
+{
+	struct ed25519_pt r;
+	int i;
+
+	ed25519_copy(&r, &ed25519_neutral);
+
+	for (i = 255; i >= 0; i--) {
+		const uint8_t bit = (e[i >> 3] >> (i & 7)) & 1;
+		struct ed25519_pt s;
+
+		ed25519_double(&r, &r);
+		ed25519_add(&s, &r, p);
+
+		f25519_select(r.x, r.x, s.x, bit);
+		f25519_select(r.y, r.y, s.y, bit);
+		f25519_select(r.z, r.z, s.z, bit);
+		f25519_select(r.t, r.t, s.t, bit);
+	}
+
+	ed25519_copy(r_out, &r);
+}
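ed25519_pack() stores the normalized y coordinate and reuses the otherwise unused top bit of byte 31 for the parity of x. The byte-level layout can be sketched as below, assuming both inputs are already normalized. toy_pack and toy_unpack_y are hypothetical names, and recovering x itself still needs the square root performed in ed25519_try_unpack():

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SIZE 32

/* Pack: copy y and store the low bit of x in bit 7 of byte 31. */
static void toy_pack(uint8_t *c, const uint8_t *x, const uint8_t *y)
{
	assert((y[31] & 0x80) == 0); /* normalized y is below 2^255 */

	memcpy(c, y, TOY_SIZE);
	c[31] |= (uint8_t)((x[0] & 1) << 7);
}

/* Unpack: recover y and return the stored parity of x (0 or 1). */
static int toy_unpack_y(uint8_t *y, const uint8_t *c)
{
	memcpy(y, c, TOY_SIZE);
	y[31] &= 127;

	return c[31] >> 7;
}
```

Because field elements are below 2^255, bit 7 of the last byte is always free, so the packed point fits in the same 32 bytes as a single coordinate.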
--- a/installer/signplugin/tiny/ed25519.h
+++ b/installer/signplugin/tiny/ed25519.h
@ -0,0 +1,82 @@
+/* Edwards curve operations
+ * Daniel Beer <dlbeer@gmail.com>, 9 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef ED25519_H_
+#define ED25519_H_
+
+#include "f25519.h"
+
+/* This is not the Ed25519 signature system. Rather, we're implementing
+ * basic operations on the twisted Edwards curve over (Z mod 2^255-19):
+ *
+ *     -x^2 + y^2 = 1 - (121665/121666)x^2y^2
+ *
+ * With the positive-x base point y = 4/5.
+ *
+ * These functions will not leak secret data through timing.
+ *
+ * For more information, see:
+ *
+ *     Bernstein, D.J. & Lange, T. (2007) "Faster addition and doubling on
+ *     elliptic curves". Document ID: 95616567a6ba20f575c5f25e7cebaf83.
+ *
+ *     Hisil, H. & Wong, K K. & Carter, G. & Dawson, E. (2008) "Twisted
+ *     Edwards curves revisited". Advances in Cryptology, ASIACRYPT 2008,
+ *     Vol. 5350, pp. 326-343.
+ */
+
+/* Projective coordinates */
+struct ed25519_pt {
+	uint8_t  x[F25519_SIZE];
+	uint8_t  y[F25519_SIZE];
+	uint8_t  t[F25519_SIZE];
+	uint8_t  z[F25519_SIZE];
+};
+
+extern const struct ed25519_pt ed25519_base;
+extern const struct ed25519_pt ed25519_neutral;
+
+/* Convert between projective and affine coordinates (x/y in F25519) */
+void ed25519_project(struct ed25519_pt *p,
+		     const uint8_t *x, const uint8_t *y);
+
+void ed25519_unproject(uint8_t *x, uint8_t *y,
+		       const struct ed25519_pt *p);
+
+/* Compress/uncompress points. try_unpack() will check that the
+ * compressed point is on the curve, returning 1 if the unpacked point
+ * is valid, and 0 otherwise.
+ */
+#define ED25519_PACK_SIZE  F25519_SIZE
+
+void ed25519_pack(uint8_t *c, const uint8_t *x, const uint8_t *y);
+uint8_t ed25519_try_unpack(uint8_t *x, uint8_t *y, const uint8_t *c);
+
+/* Add, double and scalar multiply */
+#define ED25519_EXPONENT_SIZE  32
+
+/* Prepare an exponent by clamping appropriate bits */
+static inline void ed25519_prepare(uint8_t *e)
+{
+	e[0] &= 0xf8;
+	e[31] &= 0x7f;
+	e[31] |= 0x40;
+}
+
+/* Copy a point */
+static inline void ed25519_copy(struct ed25519_pt *dst,
+				const struct ed25519_pt *src)
+{
+	memcpy(dst, src, sizeof(*dst));
+}
+
+void ed25519_add(struct ed25519_pt *r,
+		 const struct ed25519_pt *a, const struct ed25519_pt *b);
+void ed25519_double(struct ed25519_pt *r, const struct ed25519_pt *a);
+void ed25519_smult(struct ed25519_pt *r, const struct ed25519_pt *a,
+		   const uint8_t *e);
+
+#endif
--- a/installer/signplugin/tiny/edsign.c
+++ b/installer/signplugin/tiny/edsign.c
@ -0,0 +1,168 @@
+/* Edwards curve signature system
+ * Daniel Beer <dlbeer@gmail.com>, 22 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#include "ed25519.h"
+#include "sha512.h"
+#include "fprime.h"
+#include "edsign.h"
+
+#define EXPANDED_SIZE  64
+
+static const uint8_t ed25519_order[FPRIME_SIZE] = {
+	0xed, 0xd3, 0xf5, 0x5c, 0x1a, 0x63, 0x12, 0x58,
+	0xd6, 0x9c, 0xf7, 0xa2, 0xde, 0xf9, 0xde, 0x14,
+	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10
+};
+
+static void expand_key(uint8_t *expanded, const uint8_t *secret)
+{
+	struct sha512_state s;
+
+	sha512_init(&s);
+	sha512_final(&s, secret, EDSIGN_SECRET_KEY_SIZE);
+	sha512_get(&s, expanded, 0, EXPANDED_SIZE);
+	ed25519_prepare(expanded);
+}
+
+static uint8_t upp(struct ed25519_pt *p, const uint8_t *packed)
+{
+	uint8_t x[F25519_SIZE];
+	uint8_t y[F25519_SIZE];
+	uint8_t ok = ed25519_try_unpack(x, y, packed);
+
+	ed25519_project(p, x, y);
+	return ok;
+}
+
+static void pp(uint8_t *packed, const struct ed25519_pt *p)
+{
+	uint8_t x[F25519_SIZE];
+	uint8_t y[F25519_SIZE];
+
+	ed25519_unproject(x, y, p);
+	ed25519_pack(packed, x, y);
+}
+
+static void sm_pack(uint8_t *r, const uint8_t *k)
+{
+	struct ed25519_pt p;
+
+	ed25519_smult(&p, &ed25519_base, k);
+	pp(r, &p);
+}
+
+void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret)
+{
+	uint8_t expanded[EXPANDED_SIZE];
+
+	expand_key(expanded, secret);
+	sm_pack(pub, expanded);
+}
+
+static void hash_with_prefix(uint8_t *out_fp,
+			     uint8_t *init_block, unsigned int prefix_size,
+			     const uint8_t *message, size_t len)
+{
+	struct sha512_state s;
+
+	sha512_init(&s);
+
+	if (len < SHA512_BLOCK_SIZE && len + prefix_size < SHA512_BLOCK_SIZE) {
+		memcpy(init_block + prefix_size, message, len);
+		sha512_final(&s, init_block, len + prefix_size);
+	} else {
+		size_t i;
+
+		memcpy(init_block + prefix_size, message,
+		       SHA512_BLOCK_SIZE - prefix_size);
+		sha512_block(&s, init_block);
+
+		for (i = SHA512_BLOCK_SIZE - prefix_size;
+		     i + SHA512_BLOCK_SIZE <= len;
+		     i += SHA512_BLOCK_SIZE)
+			sha512_block(&s, message + i);
+
+		sha512_final(&s, message + i, len + prefix_size);
+	}
+
+	sha512_get(&s, init_block, 0, SHA512_HASH_SIZE);
+	fprime_from_bytes(out_fp, init_block, SHA512_HASH_SIZE, ed25519_order);
+}
+
+static void generate_k(uint8_t *k, const uint8_t *kgen_key,
+		       const uint8_t *message, size_t len)
+{
+	uint8_t block[SHA512_BLOCK_SIZE];
+
+	memcpy(block, kgen_key, 32);
+	hash_with_prefix(k, block, 32, message, len);
+}
+
+static void hash_message(uint8_t *z, const uint8_t *r, const uint8_t *a,
+			 const uint8_t *m, size_t len)
+{
+	uint8_t block[SHA512_BLOCK_SIZE];
+
+	memcpy(block, r, 32);
+	memcpy(block + 32, a, 32);
+	hash_with_prefix(z, block, 64, m, len);
+}
+
+void edsign_sign(uint8_t *signature, const uint8_t *pub,
+		 const uint8_t *secret,
+		 const uint8_t *message, size_t len)
+{
+	uint8_t expanded[EXPANDED_SIZE];
+	uint8_t e[FPRIME_SIZE];
+	uint8_t s[FPRIME_SIZE];
+	uint8_t k[FPRIME_SIZE];
+	uint8_t z[FPRIME_SIZE];
+
+	expand_key(expanded, secret);
+
+	/* Generate k and R = kB */
+	generate_k(k, expanded + 32, message, len);
+	sm_pack(signature, k);
+
+	/* Compute z = H(R, A, M) */
+	hash_message(z, signature, pub, message, len);
+
+	/* Obtain e */
+	fprime_from_bytes(e, expanded, 32, ed25519_order);
+
+	/* Compute s = ze + k */
+	fprime_mul(s, z, e, ed25519_order);
+	fprime_add(s, k, ed25519_order);
+	memcpy(signature + 32, s, 32);
+}
+
+uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub,
+		      const uint8_t *message, size_t len)
+{
+	struct ed25519_pt p;
+	struct ed25519_pt q;
+	uint8_t lhs[F25519_SIZE];
+	uint8_t rhs[F25519_SIZE];
+	uint8_t z[FPRIME_SIZE];
+	uint8_t ok = 1;
+
+	/* Compute z = H(R, A, M) */
+	hash_message(z, signature, pub, message, len);
+
+	/* sB = (ze + k)B = ... */
+	sm_pack(lhs, signature + 32);
+
+	/* ... = zA + R */
+	ok &= upp(&p, pub);
+	ed25519_smult(&p, &p, z);
+	ok &= upp(&q, signature);
+	ed25519_add(&p, &p, &q);
+	pp(rhs, &p);
+
+	/* Equal? */
+	return ok & f25519_eq(lhs, rhs);
+}
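The check in edsign_verify() rests on the identity sB = (ze + k)B = zA + R, where A = eB is the public key and R = kB. A worked sketch of that arithmetic follows, assuming a toy group (integers modulo a small prime under addition) instead of the curve, and a caller-supplied z instead of a SHA-512 hash. All names and constants here are hypothetical, and the toy group offers no security whatsoever:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ORDER 101u /* toy group order, standing in for ed25519_order */
#define TOY_BASE  5u   /* toy "base point" B */

/* "Scalar multiply": k*P in the toy group of integers mod TOY_ORDER. */
static uint32_t toy_smult(uint32_t p, uint32_t k)
{
	return (uint32_t)(((uint64_t)(p % TOY_ORDER) * (k % TOY_ORDER)) %
			  TOY_ORDER);
}

/* Sign: from secret e, nonce k and hash value z, produce
 * R = kB and s = ze + k (mod TOY_ORDER).
 */
static void toy_sign(uint32_t *r, uint32_t *s,
		     uint32_t e, uint32_t k, uint32_t z)
{
	assert(r != 0 && s != 0);

	*r = toy_smult(TOY_BASE, k);
	*s = (toy_smult(z, e) + k % TOY_ORDER) % TOY_ORDER;
}

/* Verify: accept when sB == zA + R, where A = eB is the public key. */
static int toy_verify(uint32_t r, uint32_t s, uint32_t a, uint32_t z)
{
	const uint32_t lhs = toy_smult(TOY_BASE, s);
	const uint32_t rhs = (toy_smult(a, z) + r % TOY_ORDER) % TOY_ORDER;

	return lhs == rhs;
}
```

With e = 17, k = 23 and z = 42, this gives A = 85, R = 14 and s = 30, and both sides of the verification equation evaluate to 49.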
--- a/installer/signplugin/tiny/edsign.h
+++ b/installer/signplugin/tiny/edsign.h
@ -0,0 +1,51 @@
+/* Edwards curve signature system
+ * Daniel Beer <dlbeer@gmail.com>, 22 Apr 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef EDSIGN_H_
+#define EDSIGN_H_
+
+#include <stdint.h>
+#include <stddef.h>
+
+/* This is the Ed25519 signature system, as described in:
+ *
+ *     Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin
+ *     Yang. High-speed high-security signatures. Journal of Cryptographic
+ *     Engineering 2 (2012), 77-89. Document ID:
+ *     a1a62a2f76d23f65d622484ddd09caf8. URL:
+ *     http://cr.yp.to/papers.html#ed25519. Date: 2011.09.26.
+ *
+ * The format and calculation of signatures are compatible with the
+ * Ed25519 implementation in SUPERCOP. Note, however, that our secret
+ * keys are half the size: we don't store a copy of the public key in
+ * the secret key (we generate it on demand).
+ */
+
+/* Any string of 32 random bytes is a valid secret key. There is no
+ * clamping of bits, because we don't use the key directly as an
+ * exponent (the exponent is derived from part of a key expansion).
+ */
+#define EDSIGN_SECRET_KEY_SIZE  32
+
+/* Given a secret key, produce the public key (a packed Edwards-curve
+ * point).
+ */
+#define EDSIGN_PUBLIC_KEY_SIZE  32
+
+void edsign_sec_to_pub(uint8_t *pub, const uint8_t *secret);
+
+/* Produce a signature for a message. */
+#define EDSIGN_SIGNATURE_SIZE  64
+
+void edsign_sign(uint8_t *signature, const uint8_t *pub,
+		 const uint8_t *secret,
+		 const uint8_t *message, size_t len);
+
+/* Verify a message signature. Returns non-zero if ok. */
+uint8_t edsign_verify(const uint8_t *signature, const uint8_t *pub,
+		      const uint8_t *message, size_t len);
+
+#endif
--- a/installer/signplugin/tiny/f25519.c
+++ b/installer/signplugin/tiny/f25519.c
@ -0,0 +1,324 @@
+/* Arithmetic mod p = 2^255-19
+ * Daniel Beer <dlbeer@gmail.com>, 5 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#include "f25519.h"
+
+const uint8_t f25519_zero[F25519_SIZE] = {0};
+const uint8_t f25519_one[F25519_SIZE] = {1};
+
+void f25519_load(uint8_t *x, uint32_t c)
+{
+	unsigned int i;
+
+	for (i = 0; i < sizeof(c); i++) {
+		x[i] = c;
+		c >>= 8;
+	}
+
+	for (; i < F25519_SIZE; i++)
+		x[i] = 0;
+}
+
+void f25519_normalize(uint8_t *x)
+{
+	uint8_t minusp[F25519_SIZE];
+	uint16_t c;
+	int i;
+
+	/* Reduce using 2^255 = 19 mod p */
+	c = (x[31] >> 7) * 19;
+	x[31] &= 127;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += x[i];
+		x[i] = c;
+		c >>= 8;
+	}
+
+	/* The number is now less than 2^255 + 18, and therefore less than
+	 * 2p. Try subtracting p, and conditionally load the subtracted
+	 * value if underflow did not occur.
+	 */
+	c = 19;
+
+	for (i = 0; i + 1 < F25519_SIZE; i++) {
+		c += x[i];
+		minusp[i] = c;
+		c >>= 8;
+	}
+
+	c += ((uint16_t)x[i]) - 128;
+	minusp[31] = c;
+
+	/* Load x-p if no underflow */
+	f25519_select(x, minusp, x, (c >> 15) & 1);
+}
+
+uint8_t f25519_eq(const uint8_t *x, const uint8_t *y)
+{
+	uint8_t sum = 0;
+	int i;
+
+	for (i = 0; i < F25519_SIZE; i++)
+		sum |= x[i] ^ y[i];
+
+	sum |= (sum >> 4);
+	sum |= (sum >> 2);
+	sum |= (sum >> 1);
+
+	return (sum ^ 1) & 1;
+}
+
+void f25519_select(uint8_t *dst,
+		   const uint8_t *zero, const uint8_t *one,
+		   uint8_t condition)
+{
+	const uint8_t mask = -condition;
+	int i;
+
+	for (i = 0; i < F25519_SIZE; i++)
+		dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i]));
+}
+
+void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b)
+{
+	uint16_t c = 0;
+	int i;
+
+	/* Add */
+	for (i = 0; i < F25519_SIZE; i++) {
+		c >>= 8;
+		c += ((uint16_t)a[i]) + ((uint16_t)b[i]);
+		r[i] = c;
+	}
+
+	/* Reduce with 2^255 = 19 mod p */
+	r[31] &= 127;
+	c = (c >> 7) * 19;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += r[i];
+		r[i] = c;
+		c >>= 8;
+	}
+}
+
+void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b)
+{
+	uint32_t c = 0;
+	int i;
+
+	/* Calculate a + 2p - b, to avoid underflow */
+	c = 218;
+	for (i = 0; i + 1 < F25519_SIZE; i++) {
+		c += 65280 + ((uint32_t)a[i]) - ((uint32_t)b[i]);
+		r[i] = c;
+		c >>= 8;
+	}
+
+	c += ((uint32_t)a[31]) - ((uint32_t)b[31]);
+	r[31] = c & 127;
+	c = (c >> 7) * 19;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += r[i];
+		r[i] = c;
+		c >>= 8;
+	}
+}
+
+void f25519_neg(uint8_t *r, const uint8_t *a)
+{
+	uint32_t c = 0;
+	int i;
+
+	/* Calculate 2p - a, to avoid underflow */
+	c = 218;
+	for (i = 0; i + 1 < F25519_SIZE; i++) {
+		c += 65280 - ((uint32_t)a[i]);
+		r[i] = c;
+		c >>= 8;
+	}
+
+	c -= ((uint32_t)a[31]);
+	r[31] = c & 127;
+	c = (c >> 7) * 19;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += r[i];
+		r[i] = c;
+		c >>= 8;
+	}
+}
+
+void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b)
+{
+	uint32_t c = 0;
+	int i;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		int j;
+
+		c >>= 8;
+		for (j = 0; j <= i; j++)
+			c += ((uint32_t)a[j]) * ((uint32_t)b[i - j]);
+
+		for (; j < F25519_SIZE; j++)
+			c += ((uint32_t)a[j]) *
+			     ((uint32_t)b[i + F25519_SIZE - j]) * 38;
+
+		r[i] = c;
+	}
+
+	r[31] &= 127;
+	c = (c >> 7) * 19;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += r[i];
+		r[i] = c;
+		c >>= 8;
+	}
+}
+
+void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b)
+{
+	uint8_t tmp[F25519_SIZE];
+
+	f25519_mul__distinct(tmp, a, b);
+	f25519_copy(r, tmp);
+}
+
+void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b)
+{
+	uint32_t c = 0;
+	int i;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c >>= 8;
+		c += b * ((uint32_t)a[i]);
+		r[i] = c;
+	}
+
+	r[31] &= 127;
+	c >>= 7;
+	c *= 19;
+
+	for (i = 0; i < F25519_SIZE; i++) {
+		c += r[i];
+		r[i] = c;
+		c >>= 8;
+	}
+}
+
+void f25519_inv__distinct(uint8_t *r, const uint8_t *x)
+{
+	uint8_t s[F25519_SIZE];
+	int i;
+
+	/* This is a prime field, so by Fermat's little theorem:
+	 *
+	 *     x^(p-1) = 1 mod p
+	 *
+	 * Therefore, raise to (p-2) = 2^255-21 to get a multiplicative
+	 * inverse.
+	 *
+	 * This is a 255-bit binary number with the digits:
+	 *
+	 *     11111111... 01011
+	 *
+	 * We compute the result by the usual binary chain, but
+	 * alternate between keeping the accumulator in r and s, so as
+	 * to avoid copying temporaries.
+	 */
+
+	/* 1 1 */
+	f25519_mul__distinct(s, x, x);
+	f25519_mul__distinct(r, s, x);
+
+	/* 1 x 248 */
+	for (i = 0; i < 248; i++) {
+		f25519_mul__distinct(s, r, r);
+		f25519_mul__distinct(r, s, x);
+	}
+
+	/* 0 */
+	f25519_mul__distinct(s, r, r);
+
+	/* 1 */
+	f25519_mul__distinct(r, s, s);
+	f25519_mul__distinct(s, r, x);
+
+	/* 0 */
+	f25519_mul__distinct(r, s, s);
+
+	/* 1 */
+	f25519_mul__distinct(s, r, r);
+	f25519_mul__distinct(r, s, x);
+
+	/* 1 */
+	f25519_mul__distinct(s, r, r);
+	f25519_mul__distinct(r, s, x);
+}
+
+void f25519_inv(uint8_t *r, const uint8_t *x)
+{
+	uint8_t tmp[F25519_SIZE];
+
+	f25519_inv__distinct(tmp, x);
+	f25519_copy(r, tmp);
+}
+
+/* Raise x to the power of (p-5)/8 = 2^252-3, using s for temporary
+ * storage.
+ */
+static void exp2523(uint8_t *r, const uint8_t *x, uint8_t *s)
+{
+	int i;
+
+	/* This number is a 252-bit number with the binary expansion:
+	 *
+	 *     111111... 01
+	 */
+
+	/* 1 1 */
+	f25519_mul__distinct(r, x, x);
+	f25519_mul__distinct(s, r, x);
+
+	/* 1 x 248 */
+	for (i = 0; i < 248; i++) {
+		f25519_mul__distinct(r, s, s);
+		f25519_mul__distinct(s, r, x);
+	}
+
+	/* 0 */
+	f25519_mul__distinct(r, s, s);
+
+	/* 1 */
+	f25519_mul__distinct(s, r, r);
+	f25519_mul__distinct(r, s, x);
+}
+
+void f25519_sqrt(uint8_t *r, const uint8_t *a)
+{
+	uint8_t v[F25519_SIZE];
+	uint8_t i[F25519_SIZE];
+	uint8_t x[F25519_SIZE];
+	uint8_t y[F25519_SIZE];
+
+	/* v = (2a)^((p-5)/8) [x = 2a] */
+	f25519_mul_c(x, a, 2);
+	exp2523(v, x, y);
+
+	/* i = 2av^2 - 1 */
+	f25519_mul__distinct(y, v, v);
+	f25519_mul__distinct(i, x, y);
+	f25519_load(y, 1);
+	f25519_sub(i, i, y);
+
+	/* r = avi */
+	f25519_mul__distinct(x, v, a);
+	f25519_mul__distinct(r, x, i);
+}
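Note on the inversion chain above: f25519_inv__distinct hard-codes the bit pattern of p-2 as a fixed sequence of squarings and multiplications. The same left-to-right square-and-multiply idea can be sketched on a toy prime that fits in a machine word. The names toy_mulmod and toy_inv are hypothetical and are not part of the commit. Unlike the fixed chain above, this sketch branches on the exponent bits, which is acceptable here only because the exponent p-2 is public.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: (a * b) mod p, valid for a, b < p < 2^32. */
static uint64_t toy_mulmod(uint64_t a, uint64_t b, uint64_t p)
{
	return (a * b) % p;
}

/* Hypothetical helper: x^(p-2) mod p for a prime p, scanning the
 * exponent from the most significant bit down. Each bit costs one
 * squaring, and each set bit costs one extra multiplication by x.
 */
static uint64_t toy_inv(uint64_t x, uint64_t p)
{
	const uint64_t e = p - 2;
	uint64_t r = 1;
	int i;

	assert(p > 2 && p < (1ULL << 32));
	x %= p;

	for (i = 31; i >= 0; i--) {
		r = toy_mulmod(r, r, p);
		if ((e >> i) & 1)
			r = toy_mulmod(r, x, p);
	}

	return r;
}
```

For p = 7 the exponent is 5 (binary 101), so toy_inv(3, 7) squares three times and multiplies twice, giving 5, and 3 * 5 = 15 = 1 mod 7.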
--- a/installer/signplugin/tiny/f25519.h
+++ b/installer/signplugin/tiny/f25519.h
@ -0,0 +1,92 @@
+/* Arithmetic mod p = 2^255-19
+ * Daniel Beer <dlbeer@gmail.com>, 8 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#ifndef F25519_H_
+#define F25519_H_
+
+#include <stdint.h>
+#include <string.h>
+
+/* Field elements are represented as little-endian byte strings. All
+ * operations have timings which are independent of input data, so they
+ * can be safely used for cryptography.
+ *
+ * Computation is performed on un-normalized elements. These are byte
+ * strings which fall into the range 0 <= x < 2p. Use f25519_normalize()
+ * to convert to a value 0 <= x < p.
+ *
+ * Elements received from the outside may be greater even than 2p.
+ * f25519_normalize() will correctly deal with these numbers too.
+ */
+#define F25519_SIZE  32
+
+/* Identity constants */
+extern const uint8_t f25519_zero[F25519_SIZE];
+extern const uint8_t f25519_one[F25519_SIZE];
+
+/* Load a small constant */
+void f25519_load(uint8_t *x, uint32_t c);
+
+/* Copy field point a into x */
+static inline void f25519_copy(uint8_t *x, const uint8_t *a)
+{
+	memcpy(x, a, F25519_SIZE);
+}
+
+/* Normalize a field point x < 2*p by subtracting p if necessary */
+void f25519_normalize(uint8_t *x);
+
+/* Compare two field points in constant time. Return one if equal, zero
+ * otherwise. This should be performed only on normalized values.
+ */
+uint8_t f25519_eq(const uint8_t *x, const uint8_t *y);
+
+/* Conditional copy. If condition == 0, then zero is copied to dst. If
+ * condition == 1, then one is copied to dst. Any other value results in
+ * undefined behaviour.
+ */
+void f25519_select(uint8_t *dst,
+		   const uint8_t *zero, const uint8_t *one,
+		   uint8_t condition);
+
+/* Add/subtract two field points. The three pointers are not required to
+ * be distinct.
+ */
+void f25519_add(uint8_t *r, const uint8_t *a, const uint8_t *b);
+void f25519_sub(uint8_t *r, const uint8_t *a, const uint8_t *b);
+
+/* Unary negation */
+void f25519_neg(uint8_t *r, const uint8_t *a);
+
+/* Multiply two field points. The __distinct variant is used when r is
+ * known to be in a different location to a and b.
+ */
+void f25519_mul(uint8_t *r, const uint8_t *a, const uint8_t *b);
+void f25519_mul__distinct(uint8_t *r, const uint8_t *a, const uint8_t *b);
+
+/* Multiply a point by a small constant. The two pointers are not
+ * required to be distinct.
+ *
+ * The constant must be less than 2^24.
+ */
+void f25519_mul_c(uint8_t *r, const uint8_t *a, uint32_t b);
+
+/* Take the reciprocal of a field point. The __distinct variant is used
+ * when r is known to be in a different location to x.
+ */
+void f25519_inv(uint8_t *r, const uint8_t *x);
+void f25519_inv__distinct(uint8_t *r, const uint8_t *x);
+
+/* Compute one of the square roots of the field element, if the element
+ * is square. The other square root is -r.
+ *
+ * If the input is not square, the returned value is a valid field
+ * element, but not the correct answer. If you don't already know that
+ * your element is square, you should square the return value and test.
+ */
+void f25519_sqrt(uint8_t *r, const uint8_t *x);
+
+#endif
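Note on f25519_sqrt: the header asks callers to square the result and compare when the input is not known to be a square. A minimal sketch of that check, using the same formula as f25519_sqrt (v = (2a)^((p-5)/8), i = 2av^2, r = av(i-1)) on a toy prime with p = 5 mod 8. The names toy_pow and toy_sqrt are hypothetical, and the sketch is not constant-time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: b^e mod p by right-to-left binary exponentiation. */
static uint32_t toy_pow(uint32_t b, uint32_t e, uint32_t p)
{
	uint64_t r = 1;
	uint64_t acc = b % p;

	while (e) {
		if (e & 1)
			r = r * acc % p;
		acc = acc * acc % p;
		e >>= 1;
	}

	return (uint32_t)r;
}

/* Hypothetical helper: store a candidate square root of a in *r and
 * return 1 if squaring it gives back a, 0 otherwise.
 */
static int toy_sqrt(uint32_t *r, uint32_t a, uint32_t p)
{
	uint64_t two_a, v, i, cand;

	assert(p % 8 == 5 && p < 65536);
	a %= p;

	two_a = 2 * (uint64_t)a % p;                 /* x = 2a */
	v = toy_pow((uint32_t)two_a, (p - 5) / 8, p); /* v = x^((p-5)/8) */
	i = two_a * v % p * v % p;                   /* i = 2av^2 */
	cand = a * v % p * ((i + p - 1) % p) % p;    /* r = av(i-1) */

	*r = (uint32_t)cand;
	return cand * cand % p == a;
}
```

With p = 13, the input 4 yields the root 11 (11^2 = 121 = 4 mod 13), while the non-square 2 yields 10, whose square is 9, so the check fails as the header comment warns.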
--- a/installer/signplugin/tiny/fprime.c
+++ b/installer/signplugin/tiny/fprime.c
@ -0,0 +1,215 @@
+/* Arithmetic in prime fields
+ * Daniel Beer <dlbeer@gmail.com>, 10 Jan 2014
+ *
+ * This file is in the public domain.
+ */
+
+#include "fprime.h"
+
+const uint8_t fprime_zero[FPRIME_SIZE] = {0};
+const uint8_t fprime_one[FPRIME_SIZE] = {1};
+
+static void raw_add(uint8_t *x, const uint8_t *p)
+{
+	uint16_t c = 0;
+	int i;
+
+	for (i = 0; i < FPRIME_SIZE; i++) {
+		c += ((uint16_t)x[i]) + ((uint16_t)p[i]);
+		x[i] = c;
+		c >>= 8;
+	}
+}
+
+static void raw_try_sub(uint8_t *x, const uint8_t *p)
+{
+	uint8_t minusp[FPRIME_SIZE];
+	uint16_t c = 0;
+	int i;
+
+	for (i = 0; i < FPRIME_SIZE; i++) {
+		c = ((uint16_t)x[i]) - ((uint16_t)p[i]) - c;
+		minusp[i] = c;
+		c = (c >> 8) & 1;
+	}
+
+	fprime_select(x, minusp, x, c);
+}
+
+/* Warning: this function is variable-time */
+static int prime_msb(const uint8_t *p)
+{
+	int i;
+	uint8_t x;
+
+	for (i = FPRIME_SIZE - 1; i >= 0; i--)
+		if (p[i])
+			break;
+
+	x = p[i];
+	i <<= 3;
+
+	while (x) {
+		x >>= 1;
+		i++;
+	}
+
+	return i - 1;
+}
+
+/* Warning: this function may be variable-time in the argument n */
+static void shift_n_bits(uint8_t *x, int n)
+{
+	uint16_t c = 0;
+	int i;
+
+	for (i = 0; i < FPRIME_SIZE; i++) {
+		c |= ((uint16_t)x[i]) << n;
+		x[i] = c;
+		c >>= 8;
+	}
+}
+
+void fprime_load(uint8_t *x, uint32_t c)
+{
+	unsigned int i;
+
+	for (i = 0; i < sizeof(c); i++) {
+		x[i] = c;
+		c >>= 8;
+	}
+
+	for (; i < FPRIME_SIZE; i++)
+		x[i] = 0;
+}
+
+static inline int min_int(int a, int b)
+{
+	return a < b ? a : b;
+}
+
+void fprime_from_bytes(uint8_t *n,
+		       const uint8_t *x, size_t len,
+		       const uint8_t *modulus)
+{
+	const int preload_total = min_int(prime_msb(modulus) - 1, len << 3);
+	const int preload_bytes = preload_total >> 3;
+	const int preload_bits = preload_total & 7;
+	const int rbits = (len << 3) - preload_total;
+	int i;
+
+	memset(n, 0, FPRIME_SIZE);
+
+	for (i = 0; i < preload_bytes; i++)
+		n[i] = x[len - preload_bytes + i];
+
+	if (preload_bits) {
+		shift_n_bits(n, preload_bits);
+		n[0] |= x[len - preload_bytes - 1] >> (8 - preload_bits);
+	}
+
+	for (i = rbits - 1; i >= 0; i--) {
+		const uint8_t bit = (x[i >> 3] >> (i & 7)) & 1;
+
+		shift_n_bits(n, 1);
+		n[0] |= bit;
+		raw_try_sub(n, modulus);
+	}
+}
+
+void fprime_normalize(uint8_t *x, const uint8_t *modulus)
+{
+	uint8_t n[FPRIME_SIZE];
+
+	fprime_from_bytes(n, x, FPRIME_SIZE, modulus);
+	fprime_copy(x, n);
+}
+
+uint8_t fprime_eq(const uint8_t *x, const uint8_t *y)
+{
+	uint8_t sum = 0;
+	int i;
+
+	for (i = 0; i < FPRIME_SIZE; i++)
+		sum |= x[i] ^ y[i];
+
+	sum |= (sum >> 4);
+	sum |= (sum >> 2);
+	sum |= (sum >> 1);
+
+	return (sum ^ 1) & 1;
+}
+
+void fprime_select(uint8_t *dst,
+		   const uint8_t *zero, const uint8_t *one,
+		   uint8_t condition)
+{
+	const uint8_t mask = -condition;
+	int i;
+
+	for (i = 0; i < FPRIME_SIZE; i++)
+		dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i]));
+}
+
+void fprime_add(uint8_t *r, const uint8_t *a, const uint8_t *modulus)
+{
+	raw_add(r, a);
+	raw_try_sub(r, modulus);
+}
+
+void fprime_sub(uint8_t *r, const uint8_t *a, const uint8_t *modulus)
+{
+	raw_add(r, modulus);
+	raw_try_sub(r, a);
+	raw_try_sub(r, modulus);
+}
+
+void fprime_mul(uint8_t *r, const uint8_t *a, const uint8_t *b,
+		const uint8_t *modulus)
+{
+	int i;
+
+	memset(r, 0, FPRIME_SIZE);
+
+	for (i = prime_msb(modulus); i >= 0; i--) {
+		const uint8_t bit = (b[i >> 3] >> (i & 7)) & 1;
+		uint8_t plusa[FPRIME_SIZE];
+
+		shift_n_bits(r, 1);
+		raw_try_sub(r, modulus);
+
+		fprime_copy(plusa, r);
+		fprime_add(plusa, a, modulus);
+
+		fprime_select(r, r, plusa, bit);
+	}
+}
+
+void fprime_inv(uint8_t *r, const uint8_t *a, const uint8_t *modulus)
+{
+	uint8_t pm2[FPRIME_SIZE];
+	uint16_t c = 2;
+	int i;
+
+	/* Compute (p-2) */
+	fprime_copy(pm2, modulus);
+	for (i = 0; i < FPRIME_SIZE; i++) {
+		c = modulus[i] - c;
+		pm2[i] = c;
+		c = (c >> 8) & 1;
+	}
+
+	/* Binary exponentiation */
+	fprime_load(r, 1);
+
+	for (i = prime_msb(modulus); i >= 0; i--) {
+		uint8_t r2[FPRIME_SIZE];
+
+		fprime_mul(r2, r, r, modulus);
+
+		if ((pm2[i >> 3] >> (i & 7)) & 1)
+			fprime_mul(r, r2, a, modulus);
+		else
+			fprime_copy(r, r2);
+	}
+}
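Note on fprime_select and fprime_eq: both avoid data-dependent branches by turning a one-bit condition into a byte mask. A self-contained sketch of the same masking technique follows. The names ct_select and ct_eq and the explicit len parameter are assumptions made for illustration; the functions in the commit work on fixed FPRIME_SIZE buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy zero[] to dst[] when condition == 0 and
 * one[] when condition == 1, without branching on condition.
 */
static void ct_select(uint8_t *dst,
		      const uint8_t *zero, const uint8_t *one,
		      uint8_t condition, size_t len)
{
	/* 0 -> 0x00, 1 -> 0xff */
	const uint8_t mask = (uint8_t)-condition;
	size_t i;

	assert(condition <= 1);

	for (i = 0; i < len; i++)
		dst[i] = zero[i] ^ (mask & (one[i] ^ zero[i]));
}

/* Hypothetical helper: return 1 if x and y are equal over len bytes,
 * 0 otherwise, always touching every byte.
 */
static uint8_t ct_eq(const uint8_t *x, const uint8_t *y, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum |= x[i] ^ y[i];

	/* Fold any set bit down into bit 0 */
	sum |= (sum >> 4);
	sum |= (sum >> 2);
	sum |= (sum >> 1);

	return (sum ^ 1) & 1;
}
```

The XOR form dst = zero ^ (mask & (one ^ zero)) also stays correct when dst aliases one of the inputs, which is how raw_try_sub calls fprime_select.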
--- a/Show more
+++ b/Show more
				`@ -0,0 +1,2 @@`
				`g++7 -I . -O2 -static -mssse3 -o tunsafe benchmark.cpp tunsafe_cpu.cpp wireguard_config.cpp wireguard.cpp wireguard_proto.cpp util.cpp network_bsd.cpp crypto/blake2s.cpp crypto/blake2s_sse.cpp crypto/chacha20poly1305.cpp crypto/curve25519-donna.cpp crypto/siphash.cpp crypto/chacha20_x64_gas.s crypto/poly1305_x64_gas.s ipzip2/ipzip2.cpp -lrt`