在阅读了https://www.cnblogs.com/Imageshop/p/9069650.html 博主的三次卷积插值的进一步SSE优化文章后,对这个算法产生了很大的兴趣,然而复现代码的过程是艰难的,好在作者了提供了部分代码,经过1周的努力,我实现了完整的支持1通道,3通道,4通道的三次卷积插值算法的SSE优化代码。现在准备分享一下这个算法的实现过程。
算法原理之前我在我的另外一个工程: https://github.com/BBuf/Image-processing-algorithm/tree/master/Image%20Interpolation 实现过3次卷积插值,百度百科提供的原理如下:https://baike.baidu.com/item/%E5%8F%8C%E4%B8%89%E6%AC%A1%E6%8F%92%E5%80%BC/11055947?fr=aladdin 。这个过程可以用另外一套理论来解释,也就是:
图片截图自ImageShop的文章。先贴出使用三次卷积插值的简单代码:
void Bicubic_Original(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, float X, float Y)
{int Channel = Stride / Width;int PosX = floor(X), PosY = floor(Y);float PartXX = X - PosX, PartYY = Y - PosY;unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);float U0 = SinXDivX(1 + PartXX), U1 = SinXDivX(PartXX);float U2 = SinXDivX(1 - PartXX), U3 = SinXDivX(2 - PartXX);float V0 = SinXDivX(1 + PartYY), V1 = SinXDivX(PartYY);float V2 = SinXDivX(1 - PartYY), V3 = SinXDivX(2 - PartYY);for (int I = 0; I
从这里可以看到,我们引入上面那2条曲线正是为了计算每个像素点在三次卷积插值时的系数。而提出一个拟合得表达式来近似Sin曲线的原因是什么呢?实际上当我们将X取值为0.3,然后按照这2个公式计算的结果如下:
SinXDivX_Standard(1 + X) + SinXDivX_Standard(X) + SinXDivX_Standard(1 - X) + SinXDivX_Standard(2 - X) = 0.8767
SinXDivX(1 + X) + SinXDivX(X) + SinXDivX(1 - X) + SinXDivX(2 - X) =1
, 可以看到如果我们使用拟合得表达式,权重系数就不需要再进行归一化处理了。这两个函数的代码如下:
// 该函数计算插值曲线sin(x * PI) / (x * PI)的值,下面是它的近似拟合表达式
float SinXDivX(float X) {const float a &#61; -1; //a还可以取 a&#61;-2,-1,-0.75,-0.5等等&#xff0c;起到调节锐化或模糊程度的作用X &#61; abs(X);float X2 &#61; X * X, X3 &#61; X2 * X;if (X <&#61; 1)return (a &#43; 2) * X3 - (a &#43; 3) * X2 &#43; 1;else if (X <&#61; 2)return a * X3 - (5 * a) * X2 &#43; (8 * a) * X - (4 * a);elsereturn 0;
}// 精确计算插值曲线sin(x * PI) / (x * PI)
float SinXDivX_Standard(float X) {if (abs(X) <0.000001f)return 1;elsereturn sin(X * 3.1415926f) / (X * 3.1415926f);
}
三次卷积插值的原始实现
// 原始的插值算法
void IM_Resize_Cubic_Origin(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {int Channel &#61; StrideS / SrcW;if ((SrcW &#61;&#61; DstW) && (SrcH &#61;&#61; DstH)) {memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));return;}printf("%d\n", Channel);for (int Y &#61; 0; Y
优化1&#xff1a;边界特殊处理&#43;查表法
对原始实现一个显然的优化点在于&#xff0c;对于每个像素都要计算领域每个像素的系数&#xff0c;显然这个可以用定点化查表来实现。同时&#xff0c;如果我们把边界特殊处理&#xff0c;我们就可以省去很多GetCheckedPixel函数的调用&#xff0c;这个加速也是非常明显的。最后在使用了边界特殊处理&#43;定点化查表后&#xff0c;速度比原始实现大概提升了2倍。
// C语言实现的查表&#43;插值算法
void IM_Resize_Cubic_Table(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {int Channel &#61; StrideS / SrcW;if ((SrcW &#61;&#61; DstW) && (SrcH &#61;&#61; DstH)) {memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));return;}short *SinXDivX_Table &#61; (short *)malloc(513 * sizeof(short));for (int I &#61; 0; I <513; I&#43;&#43;)SinXDivX_Table[I] &#61; int(0.5 &#43; 256 * SinXDivX(I / 256.0f)); // 建立查找表&#xff0c;定点化int AddX &#61; (SrcW <<16) / DstW, AddY &#61; (SrcH <<16) / DstH;int ErrorX &#61; -(1 <<15) &#43; (AddX >> 1), ErrorY &#61; -(1 <<15) &#43; (AddY >> 1);int StartX &#61; ((1 <<16) - ErrorX) / AddX &#43; 1; // 计算出需要特殊处理的边界int StartY &#61; ((1 <<16) - ErrorY) / AddY &#43; 1; // y0&#43;y*yr>&#61;1; y0&#61;ErrorY &#61;> y>&#61;(1-ErrorY)/yrint EndX &#61; (((SrcW - 3) <<16) - ErrorX) / AddX &#43; 1;int EndY &#61; (((SrcH - 3) <<16) - ErrorY) / AddY &#43; 1; // y0&#43;y*yr<&#61;(height-3) &#61;> y<&#61;(height-3-ErrorY)/yrif (StartY >&#61; DstH) StartY &#61; DstH;if (StartX >&#61; DstW) StartX &#61; DstW;if (EndX
优化3&#xff1a;SSE优化
如果是一个通道和三个通道的图像&#xff0c;是很容易把上面的过程翻译为SSE代码的。比较复杂的是对于24位图像的处理&#xff0c;我这里偷了个懒&#xff0c;将RGB图像直接转为了RGBA图像&#xff0c;在RGBA图像操作之后将结果图像又转为了RGB图像&#xff0c;这个转化过程也使用了SSE优化。在SSE优化系列2-高斯滤波https://blog.csdn.net/just_sort/article/details/95212099中&#xff0c;留下了一个坑&#xff0c;就是BGR和BGRA相互转化的SSE优化&#xff0c;我这里填了这个坑点。实现的代码如下&#xff1a;
void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {const int BlockSize &#61; 4;int Block &#61; (Width - 2) / BlockSize;__m128i Mask &#61; _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);__m128i Mask2 &#61; _mm_setr_epi8(0, 2, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);__m128i Zero &#61; _mm_setzero_si128();for (int Y &#61; 0; Y
都是一些常规操作&#xff0c;可以看着数据模拟一下寄存器变量变化过程。这里再提供一个输出寄存器变量值的函数&#xff1a;
void debug(__m128i var) {uint8_t *val &#61; (uint8_t*)&var;//can also use uint32_t instead of 16_t printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",val[0], val[1], val[2], val[3], val[4], val[5],val[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],val[14], val[15]);
}
最后对单通道和四通道的图像进行三次卷积插值的SSE优化代码为&#xff1a;
// 使用SSE优化立方插值算法
// 最大支持图像大小为: 32767*32767
void IM_Resize_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {int Channel &#61; StrideS / SrcW;if ((SrcW &#61;&#61; DstW) && (SrcH &#61;&#61; DstH)) {memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));return;}short *SinXDivX_Table &#61; (short *)malloc(513 * sizeof(short));short *Table &#61; (short *)malloc(DstW * 4 * sizeof(short));for (int I &#61; 0; I <513; I&#43;&#43;)SinXDivX_Table[I] &#61; int(0.5 &#43; 256 * SinXDivX(I / 256.0f)); // 建立查找表&#xff0c;定点化int AddX &#61; (SrcW <<16) / DstW, AddY &#61; (SrcH <<16) / DstH;int ErrorX &#61; -(1 <<15) &#43; (AddX >> 1), ErrorY &#61; -(1 <<15) &#43; (AddY >> 1);int StartX &#61; ((1 <<16) - ErrorX) / AddX &#43; 1; // 计算出需要特殊处理的边界int StartY &#61; ((1 <<16) - ErrorY) / AddY &#43; 1; // y0&#43;y*yr>&#61;1; y0&#61;ErrorY &#61;> y>&#61;(1-ErrorY)/yrint EndX &#61; (((SrcW - 3) <<16) - ErrorX) / AddX &#43; 1;int EndY &#61; (((SrcH - 3) <<16) - ErrorY) / AddY &#43; 1; // y0&#43;y*yr<&#61;(height-3) &#61;> y<&#61;(height-3-ErrorY)/yrif (StartY >&#61; DstH) StartY &#61; DstH;if (StartX >&#61; DstW) StartX &#61; DstW;if (EndX
这个优化过程技巧性比较强&#xff0c;这里需要仔细说明一下。我们先来看Channel为1的情况&#xff0c;观察这句代码&#xff1a;Pixel00[I] * U0 &#43; Pixel01[I] * U1 &#43; Pixel02[I] * U2 &#43; Pixel03[I] * U3
&#xff0c;我们发现Pixel00,Pixel01,Pixel02,Pixel03
在内存中是连续的&#xff0c;且这个变量都是在0-256的值域中&#xff0c;显然我们可以使用SSE寄存器读取这几个变量。同时SSE中有一个_mm_madd_epi16
&#xff0c;实现的功能是&#xff1a;
__m128i _mm_madd_epi16 (__m128i a, __m128i b); Return ValueAdds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.r0 :&#61; (a0 * b0) &#43; (a1 * b1)r1 :&#61; (a2 * b2) &#43; (a3 * b3)r2 :&#61; (a4 * b4) &#43; (a5 * b5)r3 :&#61; (a6 * b6) &#43; (a7 * b7)
考虑我们函数中&#xff0c;行1到行4每行的代码都只有4次乘法和3次加法&#xff0c;不能直接使用&#xff0c;但是我们可以考虑把两行整合在一起&#xff0c;一次性计算&#xff0c;这样就需要调用2次_mm_madd_epi16
&#xff0c;然后2次的结果在调用_mm_hadd_epi32
这个水平方向的累加函数就能得到新的结果&#xff0c; _mm_hsum_epi32的实现如下:
// 4个有符号的32位的数据相加的和
inline int _mm_hsum_epi32(__m128i V) { //V3 V2 V1 V0__m128i T &#61; _mm_add_epi32(V, _mm_srli_si128(V, 8)); //V3&#43;V1 V2&#43;V0 V1 V0T &#61; _mm_add_epi32(T, _mm_srli_si128(T, 4)); //V3&#43;V1&#43;V2&#43;V0 V2&#43;V0&#43;V1 V1&#43;V0 V0return _mm_cvtsi128_si32(T); //提取低位
}
实现过程中还有很多以前重复介绍过的sse指令&#xff0c;可以在MSDN或我的资源中&#xff1a; https://github.com/BBuf/Image-processing-algorithm-Speed/tree/master/resources 查看。算法的具体过程请看代码&#xff0c; 一些重要地方我都注释好了。完整代码请在github查看。
速度测试可以看到我们的SSE优化版本代码还是比不过OpenCV3.1.0自带的函数&#xff0c;猜测OpenCV内部对BGR图像应该是直接在3个通道上操作&#xff0c;且优化做得更多更好&#xff0c;所以比我SSE优化的速度快了近3倍。
原图&#xff1a;
结果图&#xff0c;扩大了1.5倍&#xff1a;
https://www.cnblogs.com/Imageshop/p/9069650.html