My NN Layers

Modules

Encoding

class encoding.nn.Encoding(D, K)[source]

Encoding Layer: a learnable residual encoder.

[Figure: illustration of the Encoding Layer]

The Encoding Layer accepts 3D or 4D inputs. It treats an input feature map of shape \(C\times H\times W\) as a set of C-dimensional input features \(X=\{x_1, ...x_N\}\), where N is the total number of features, given by \(H\times W\). The layer learns an inherent codebook \(D=\{d_1,...d_K\}\) and a set of smoothing factors for the visual centers \(S=\{s_1,...s_K\}\). The Encoding Layer outputs the aggregated residuals with soft-assignment weights, \(e_k=\sum_{i=1}^N e_{ik}\), where

\[e_{ik} = \frac{\exp(-s_k\|r_{ik}\|^2)}{\sum_{j=1}^K \exp(-s_j\|r_{ij}\|^2)} r_{ik}\]

and the residuals are given by \(r_{ik} = x_i - d_k\). The output encoders are \(E=\{e_1,...e_K\}\).

Parameters:
  • D – dimension of the features (the number of feature channels)
  • K – number of codewords
Shape:
  • Input: \(X\in\mathcal{R}^{B\times N\times D}\) or \(\mathcal{R}^{B\times D\times H\times W}\) (where \(B\) is batch, \(N\) is total number of features or \(H\times W\).)
  • Output: \(E\in\mathcal{R}^{B\times K\times D}\)
Variables:
  • codewords (Tensor) – the learnable codewords of shape (\(K\times D\))
  • scale (Tensor) – the learnable scale factor of visual centers
Reference:
Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, Amit Agrawal. “Context Encoding for Semantic Segmentation.” CVPR 2018

Hang Zhang, Jia Xue, and Kristin Dana. “Deep TEN: Texture Encoding Network.” CVPR 2017

Examples

>>> import encoding
>>> import torch
>>> import torch.nn.functional as F
>>> from torch.autograd import Variable
>>> B,C,H,W,K = 2,3,4,5,6
>>> X = Variable(torch.cuda.DoubleTensor(B,C,H,W).uniform_(-0.5,0.5), requires_grad=True)
>>> layer = encoding.nn.Encoding(C,K).double().cuda()
>>> E = layer(X)
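
The soft assignment and aggregation above can also be written with plain tensor operations. The following is a minimal reference sketch of the formula, not the layer's own kernel; codewords and smoothing are stand-in tensors rather than the layer's learned parameters.

>>> codewords = torch.randn(K, C).double().cuda()       # stand-in codewords d_k
>>> smoothing = torch.rand(K).double().cuda()           # stand-in smoothing factors s_k
>>> Xf = X.view(B, C, -1).transpose(1, 2)               # (B, N, D) features, N = H*W
>>> R = Xf.unsqueeze(2) - codewords.view(1, 1, K, C)    # (B, N, K, D) residuals r_ik
>>> A = F.softmax(-smoothing * R.pow(2).sum(3), dim=2)  # (B, N, K) soft-assignment weights
>>> E_ref = (A.unsqueeze(3) * R).sum(1)                 # (B, K, D) encoders e_k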

Inspiration

class encoding.nn.Inspiration(C, B=1)[source]

The Inspiration Layer (CoMatch Layer) enables multi-style transfer in a feed-forward network by learning to match the target feature statistics during training. The module is differentiable and can be inserted into a standard feed-forward network, where it is learned directly from the loss function without additional supervision.

\[Y = \phi^{-1}[\phi(\mathcal{F}^T)W\mathcal{G}]\]

Please see the MSG-Net example for training a multi-style generative network for real-time transfer.

Reference:
Hang Zhang and Kristin Dana. “Multi-style Generative Network for Real-time Transfer.” arXiv preprint arXiv:1703.06953 (2017)
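
Examples

A minimal usage sketch (an assumption about typical use, not taken from the library's documentation): the layer is constructed with the number of feature channels C and applied to a \(B\times C\times H\times W\) feature map; setting the target feature statistics from the style image is omitted here.

>>> import torch
>>> import encoding
>>> C = 256
>>> layer = encoding.nn.Inspiration(C, B=1)
>>> X = torch.randn(1, C, 32, 32)
>>> Y = layer(X)  # under this assumption, Y has the same shape as X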

UpsampleConv2d

class encoding.nn.UpsampleConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, scale_factor=1, bias=True)[source]

To avoid the checkerboard artifacts of the standard fractionally-strided convolution, we adopt an integer-stride convolution but produce a \(2\times 2\) output for each convolutional window.

[Figure: illustration of the upsampled convolution]
Reference:
Hang Zhang and Kristin Dana. “Multi-style Generative Network for Real-time Transfer.” arXiv preprint arXiv:1703.06953 (2017)
Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution. Default: 1
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
  • output_padding (int or tuple, optional) – Zero-padding added to one side of the output. Default: 0
  • groups (int, optional) – Number of blocked connections from input channels to output channels. Default: 1
  • bias (bool, optional) – If True, adds a learnable bias to the output. Default: True
  • dilation (int or tuple, optional) – Spacing between kernel elements. Default: 1
  • scale_factor (int) – scaling factor for upsampling convolution. Default: 1
Shape:
  • Input: \((N, C_{in}, H_{in}, W_{in})\)
  • Output: \((N, C_{out}, H_{out}, W_{out})\) where \(H_{out} = scale * (H_{in} - 1) * stride[0] - 2 * padding[0] + kernel\_size[0] + output\_padding[0]\) \(W_{out} = scale * (W_{in} - 1) * stride[1] - 2 * padding[1] + kernel\_size[1] + output\_padding[1]\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (in_channels, scale * scale * out_channels, kernel_size[0], kernel_size[1])
  • bias (Tensor) – the learnable bias of the module of shape (scale * scale * out_channels)

Examples

>>> from encoding import nn
>>> import torch
>>> from torch import autograd
>>> # With square kernels and equal stride
>>> m = nn.UpsampleConv2d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.UpsampleConv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 100))
>>> output = m(input)
>>> # exact output size can be also specified as an argument
>>> input = autograd.Variable(torch.randn(1, 16, 12, 12))
>>> downsample = nn.Conv2d(16, 16, 3, stride=2, padding=1)
>>> upsample = nn.UpsampleConv2d(16, 16, 3, stride=2, padding=1)
>>> h = downsample(input)
>>> h.size()
torch.Size([1, 16, 6, 6])
>>> output = upsample(h, output_size=input.size())
>>> output.size()
torch.Size([1, 16, 12, 12])
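
The underlying idea can be sketched with plain tensor operations: an integer-stride convolution that emits scale * scale times the output channels, followed by a pixel-shuffle rearrangement into the higher-resolution map. This is an illustration of the technique, not the module's internals, and it uses the standard F.conv2d weight layout (output channels first) rather than the weight shape listed above.

>>> import torch
>>> import torch.nn.functional as F
>>> scale, in_ch, out_ch, k = 2, 16, 8, 3
>>> weight = torch.randn(scale * scale * out_ch, in_ch, k, k)
>>> bias = torch.randn(scale * scale * out_ch)
>>> x = torch.randn(1, in_ch, 12, 12)
>>> y = F.conv2d(x, weight, bias, stride=1, padding=1)  # (1, scale*scale*out_ch, 12, 12)
>>> y = F.pixel_shuffle(y, scale)                       # (1, out_ch, 24, 24)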

DilatedAvgPool2d

class encoding.nn.DilatedAvgPool2d(kernel_size, stride=None, padding=0, dilation=1)[source]

We provide dilated average pooling for the dilated version of DenseNet, as in encoding.dilated.DenseNet.

Reference:

Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, Amit Agrawal. “Context Encoding for Semantic Segmentation.” CVPR 2018

Applies a 2D average pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((B, C, H, W)\), output \((B, C, H_{out}, W_{out})\), kernel_size \((k_H,k_W)\), stride \((s_H,s_W)\), and dilation \((d_H,d_W)\) can be precisely described as:

\[\begin{array}{ll} out(b, c, h, w) = 1 / (k_H \cdot k_W) \cdot \sum_{{m}=0}^{k_H-1} \sum_{{n}=0}^{k_W-1} input(b, c, s_H \cdot h + d_H \cdot m, s_W \cdot w + d_W \cdot n) \end{array}\]
If padding is non-zero, the input is implicitly zero-padded on both sides by padding points.
The parameters kernel_size, stride, padding, and dilation can each be:
  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • dilation – the dilation parameter similar to Conv2d
Shape:
  • Input: \((B, C, H_{in}, W_{in})\)
  • Output: \((B, C, H_{out}, W_{out})\), where \(H_{out} = floor((H_{in} + 2 * padding[0] - kernel\_size[0]) / stride[0] + 1)\) and \(W_{out} = floor((W_{in} + 2 * padding[1] - kernel\_size[1]) / stride[1] + 1)\). For stride=1, the output feature map has the same spatial size as the input.

Examples:

>>> from encoding import nn
>>> import torch
>>> from torch import autograd
>>> # pool of square window of size=3, stride=2, dilation=2
>>> m = nn.DilatedAvgPool2d(3, stride=2, dilation=2)
>>> input = autograd.Variable(torch.randn(20, 16, 50, 32))
>>> output = m(input)
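
The pooling formula above can also be reproduced with plain tensor operations; the following is a reference sketch using F.unfold, an illustration of the computation rather than the layer's own kernel.

>>> import torch
>>> import torch.nn.functional as F
>>> B, C, H, W = 1, 2, 8, 8
>>> k, s, d = 3, 2, 2
>>> x = torch.randn(B, C, H, W)
>>> cols = F.unfold(x, kernel_size=k, dilation=d, padding=0, stride=s)  # (B, C*k*k, L)
>>> out = cols.view(B, C, k * k, -1).mean(dim=2)                        # average each window
>>> H_out = (H - d * (k - 1) - 1) // s + 1                              # window positions along H
>>> out = out.view(B, C, H_out, -1)                                     # (B, C, H_out, W_out)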

Functions

aggregate

encoding.functions.aggregate(A, X, C)[source]

Aggregate operation: aggregates the residuals of the inputs (\(X\)) with respect to the codewords (\(C\)), using the assignment weights (\(A\)).

\[e_{k} = \sum_{i=1}^{N} a_{ik} (x_i - d_k)\]
Shape:
  • Input: \(A\in\mathcal{R}^{B\times N\times K}\), \(X\in\mathcal{R}^{B\times N\times D}\), \(C\in\mathcal{R}^{K\times D}\) (where \(B\) is the batch size, \(N\) is the total number of features, \(K\) is the number of codewords, and \(D\) is the feature dimension.)
  • Output: \(E\in\mathcal{R}^{B\times K\times D}\)

Examples

>>> import torch
>>> import encoding
>>> from torch.autograd import Variable
>>> B,N,K,D = 2,3,4,5
>>> A = Variable(torch.cuda.DoubleTensor(B,N,K).uniform_(-0.5,0.5), requires_grad=True)
>>> X = Variable(torch.cuda.DoubleTensor(B,N,D).uniform_(-0.5,0.5), requires_grad=True)
>>> C = Variable(torch.cuda.DoubleTensor(K,D).uniform_(-0.5,0.5), requires_grad=True)
>>> E = encoding.functions.aggregate(A, X, C)
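
The result can be checked against the formula with plain tensor operations (a reference sketch, not the library's kernel):

>>> R = X.unsqueeze(2) - C.unsqueeze(0).unsqueeze(0)  # (B, N, K, D) residuals x_i - d_k
>>> E_ref = (A.unsqueeze(3) * R).sum(1)               # (B, K, D), expected to match E above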

dilatedavgpool2d

encoding.functions.dilatedavgpool2d(input, kernel_size, stride=None, padding=0, dilation=1)[source]

Dilated Average Pool 2d, for dilation of DenseNet.

Reference:

Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, Amit Agrawal. “Context Encoding for Semantic Segmentation.” CVPR 2018

Applies a 2D average-pooling operation in kh x kw regions, with step size dh x dw. The number of output features is equal to the number of input planes.

See DilatedAvgPool2d for details and output shape.

Parameters:
  • input – input tensor (minibatch x in_channels x iH x iW)
  • kernel_size – size of the pooling region, a single number or a tuple (kh x kw)
  • stride – stride of the pooling operation, a single number or a tuple (sh x sw). Default is equal to kernel size
  • padding – implicit zero padding on the input, a single number or a tuple (padh x padw). Default: 0
  • dilation – the dilation parameter similar to Conv2d
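
Examples

A usage sketch, assuming the call follows the signature above:

>>> import torch
>>> import encoding
>>> input = torch.randn(20, 16, 50, 32)
>>> output = encoding.functions.dilatedavgpool2d(input, 3, stride=2, dilation=2)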