# Simultaneous Edge Alignment and Learning [Zhiding Yu+, ECCV, 2018]
## Abstract

Edge detection is among the most fundamental vision problems for its role in perceptual grouping and its wide applications.
Recent advances in representation learning have led to considerable improvements in this area. Many state-of-the-art edge detection models are learned with fully convolutional networks (FCNs).

However, FCN-based edge learning tends to be vulnerable to misaligned labels due to the delicate structure of edges.
While this problem has been considered in evaluation benchmarks, a similar issue has not been explicitly addressed in general edge learning.

In this paper, we show that label misalignment can cause considerably degraded edge learning quality, and address this issue by proposing a simultaneous edge alignment and learning framework.
To this end, we formulate a probabilistic model where edge alignment is treated as latent variable optimization, and is learned end-to-end during network training.
Experiments show several applications of this work, including improved edge detection with state-of-the-art performance and automatic refinement of noisy annotations.

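## Summary
Recent advances in representation learning have led to considerable improvements in this area. Many state-of-the-art edge detection models are learned with fully convolutional networks (FCNs).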



# 1 Introduction
Early edge detection methods often formulate the task as a low-level or mid-level grouping problem where Gestalt laws and perceptual grouping play considerable roles in algorithm design [23,7,44,16].
Later works started to consider learning edges in a data-driven way, by looking into the statistics of features near boundaries [25,34,12,39,1,2,31,13].
More recently, advances in deep representation learning [26,43,18] have further led to significant improvements in edge detection, pushing state-of-the-art performance [49,20,3,24,50] to new levels.
The associated tasks have also expanded from the conventional binary edge detection problem to the more challenging category-aware edge detection problem [38,17,4,22,52].
As a result of such advancement, a wide variety of other vision problems have enjoyed the benefits of reliable edge detectors.
Examples of these applications include, but are not limited to, (semantic) segmentation [1,51,9,4,5],
object proposal generation [53,4,50], object detection [29], depth estimation [32,19], and 3D vision [33,21,42].
With the strong representation abilities of deep networks and the dense labeling nature of edge detection, many state-of-the-art edge detectors are based on FCNs.
Despite the underlying resemblance to other dense labeling tasks, edge learning problems face some typical challenges and issues.
First, in light of the highly imbalanced amounts of positive samples (edge pixels) and negative samples (non-edge pixels),
using reweighted losses where positive samples are weighted higher has become a predominant choice in recent deep edge learning frameworks [49,24,22,30,52].
While such a strategy to some extent yields better learning behavior, it also induces thicker detected edges as well as more false positives.
An example of this issue is illustrated in Fig. 1(c) and Fig. 1(g), where the edge maps predicted by CASENet [52] contain thick object boundaries.
A direct consequence is that many local details are missing, which is not favored for other potential applications using edge detectors.
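
The effect of reweighting described above can be illustrated with a minimal sketch. It assumes a class-balanced binary cross-entropy in which positives are weighted by the fraction of non-edge pixels (a common choice in deep edge detectors, assumed here rather than taken from these notes). The function name `balanced_bce` and the toy 10-pixel row are hypothetical.

```python
import math

def balanced_bce(probs, labels, reweight=True, eps=1e-7):
    """Binary cross-entropy averaged over a flat list of pixels.

    probs    : predicted edge probabilities in (0, 1)
    labels   : 0/1 edge labels
    reweight : if True, weight positives by beta = fraction of non-edge
               pixels and negatives by 1 - beta (assumed scheme)
    """
    n = len(labels)
    beta = (n - sum(labels)) / n
    w_pos, w_neg = (beta, 1.0 - beta) if reweight else (1.0, 1.0)
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            loss -= w_pos * math.log(p)
        else:
            loss -= w_neg * math.log(1.0 - p)
    return loss / n

# One true edge pixel at index 5 in a row of 10 pixels.
labels = [0] * 5 + [1] + [0] * 4
sharp = [0.1] * 5 + [0.9] + [0.1] * 4   # fires only on the true edge
thick = [0.1] * 4 + [0.9] * 3 + [0.1] * 3  # also fires on both neighbours

# Extra loss paid for the two false positives next to the edge.
gap_weighted = balanced_bce(thick, labels) - balanced_bce(sharp, labels)
gap_plain = (balanced_bce(thick, labels, reweight=False)
             - balanced_bce(sharp, labels, reweight=False))
print(gap_weighted, gap_plain)
```

In this toy case the penalty for the thick prediction is ten times smaller under the reweighted loss than under the unweighted one, which is consistent with the thicker edges discussed above.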
Another challenging issue for edge learning is the training label noise caused by inevitable misalignment during annotation.
Unlike segmentation, edge learning is generally more vulnerable to such noise due to the fact that edge structures by nature are much more delicate than regions.
Even slight misalignment can lead to a significant proportion of mismatches between ground truth and prediction.
In order to predict sharp edges, a model should learn to distinguish the few true edge pixels while suppressing edge responses near them.
This already presents a considerable challenge to the model as non-edge pixels near edges are likely to be hard negatives with similar features,
while the presence of misalignment further causes significant confusion by continuously sending false positives during training.
The problem is further aggravated under reweighted losses, where predicting more false positives near the edge is an effective way to decrease the loss because of the significantly higher weights of positive samples.
Unfortunately, completely eliminating misalignment during annotation is almost impossible given the limits of human precision and the diminishing gains in annotation quality from additional effort.
For datasets such as Cityscapes [11] where high quality labels are generated by professional annotators, misalignment can still be frequently observed.
For datasets with crowdsourcing annotations where quality control presents another challenge, the issue can become even more severe.
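
To make the sensitivity of edges to misalignment concrete, consider a toy example (all names hypothetical): a one-pixel-wide vertical edge and the region it bounds are each shifted by one pixel, and the fraction of exactly coinciding pixels is measured.

```python
def make_edge(size, col):
    """Binary map with a one-pixel-wide vertical edge at column `col`."""
    return [[1 if c == col else 0 for c in range(size)] for _ in range(size)]

def make_region(size, col):
    """Binary map of the region to the left of (and including) column `col`."""
    return [[1 if c <= col else 0 for c in range(size)] for _ in range(size)]

def shift_right(mask, dx):
    """Shift a binary map right by dx pixels, padding with zeros on the left."""
    return [[0] * dx + row[:len(row) - dx] for row in mask]

def overlap_fraction(a, b):
    """Fraction of positive pixels in `a` that are also positive in `b`."""
    hits = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(x for ra in a for x in ra)
    return hits / total

edge = make_edge(8, 3)
region = make_region(8, 3)

edge_overlap = overlap_fraction(edge, shift_right(edge, 1))
region_overlap = overlap_fraction(region, shift_right(region, 1))
print(edge_overlap, region_overlap)
```

With a one-pixel shift, the edge map retains no exact overlap with its shifted copy, while the region retains three quarters of its pixels. This is only an illustration of the argument, not an experiment from the paper.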
Our proposed solution is an end-to-end framework towards Simultaneous Edge Alignment and Learning (SEAL).
In particular, we formulate the problem with a probabilistic model, treating edge labels as latent variables to be jointly learned during training.
We show that the optimization of latent edge labels can be transformed into a bipartite graph min-cost assignment problem, and present an end-to-end learning framework towards model training.
Fig. 2 shows some examples where the model gradually learns how to align noisy edge labels to more accurate positions along with edge learning.
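# Introduction
Object detection [29], depth estimation [32,19], 3D vision [33,21,42], and so on.
In Fig. 1(c) and Fig. 1(g), the edge maps predicted by CASENet [52] contain thick object boundaries.
Even for datasets such as Cityscapes [11], where high-quality labels are produced by professional annotators, misalignment can still be observed frequently.
As a result of the improved label quality and the better negative suppression obtained with unweighted losses, the proposed framework achieves state-of-the-art detection performance with high-quality sharp edges (see Fig. 1(d) and Fig. 1(h)).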

# 2 Related work
## 2.1 Boundary map correspondence
Our work is partly motivated by the early work of boundary evaluation using
precision-recall and F-measure [34].
To address misalignment between prediction and human ground truth,
[34] proposed to compute a one-to-one correspondence
for the subset of matchable edge pixels from both domains by solving a min-cost assignment problem.
However, [34] only considers the alignment between fixed boundary maps,
while our work addresses a more complicated learning problem
where edge alignment becomes part of the optimization with learnable inputs.
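
The one-to-one correspondence idea can be sketched on a tiny example. The code below is a brute-force min-cost assignment over all permutations, which is only practical for a handful of pixels. It is not the solver used in [34] or in this work, and the function name `match_edges`, the distance tolerance, and the outlier cost are assumptions made for illustration.

```python
from itertools import permutations
from math import hypot

def match_edges(pred, gt, max_dist=2.0):
    """Brute-force one-to-one min-cost matching between two small pixel sets.

    pred, gt : lists of (row, col) edge pixels, with len(pred) <= len(gt)
    Pairs farther apart than max_dist are treated as unmatched and pay a
    fixed outlier cost (hypothetical choice for this sketch).
    Returns the matched (pred_pixel, gt_pixel) pairs.
    """
    if len(pred) > len(gt):
        raise ValueError("this sketch requires len(pred) <= len(gt)")
    outlier_cost = 2.0 * max_dist

    def dist(p, g):
        return hypot(p[0] - g[0], p[1] - g[1])

    def cost(p, g):
        d = dist(p, g)
        return d if d <= max_dist else outlier_cost

    best_total, best_perm = None, None
    for perm in permutations(range(len(gt)), len(pred)):
        total = sum(cost(p, gt[j]) for p, j in zip(pred, perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    # Keep only the pairs that are within the matching tolerance.
    return [(p, gt[j]) for p, j in zip(pred, best_perm)
            if dist(p, gt[j]) <= max_dist]

# Predicted edge is one column off from the annotation; one prediction is spurious.
pred = [(0, 0), (1, 0), (2, 0), (9, 9)]
gt = [(0, 1), (1, 1), (2, 1), (5, 5)]
matches = match_edges(pred, gt)
precision = len(matches) / len(pred)
print(matches, precision)
```

Here the three shifted pixels are matched to their annotated neighbours and the distant prediction stays unmatched, so a tolerance-aware precision can be computed from the matched subset. For realistic image sizes a dedicated assignment solver would be needed instead of enumeration.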
## 2.2 Mask refinement via energy minimization
Yang et al. [50] proposed to use dense-CRF to refine object masks and contours.
Despite the similar goal, our method differs from [50] in that:
1. The refinement framework in [50] is a separate preprocessing step, while our work jointly learns refinement with the model in an end-to-end fashion.
2. The CRF model in [50] only utilizes low-level features, while our model considers both low-level and high-level information via a deep network.
3. The refinement framework in [50] is segmentation-based, while our framework directly targets edge refinement.
## 2.3 Object contour and mask learning
A series of works [40,8,37] seek to learn object contours/masks in a supervised fashion.
Deep active contour [40] uses learned CNN features to steer contour evolution given the input of an initialized contour.
Polygon-RNN [8] introduced a semi-automatic approach for object mask annotation, by learning to extract polygons given input bounding boxes.
DeepMask [37] proposed an object proposal generation method to output class-agnostic segmentation masks.
These methods require accurate ground truth for contour/mask learning,
while this work only assumes noisy ground truths and seeks to refine them automatically.
## 2.4 Noisy label learning
Our work can be broadly viewed as a structured noisy label learning framework where we leverage abundant structural priors to correct label noise.
The existing noisy label learning literature has proposed directed graphical models [48], conditional random fields (CRF) [45], neural networks [46,47],
robust losses [35] and knowledge graphs [27] to model and correct image-level noisy labels.
In contrast, our work considers pixel-level labels instead of image-level ones.
## 2.5 Virtual evidence in Bayesian networks
Our work also shares similarity with virtual evidence [36,6,28], where the uncertainty of an observation is modeled by a distribution rather than a single value.
In our problem, noisy labels can be regarded as uncertain observations which give conditional prior distributions over different configurations of aligned labels.
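# 2 Related work
## 2.1 Boundary map correspondence
## 2.2 Mask refinement via energy minimization
1. The refinement framework in [50] is a separate preprocessing step, whereas our work learns refinement jointly with the model in an end-to-end fashion.
2. The CRF model in [50] uses only low-level features, whereas our model considers both low-level and high-level information via a deep network.
3. The refinement framework in [50] is segmentation-based, whereas our framework directly targets edge refinement.
## 2.3 Object contour and mask learning
Polygon-RNN [8] introduced a semi-automatic approach for object mask annotation by learning to extract polygons from input bounding boxes.
DeepMask [37] proposed an object proposal generation method that outputs class-agnostic segmentation masks.
These methods require accurate ground truth for contour/mask learning,
whereas this work assumes only noisy ground truth and seeks to refine it automatically.
## 2.4 Noisy label learning
Neural networks [46,47], robust losses [35], and knowledge graphs [27] have been proposed to model and correct noisy labels.
## 2.5 Virtual evidence in Bayesian networks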

# 3 A probabilistic view towards edge learning
In many classification problems, training of the models can be formulated as maximizing the following likelihood function with respect to the parameters:
> Equation (1)
where y, x and W denote the training labels, observed inputs and model parameters, respectively.
Depending on how the conditional probability is parameterized, the above likelihood function may correspond to different types of models.
For example, a generalized linear model leads to the well-known logistic regression.
If the parameterization is formed as a layered representation, the model may turn into CNNs or multilayer perceptrons.
One may observe that many traditional supervised edge learning models can also be regarded as special cases under the above probabilistic framework.
Here, we are mostly concerned with edge detection using fully convolutional neural networks.
In this case, the variable y indicates the set of edge prediction configurations at every pixel, while x and W denote the input image and the network parameters, respectively.
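
Equation (1) is only referenced by a placeholder in these notes. A form consistent with the surrounding definitions (a likelihood of the labels y given the inputs x, maximized over the parameters W) would be the following; this is a reconstruction under that assumption, and the paper's exact notation may differ.

```latex
\max_{W} \; \mathcal{L}(W) = P(y \mid x; W) \tag{1}
```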
# 3 A probabilistic view towards edge learning
> Equation (1)

# 4 Simultaneous edge alignment and learning
To introduce the ability to correct edge labels during training,
we consider the following model.
Instead of treating the observed annotation y as the fitting target,
we assume there is an underlying ground truth ŷ that is more accurate than y.
Our goal is to treat ŷ as a latent variable to be jointly estimated during learning,
which leads to the following likelihood maximization problem:
> Equation (2)
where ŷ indicates the underlying true ground truth. The former part P(y|ŷ)
can be regarded as an edge prior probabilistic model of an annotator generating
labels given the observed ground truths, while the latter part P(ŷ|x;W) is the
standard likelihood of the prediction model.
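
Equation (2) is likewise only a placeholder here. Given that the text describes a "former part" P(y|ŷ) and a "latter part" P(ŷ|x;W), and that both ŷ and W are estimated jointly, a consistent reconstruction would be the following. The intermediate joint term is an assumption about how the factorization is written.

```latex
\max_{\hat{y},\, W} \; \mathcal{L}(\hat{y}, W)
  = P(y, \hat{y} \mid x; W)
  = P(y \mid \hat{y}) \, P(\hat{y} \mid x; W) \tag{2}
```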
## 4.1 Multilabel edge learning
Consider the multilabel edge learning setting, where the labels in y need not be mutually exclusive at each pixel.
In other words, any pixel may correspond to the edges of multiple classes.
The likelihood can be decomposed into a set of class-wise joint probabilities, assuming inter-class independence:
> Equation (3)
where y^k ∈ {0, 1}^N indicates the set of binary labels corresponding to the k-th class.
A typical multilabel edge learning example that also assumes inter-class independence is CASENet [52].
In addition, binary edge detection methods such as HED [49] can be viewed as special cases of multilabel edge learning.
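
For reference, a class-wise decomposition consistent with the description of Equation (3) would apply the two-factor likelihood of Section 4 to each class k separately and take the product. This is a reconstruction from the surrounding text, so the index conventions are assumptions.

```latex
\mathcal{L}(\hat{y}, W)
  = \prod_{k} P(y^{k} \mid \hat{y}^{k}) \, P(\hat{y}^{k} \mid x; W) \tag{3}
```

Under this form, binary edge detection corresponds to the case of a single class.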
## 4.2 Edge prior model
## 4.3 Network likelihood model
## 4.4 Learning
## 4.5 Inference
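# 4 Simultaneous edge alignment and learning
We assume there is an underlying ground truth ŷ that is more accurate than y.
Our goal is to treat ŷ as a latent variable to be jointly estimated during learning.
> Equation (2)
Here, ŷ denotes the underlying true ground truth.

The former part P(y|ŷ) can be regarded as an edge prior probabilistic model of an annotator generating labels given the observed ground truths, while
the latter part P(ŷ|x;W) is the standard likelihood of the prediction model.
## 4.1 Multilabel edge learning
> Equation (3)
Here, y^k ∈ {0, 1}^N denotes the set of binary labels corresponding to the k-th class.
A typical multilabel edge learning example that also assumes inter-class independence is CASENet [52].
In addition, binary edge detection methods such as HED [49] can be viewed as special cases of multilabel edge learning.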
## 4.2 Edge prior model

