Zihao Wang (BNRist, Department of Computer Science and Technology, RIIT, Institute of Internet Industry, Tsinghua University); Datong Zhou (Department of Mathematical Sciences, Tsinghua University); Ming Yang (Department of Computer Science and Technology, Tsinghua University); Yong Zhang (BNRist, Department of Computer Science and Technology, RIIT, Institute of Internet Industry, Tsinghua University); Chenglong Bao (Yau Mathematical Sciences Center, Tsinghua University); Hao Wu (Department of Mathematical Sciences, Tsinghua University)
Computing the distance between linguistic objects is an essential problem in natural language processing. The word mover's distance (WMD) has been successfully applied to measure document distance by synthesizing low-level word similarity within the framework of optimal transport (OT). However, due to the global transportation nature of OT, the WMD may overestimate semantic dissimilarity when documents contain unequal semantic details. In this paper, we propose to address this overestimation issue with a novel Wasserstein-Fisher-Rao (WFR) document distance grounded in unbalanced optimal transport theory. Compared to the WMD, the WFR document distance provides a trade-off between global transportation and local truncation, which leads to a better similarity measure for unequal semantic details. Moreover, an efficient pruning strategy is specifically designed for the WFR document distance to facilitate top-k queries among a large number of documents. Extensive experimental results show that the WFR document distance achieves higher accuracy than the WMD and even its supervised variant S-WMD.
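A common numerical route to WFR-type costs is entropy-regularized unbalanced OT solved by generalized Sinkhorn scaling with KL marginal penalties. The sketch below is a minimal illustration of that idea, not the paper's exact formulation or its pruning strategy; `eps` and `rho` are illustrative hyper-parameters.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=1.0, n_iter=500):
    """Entropy-regularized unbalanced OT with KL marginal penalties.

    Generalized Sinkhorn scaling: the exponent fi < 1 softens the
    marginal constraints, allowing mass creation/destruction (the
    local-truncation behavior described for the WFR distance).
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = rho / (rho + eps)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    P = u[:, None] * K * v[None, :]   # transport plan
    return P, float((P * C).sum())
```

For identical word distributions with zero diagonal cost, the plan concentrates near the diagonal and the transport cost is close to zero, while mass is only approximately conserved due to the relaxed marginals.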
Existing domain adaptation methods aim at learning features that generalize across domains. These methods commonly require updating the source classifier to adapt to the target domain and do not properly handle the trade-off between the source domain and the target domain. In this work, instead of training a classifier to adapt to the target domain, we use a separable component called a data calibrator to help the fixed source classifier recover discrimination power in the target domain, while preserving the source domain's performance. When the difference between the two domains is small, the source classifier's representation is sufficient to perform well in the target domain, and our method outperforms GAN-based methods on digit datasets. Otherwise, the proposed method can leverage synthetic images generated by GANs to boost performance, achieving state-of-the-art results on digit datasets and driving-scene semantic segmentation. Our method also empirically suggests a potential connection between domain adaptation and adversarial attacks.
Code release is available at https://github.com/yeshaokai/Calibrator-Domain-Adaptation
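As a toy illustration of the calibrator idea (fix the classifier, adjust the inputs), the sketch below aligns target inputs to source statistics with a learned offset before feeding them to a frozen classifier. This first-moment matching is an assumption for illustration only; the paper's data calibrator is a learned neural component, not a mean shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "source classifier": fixed linear decision rule (illustrative).
w, b = np.array([2.0, -1.0]), 0.0
predict = lambda X: ((X @ w + b) > 0).astype(int)

# Source data; target = source under an unknown covariate shift.
Xs = rng.normal(size=(500, 2))
ys = predict(Xs)                     # classifier is ground truth on source
shift = np.array([1.5, -0.5])
Xt, yt = Xs + shift, ys              # target labels used only for evaluation

# Calibrator: an input transform mapping target statistics back to the
# source statistics, leaving the classifier's weights untouched.
offset = Xt.mean(axis=0) - Xs.mean(axis=0)
calibrate = lambda X: X - offset

acc_raw = (predict(Xt) == yt).mean()
acc_cal = (predict(calibrate(Xt)) == yt).mean()
```

The frozen classifier degrades on raw shifted inputs but recovers full accuracy on calibrated inputs, without any change to its weights.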
Linfeng Zhang (Tsinghua University; Institute for Interdisciplinary Information Core Technology); Muzhou Yu (Institute for Interdisciplinary Information Core Technology; Xi'an Jiaotong University); Tong Chen (Tsinghua University); Zuoqiang Shi (Tsinghua University); Chenglong Bao (Tsinghua University); Kaisheng Ma (Tsinghua University)
The training process is crucial for deploying networks in applications that place strict requirements on both accuracy and robustness. However, most existing approaches face a dilemma: model accuracy and robustness form an awkward trade-off, where improving one leads to a drop in the other. The challenge remains when we try to improve accuracy and robustness simultaneously. In this paper, we propose a novel training method that introduces auxiliary classifiers for training on corrupted samples, while clean samples are trained normally with the primary classifier. In the training stage, a novel distillation method named input-aware self distillation is proposed to facilitate the primary classifier in learning robust information from the auxiliary classifiers. Along with it, a new normalization method, selective batch normalization, is proposed to shield the model from the negative influence of corrupted images. At the end of the training period, an L2-norm penalty is applied to the weights of the primary and auxiliary classifiers so that their weights become asymptotically identical. In the inference stage, only the primary classifier is used, so no extra computation or storage is needed. Extensive experiments on CIFAR10, CIFAR100 and ImageNet show that noticeable improvements in both accuracy and robustness can be observed with the proposed auxiliary training. On average, auxiliary training achieves 2.21% accuracy and 21.64% robustness (measured by corruption error) improvements over traditional training methods on CIFAR100. Code has been released on GitHub.
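The selective-normalization idea (estimate batch statistics only from uncorrupted samples) can be sketched as follows. The exact formulation in the paper may differ, and `is_clean` is an assumed per-sample flag.

```python
import numpy as np

def selective_batch_norm(x, is_clean, eps=1e-5):
    # Normalize the full batch, but estimate mean/variance only from
    # the clean samples so corrupted inputs cannot skew the statistics.
    clean = x[is_clean]
    mu = clean.mean(axis=0)
    var = clean.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)
```

With statistics computed this way, the clean portion of the batch stays near zero mean and unit variance even when heavily corrupted samples share the batch.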
Zonghan Yang (Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University); Yang Liu (Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University); Chenglong Bao (Yau Mathematical Sciences Center, Tsinghua University); Zuoqiang Shi (Department of Mathematical Sciences, Tsinghua University)
Although ordinary differential equations (ODEs) provide insights for designing network architectures, their relationship with non-residual convolutional neural networks (CNNs) is still unclear. In this paper, we present a novel ODE model by adding a damping term. We show that the proposed model can recover both a ResNet and a CNN by adjusting an interpolation coefficient. Therefore, the damped ODE model provides a unified framework for interpreting residual and non-residual networks. Lyapunov analysis reveals the better stability of the proposed model, which yields improved robustness of the learned networks. Experiments on a number of image classification benchmarks show that the proposed model substantially improves the accuracy of ResNet and ResNeXt on inputs perturbed by both stochastic noise and adversarial attack methods. Moreover, loss landscape analysis demonstrates the improved robustness of our method along the attack direction.
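One plausible discretization of such an interpolation is a convex combination of the ResNet update and a plain feed-forward layer; the coefficient name `gamma` and the exact interpolation form below are assumptions for illustration, not the paper's equations.

```python
import numpy as np

def damped_block(x, f, gamma):
    # One step of the interpolated update:
    #   gamma = 0 -> ResNet step      x + f(x)
    #   gamma = 1 -> plain CNN layer  f(x)
    return (1.0 - gamma) * x + f(x)
```

Sliding `gamma` between 0 and 1 thus moves continuously between a residual and a non-residual layer, which is the unification the abstract describes.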
Jinxiu Liang (South China University of Technology, Guangzhou 510006, China); Yong Xu (South China University of Technology, Guangzhou 510006, China; Peng Cheng Laboratory, Shenzhen 510852, China); Chenglong Bao (Tsinghua University, Beijing 100084, China); Yuhui Quan (South China University of Technology, Guangzhou 510006, China); Hui Ji (National University of Singapore, Singapore 117543, Singapore)
The learning rate is arguably the most important hyper-parameter to tune when training a neural network. As manually setting the right learning rate remains a cumbersome process, adaptive learning rate algorithms aim to automate it. Motivated by the success of the Barzilai-Borwein (BB) step size in many gradient descent methods for solving convex problems, this paper investigates the potential of the BB method for training neural networks. With strong motivation from the related convergence analysis, the BB method is generalized to adapt the learning rate of mini-batch gradient descent. The experiments show that, in contrast to many existing methods, the proposed BB method is highly insensitive to the initial learning rate, especially in terms of generalization performance. The BB method also shows advantages in both learning speed and generalization performance over other available methods.
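The BB1 step size underlying this work can be sketched in a plain gradient-descent loop. This is the classical full-batch rule, not the paper's mini-batch generalization; `alpha0` is only a seed step for the first iteration.

```python
import numpy as np

def bb_gradient_descent(grad, x0, alpha0=0.1, n_iter=100):
    # Gradient descent with the Barzilai-Borwein (BB1) step size:
    #   alpha_k = (s^T s) / (s^T y),  s = x_k - x_{k-1},  y = g_k - g_{k-1}
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - alpha0 * g_prev          # one seed step with alpha0
    for _ in range(n_iter):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        sy = s @ y
        if abs(sy) < 1e-15:               # converged; avoid 0/0
            break
        alpha = (s @ s) / sy
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x
```

On a convex quadratic the step size is automatically bounded between the reciprocals of the largest and smallest curvatures, which is why the method is so insensitive to the initial learning rate.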
Tangjun Wang (Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China); Zehao Dou (Department of Statistics and Data Science, Yale University); Chenglong Bao (Yau Mathematical Sciences Center, Tsinghua University, Beijing 100084, China, and also with Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China); Zuoqiang Shi (Yau Mathematical Sciences Center, Tsinghua University, Beijing 100084, China, and also with Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China)
Diffusion, a fundamental internal mechanism emerging in many physical processes, describes the interaction among different objects. In many learning tasks with limited training samples, diffusion connects the labeled and unlabeled data points and is a critical component for achieving high classification accuracy. Many existing deep learning approaches directly impose the diffusion loss when training neural networks. In this work, inspired by convection-diffusion ordinary differential equations (ODEs), we propose a novel diffusion residual network (Diff-ResNet) that internally introduces diffusion into the architecture of neural networks. Under the structured-data assumption, we prove that the proposed diffusion block can increase the distance-diameter ratio, which improves the separability of inter-class points and reduces the distances among local intra-class points. Moreover, this property can be easily adopted by residual networks for constructing separating hyperplanes. Extensive experiments on synthetic binary classification, semi-supervised graph node classification and few-shot image classification across various datasets validate the effectiveness of the proposed method.
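A minimal sketch of a diffusion step of the kind described above: each point moves toward its weighted neighbors, shrinking intra-class spread. The weight matrix `W`, the step size `gamma`, and the standalone setting are illustrative assumptions; in the paper the block is embedded inside a trained residual network.

```python
import numpy as np

def diffusion_step(X, W, gamma=0.1):
    # One diffusion step: x_i <- x_i + gamma * sum_j W_ij (x_j - x_i),
    # pulling each point toward its (weighted) neighbors.
    deg = W.sum(axis=1, keepdims=True)
    return X + gamma * (W @ X - deg * X)
```

When `W` connects only intra-class points, one step contracts each class toward its mean while leaving class means untouched, which is exactly the separability mechanism the abstract claims.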
Wei Wan (School of Mathematics and Physics, North China Electric Power University); Yuejin Zhang (Department of Mathematical Sciences, Tsinghua University); Chenglong Bao (Yau Mathematical Sciences Center, Tsinghua University, and Yanqi Lake Beijing Institute of Mathematical Sciences and Applications); Bin Dong (Beijing International Center for Mathematical Research, Peking University, and Center for Machine Learning Research, Peking University); Zuoqiang Shi (Yau Mathematical Sciences Center, Tsinghua University, and Yanqi Lake Beijing Institute of Mathematical Sciences and Applications)
The dynamic formulation of optimal transport has attracted growing interest in scientific computing and machine learning, and its computation requires solving a PDE-constrained optimization problem. Classical approaches based on Eulerian discretization suffer from the curse of dimensionality, which arises from approximating the high-dimensional velocity field. In this work, we propose a deep learning based method to solve dynamic optimal transport in high-dimensional spaces. Our method contains three main ingredients: a carefully designed representation of the velocity field, the discretization of the PDE constraint along the characteristics, and the computation of high-dimensional integrals by the Monte Carlo method at each time step. Specifically, in the representation of the velocity field, we apply classical nodal basis functions in time and deep neural networks in the space domain with H1-norm regularization. This technique promotes the regularity of the velocity field in both time and space, so that the discretization along the characteristics remains stable during training. Extensive numerical examples have been conducted to test the proposed method. Compared to other optimal transport solvers, our method gives more accurate results in high-dimensional cases and scales very well with dimension. Finally, we extend our method to more complicated settings such as the crowd motion problem.
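The second and third ingredients (discretization along characteristics plus Monte Carlo integration) can be sketched for a given velocity field. Here `v` is an arbitrary callable rather than the paper's neural parametrization, and forward Euler is an assumed time integrator.

```python
import numpy as np

def transport_action(v, x0_samples, n_steps=50, T=1.0):
    # Push Monte Carlo samples of rho_0 along dx/dt = v(x, t) (forward
    # Euler along characteristics) and accumulate the kinetic action
    #   E[ integral_0^T |v(x(t), t)|^2 dt ].
    dt = T / n_steps
    x = x0_samples.copy()
    action = 0.0
    for k in range(n_steps):
        vel = v(x, k * dt)
        action += dt * np.mean(np.sum(vel ** 2, axis=1))
        x = x + dt * vel
    return x, action
```

Because only samples are pushed forward, no spatial grid is needed, which is what lets this Lagrangian viewpoint escape the curse of dimensionality of Eulerian discretizations.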
In this paper, we develop an alternating direction method of multipliers (ADMM) for training deep neural networks with sigmoid-type activation functions (called the sigmoid-ADMM pair), mainly motivated by the gradient-free nature of ADMM, which avoids the saturation of sigmoid-type activations, and by the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is no worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of deep sigmoid net training from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point at a rate of order O(1/k). Besides the sigmoid activation, such a convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for deep ReLU net training (called the ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to algorithmic hyperparameters, including the learning rate, initialization schemes and the pre-processing of the input data. Moreover, we find that, in approximating and learning simple but important functions, the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.
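The direction "sigmoid units can approximate ReLU well" can be illustrated numerically with the smooth surrogate x * sigmoid(k * x), whose sup-error shrinks like O(1/k) as the sharpness k grows. This one-term surrogate is an illustration only, not the paper's two-hidden-layer construction.

```python
import numpy as np

def sigmoid(z):
    # numerically stable logistic function
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def smooth_relu(x, k):
    # Sigmoid-based surrogate x * sigmoid(k * x): converges uniformly
    # to ReLU(x) = max(x, 0) with sup-error O(1/k) as k grows.
    return x * sigmoid(k * x)
```

Increasing `k` tightens the approximation everywhere, matching the intuition that smooth sigmoid-type units lose nothing in approximation power relative to ReLU.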
We propose a deep learning-based method for object detection in UAV-borne thermal images, which can observe scenes in both day and night. Compared with visible images, thermal images have lower requirements on illumination conditions, but they typically have blurred edges and low contrast. Using a boundary-aware salient object detection network, we extract saliency maps of the thermal images to improve their distinguishability. Thermal images are augmented with the corresponding saliency maps through channel replacement and pixel-level weighted fusion. Considering the limited computing power of UAV platforms, a lightweight combinational neural network, ComNet, is used as the core object detector. The YOLOv3 model trained on the original images is used as a benchmark and compared with the proposed method. In the experiments, we analyze the detection performance of the ComNet models with different image fusion schemes. The experimental results show that the average precisions (APs) for pedestrian and vehicle detection improve by 2%-5% compared with the benchmark without saliency map fusion and MobileNetv2. The detection speed increases by over 50%, while the model size is reduced by 58%. The results demonstrate that the proposed method offers a good compromise and has application potential in UAV-borne detection tasks.
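The two fusion schemes mentioned above (channel replacement and pixel-level weighted fusion) can be sketched as follows. The weight `alpha` and the choice of which channel to replace are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def fuse_saliency(thermal, saliency, alpha=0.7, mode="weighted"):
    # Augment an H x W x 3 thermal image with its H x W saliency map:
    #  - "weighted": pixel-level weighted fusion of image and saliency
    #  - "replace":  saliency replaces one image channel
    if mode == "weighted":
        return alpha * thermal + (1.0 - alpha) * saliency[..., None]
    out = thermal.copy()
    out[..., 2] = saliency        # hypothetical choice of channel
    return out
```

Either variant keeps the input tensor shape unchanged, so the fused images can be fed to the detector without modifying its architecture.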