I am a second-year DPhil student in the OxWaSP programme, supervised by Professor Dino Sejdinovic and Professor Yee Whye Teh. My research interests lie in kernel methods for the meta-learning setting. I am also interested in Gaussian processes and unsupervised/self-supervised learning.
Publications
2021
S. L. Chau, J. Ton, J. Gonzalez, Y. W. Teh, D. Sejdinovic, BayesIMP: Uncertainty Quantification for Causal Data Fusion, in Advances in Neural Information Processing Systems (NeurIPS), 2021.
While causal models are becoming one of the mainstays of machine learning, the problem of uncertainty quantification in causal inference remains challenging. In this paper, we study the causal data fusion problem, where datasets pertaining to multiple causal graphs are combined to estimate the average treatment effect of a target variable. As data arises from multiple sources and can vary in quality and quantity, principled uncertainty quantification becomes essential. To that end, we introduce Bayesian Interventional Mean Processes, a framework which combines ideas from probabilistic integration and kernel mean embeddings to represent interventional distributions in the reproducing kernel Hilbert space, while taking into account the uncertainty within each causal graph. To demonstrate the utility of our uncertainty estimation, we apply our method to the Causal Bayesian Optimisation task and show improvements over state-of-the-art methods.
@inproceedings{ChaTonGonTehSej2021,
title = {{BayesIMP: Uncertainty Quantification for Causal Data Fusion}},
author = {Chau, Siu Lun and Ton, Jean-Francois and Gonzalez, Javier and Teh, Yee Whye and Sejdinovic, Dino},
year = {2021},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}
}
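Kernel mean embeddings, on which BayesIMP builds, represent a distribution as the mean of its kernel features in an RKHS. As a toy illustration (this is the generic embedding/MMD machinery, not the BayesIMP algorithm itself), the sketch below embeds samples with a Gaussian kernel and compares distributions via the biased empirical MMD:

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    # Gaussian (RBF) kernel matrix k(x, y) = exp(-||x - y||^2 / (2 ell^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def mmd2(X, Y, ell=1.0):
    # Squared MMD = squared RKHS distance between the two empirical mean embeddings
    return rbf(X, X, ell).mean() - 2 * rbf(X, Y, ell).mean() + rbf(Y, Y, ell).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as X
Z = rng.normal(3.0, 1.0, size=(500, 1))   # shifted distribution

# Embeddings of same-distribution samples are close; shifted ones are far
print(mmd2(X, Y), mmd2(X, Z))
```

The embedding turns distributions into points in a Hilbert space, which is what lets BayesIMP-style methods manipulate interventional distributions with linear-algebraic operations.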
Z. Li, J. Ton, D. Oglic, D. Sejdinovic, Towards A Unified Analysis of Random Fourier Features, Journal of Machine Learning Research (JMLR), vol. 22, no. 108, 1–51, 2021.
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the learning risk convergence rate is problem specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency. Our empirical results illustrate the effectiveness of the proposed scheme relative to the standard random Fourier features method.
@article{LiTonOglSej2021,
author = {Li, Zhu and Ton, Jean-Francois and Oglic, Dino and Sejdinovic, Dino},
title = {{Towards A Unified Analysis of Random Fourier Features}},
journal = {Journal of Machine Learning Research (JMLR)},
volume = {22},
number = {108},
year = {2021},
pages = {1--51}
}
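The construction the paper analyses is the standard Rahimi–Recht estimator: sample frequencies from the kernel's spectral density and approximate the kernel by an inner product of cosine features. A minimal numpy sketch for a Gaussian kernel (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, ell = 200, 3, 5000, 1.0

X = rng.normal(size=(n, d))

# Exact Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 ell^2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * ell ** 2))

# Random Fourier features: frequencies ~ N(0, 1/ell^2) (the spectral
# density of the Gaussian kernel) plus uniform phases
W = rng.normal(scale=1.0 / ell, size=(d, m))
b = rng.uniform(0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ W + b)

K_hat = Z @ Z.T                      # rank-m approximation of K
err = np.abs(K - K_hat).max()
print("max abs error:", err)
```

The paper's question is how large m must be, as a function of the regularisation parameter and the effective degrees of freedom, for learning with Z in place of K to lose nothing statistically.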
J. Ton, L. Chan, Y. W. Teh, D. Sejdinovic, Noise Contrastive Meta Learning for Conditional Density Estimation using Kernel Mean Embeddings, in Artificial Intelligence and Statistics (AISTATS), 2021, PMLR 130:1099–1107.
Current meta-learning approaches focus on learning functional representations of relationships between variables, i.e. estimating conditional expectations in regression. In many applications, however, the conditional distributions cannot be meaningfully summarized solely by expectation (due to e.g. multimodality). We introduce a novel technique for meta-learning conditional densities, which combines neural representations and noise-contrastive estimation with the well-established literature on conditional mean embeddings into reproducing kernel Hilbert spaces. The method shows significant improvements over standard density estimation methods on synthetic and real-world data, by leveraging shared representations across multiple conditional density estimation tasks.
@inproceedings{TonChaTehSej2021,
author = {Ton, Jean-Francois and Chan, Lucian and Teh, Yee Whye and Sejdinovic, Dino},
title = {{Noise Contrastive Meta Learning for Conditional Density Estimation using Kernel Mean Embeddings}},
pages = {1099--1107},
year = {2021},
booktitle = {Artificial Intelligence and Statistics (AISTATS)}
}
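The conditional mean embedding estimator underlying this line of work has a simple finite-sample form: the embedding of P(Y|X=x) is a weighted combination of the training outputs, with weights β(x) = (K + nλI)⁻¹ k(x). The sketch below shows the standard CME/kernel-ridge machinery on a toy regression problem (not the paper's meta-learned model; length-scale and regulariser are illustrative):

```python
import numpy as np

def rbf(A, B, ell=0.5):
    # 1-D Gaussian kernel matrix
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * ell ** 2))

rng = np.random.default_rng(1)
n, lam = 300, 1e-3
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)

Kx = rbf(x, x)
# alpha = (K + n*lam*I)^{-1} y, so beta(x)^T y = k(x)^T alpha
alpha = np.linalg.solve(Kx + n * lam * np.eye(n), y)

def cond_mean(x_new):
    # CME estimate of E[Y | X = x_new]
    return rbf(x_new, x) @ alpha

xt = np.array([-1.0, 0.0, 1.5])
print(cond_mean(xt), np.sin(xt))   # close to the true conditional mean
```

Applying the same weights β(x) to kernel features of y, rather than to y itself, gives the full embedding of the conditional distribution, which the paper combines with noise-contrastive estimation to recover densities rather than just means.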
J. Ton, D. Sejdinovic, K. Fukumizu, Meta Learning for Causal Direction, in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, no. 11, 9897–9905.
The inaccessibility of controlled randomized trials due to inherent constraints in many fields of science has been a fundamental issue in causal inference. In this paper, we focus on distinguishing the cause from effect in the bivariate setting under limited observational data. Based on recent developments in meta learning as well as in causal inference, we introduce a novel generative model that allows distinguishing cause and effect in the small data setting. Using a learnt task variable that contains distributional information of each dataset, we propose an end-to-end algorithm that makes use of similar training datasets at test time. We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
@inproceedings{TonSejFuk2021,
author = {Ton, Jean-Francois and Sejdinovic, Dino and Fukumizu, Kenji},
title = {{Meta Learning for Causal Direction}},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {35},
number = {11},
pages = {9897--9905},
year = {2021}
}
2020
J. Xu, J. Ton, H. Kim, A. R. Kosiorek, Y. W. Teh, MetaFun: Meta-Learning with Iterative Functional Updates, in International Conference on Machine Learning (ICML), 2020.
We develop a functional encoder-decoder approach to supervised meta-learning, where labeled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. Furthermore, rather than directly producing the representation, we learn a neural update rule resembling functional gradient descent which iteratively improves the representation. The final representation is used to condition the decoder to make predictions on unlabeled data. Our approach is the first to demonstrate the success of encoder-decoder style meta-learning methods like conditional neural processes on large-scale few-shot classification benchmarks such as miniImageNet and tieredImageNet, where it achieves state-of-the-art performance.
@inproceedings{xu2019metafun,
title = {{MetaFun: Meta-Learning with Iterative Functional Updates}},
author = {Xu, Jin and Ton, Jean-Francois and Kim, Hyunjik and Kosiorek, Adam R and Teh, Yee Whye},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2020}
}
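The "functional gradient descent" flavour of the update rule can be caricatured with a kernelised toy: start from a zero representation and repeatedly step along the kernel-smoothed residual on the context points, i.e. functional gradient descent on squared loss. This is a sketch of the idea only, not the paper's learned neural update rule:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = np.linspace(-2, 2, n)
y = np.cos(2 * x)                        # target function on context points

# RBF kernel used to smooth pointwise residuals into a functional update
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)

r = np.zeros(n)                          # functional representation, r_0 = 0
lr = 0.05                                # small step keeps lr * lambda_max(K) < 2
errs = []
for t in range(200):
    r = r - lr * K @ (r - y)             # functional gradient step on squared loss
    errs.append(np.abs(r - y).mean())

print(errs[0], errs[-1])                 # error shrinks across iterations
```

MetaFun replaces the fixed residual rule with a learned local update function and a learned kernel/attention mechanism, but the iterate-and-smooth structure is the same.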
2019
J. Ton, L. Chan, Y. W. Teh, D. Sejdinovic, Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings, ArXiv e-prints:1906.02236, 2019.
Current meta-learning approaches focus on learning functional representations of relationships between variables, i.e. on estimating conditional expectations in regression. In many applications, however, we are faced with conditional distributions which cannot be meaningfully summarized using expectation only (due to e.g. multimodality). Hence, we consider the problem of conditional density estimation in the meta-learning setting. We introduce a novel technique for meta-learning which combines neural representation and noise-contrastive estimation with the established literature of conditional mean embeddings into reproducing kernel Hilbert spaces. The method is validated on synthetic and real-world problems, demonstrating the utility of sharing learned representations across multiple conditional density estimation tasks.
@unpublished{TonChaTehSej2019,
author = {Ton, Jean-Francois and Chan, Lucian and Teh, Yee Whye and Sejdinovic, Dino},
title = {{Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings}},
note = {ArXiv e-prints:1906.02236},
year = {2019}
}
Z. Li, J. Ton, D. Oglic, D. Sejdinovic, Towards A Unified Analysis of Random Fourier Features, in International Conference on Machine Learning (ICML), 2019, PMLR 97:3905–3914.
Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the expected risk convergence rate is problem specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency.
@inproceedings{LiTonOglSej2019,
author = {Li, Zhu and Ton, Jean-Francois and Oglic, Dino and Sejdinovic, Dino},
title = {{Towards A Unified Analysis of Random Fourier Features}},
booktitle = {International Conference on Machine Learning (ICML)},
pages = {3905--3914},
year = {2019}
}
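The ridge leverage scores used for data-dependent feature sampling can be computed empirically. The sketch below (an illustration of the quantity, not the paper's approximation scheme) also checks the identity that the scores sum to the effective degrees of freedom, tr(K(K + nλI)⁻¹):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m, lam = 150, 2, 100, 1e-2

X = rng.normal(size=(n, d))
W = rng.normal(size=(d, m))
b = rng.uniform(0, 2 * np.pi, m)
Z = np.sqrt(2.0 / m) * np.cos(X @ W + b)   # random Fourier feature matrix (n x m)

K = Z @ Z.T                                # approximate kernel matrix
G_inv = np.linalg.inv(K + n * lam * np.eye(n))

# Empirical ridge leverage score of feature j: z_j^T (K + n*lam*I)^{-1} z_j
scores = np.einsum('ij,ik,kj->j', Z, G_inv, Z)

d_eff = np.trace(K @ G_inv)                # effective degrees of freedom
print(scores.sum(), d_eff)                 # the two quantities coincide
```

Resampling features with probability proportional to these scores concentrates the budget on directions that matter for the regularised problem, which is why fewer features suffice.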
H. Chai, J. Ton, M. Osborne, R. Garnett, Automated Model Selection with Bayesian Quadrature, in International Conference on Machine Learning (ICML), 2019, PMLR 97:931–940.
We present a novel technique for tailoring Bayesian quadrature (BQ) to model selection. The state-of-the-art for comparing the evidence of multiple models relies on Monte Carlo methods, which converge slowly and are unreliable for computationally expensive models. Previous research has shown that BQ offers sample efficiency superior to Monte Carlo in computing the evidence of an individual model. However, applying BQ directly to model comparison may waste computation producing an overly-accurate estimate for the evidence of a clearly poor model. We propose an automated and efficient algorithm for computing the most-relevant quantity for model selection: the posterior probability of a model. Our technique maximizes the mutual information between this quantity and observations of the models’ likelihoods, yielding efficient acquisition of samples across disparate model spaces when likelihood observations are limited. Our method produces more-accurate model posterior estimates using fewer model likelihood evaluations than standard Bayesian quadrature and Monte Carlo estimators, as we demonstrate on synthetic and real-world examples.
@inproceedings{chai2019automated,
author = {Chai, Henry and Ton, Jean-Francois and Osborne, Michael A. and Garnett, Roman},
title = {{Automated Model Selection with Bayesian Quadrature}},
booktitle = {International Conference on Machine Learning (ICML)},
pages = {931--940},
year = {2019}
}
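Bayesian quadrature places a GP prior on the integrand, so the integral of the posterior mean is available in closed form: with an RBF kernel the kernel mean z_i = ∫ k(t, x_i) dt has an erf expression, and the BQ estimate is zᵀ(K + jitter·I)⁻¹y. A minimal 1-D sketch for the uniform measure on [0, 1] (kernel, grid, and jitter values are illustrative):

```python
import numpy as np
from scipy.special import erf

def k(a, b, ell):
    # 1-D Gaussian kernel matrix
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def kernel_mean(x, ell):
    # z_i = integral over [0, 1] of k(t, x_i) dt, in closed form via erf
    return ell * np.sqrt(np.pi / 2) * (
        erf((1 - x) / (np.sqrt(2) * ell)) + erf(x / (np.sqrt(2) * ell)))

f = lambda t: t ** 2                     # integrand; true integral on [0, 1] is 1/3
x = np.linspace(0, 1, 20)
y = f(x)
ell, jitter = 0.25, 1e-8

K = k(x, x, ell) + jitter * np.eye(len(x))
z = kernel_mean(x, ell)
estimate = z @ np.linalg.solve(K, y)     # BQ posterior mean of the integral
print(estimate)                          # close to 1/3
```

The paper's contribution sits on top of this primitive: rather than estimating each model's evidence to uniform accuracy, it allocates evaluations to maximise information about the model posterior itself.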
2018
J. Ton, S. Flaxman, D. Sejdinovic, S. Bhatt, Spatial Mapping with Gaussian Processes and Nonstationary Fourier Features, Spatial Statistics, vol. 28, 59–78, 2018.
The use of covariance kernels is ubiquitous in the field of spatial statistics. Kernels allow data to be mapped into high-dimensional feature spaces and can thus extend simple linear additive methods to nonlinear methods with higher order interactions. However, until recently, there has been a strong reliance on a limited class of stationary kernels such as the Matérn or squared exponential, limiting the expressiveness of these modelling approaches. Recent machine learning research has focused on spectral representations to model arbitrary stationary kernels and introduced more general representations that include classes of nonstationary kernels. In this paper, we exploit the connections between Fourier feature representations, Gaussian processes and neural networks to generalise previous approaches and develop a simple and efficient framework to learn arbitrarily complex nonstationary kernel functions directly from the data, while taking care to avoid overfitting using state-of-the-art methods from deep learning. We highlight the very broad array of kernel classes that could be created within this framework. We apply this to a time series dataset and a remote sensing problem involving land surface temperature in Eastern Africa. We show that without increasing the computational or storage complexity, nonstationary kernels can be used to improve generalisation performance and provide more interpretable results.
@article{TonFlaSejBha2018,
author = {Ton, Jean-Francois and Flaxman, Seth and Sejdinovic, Dino and Bhatt, Samir},
title = {{Spatial Mapping with Gaussian Processes and Nonstationary Fourier Features}},
journal = {Spatial Statistics},
year = {2018},
volume = {28},
pages = {59--78}
}
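One simple way to obtain a valid nonstationary kernel from Fourier features, in the spirit of this paper (the exact construction and the learning of the spectral measure there differ), is to pair two frequency draws per feature: the resulting Gram matrix is positive semi-definite by construction, and its diagonal varies with location, which a stationary kernel's cannot:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 100, 1, 500
X = rng.uniform(-3, 3, size=(n, d))

def feature_map(X, W1, W2):
    # Paired-frequency Fourier features: phi(x)^T phi(x') is PSD by
    # construction and nonstationary whenever W1 != W2
    c = np.cos(X @ W1) + np.cos(X @ W2)
    s = np.sin(X @ W1) + np.sin(X @ W2)
    return np.hstack([c, s]) / np.sqrt(4 * W1.shape[1])

W1 = rng.normal(scale=1.0, size=(d, m))
W2 = rng.normal(scale=3.0, size=(d, m))   # different spectral scale per pair

Phi = feature_map(X, W1, W2)
K = Phi @ Phi.T
eigs = np.linalg.eigvalsh(K)
print(eigs.min())                 # >= 0 up to round-off: a valid kernel
print(K.diagonal().std())         # varying diagonal: the kernel is nonstationary
```

Setting W2 = W1 recovers ordinary stationary random Fourier features; the paper's framework instead parameterises the frequency pairs with neural networks and learns them from data.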