Multiple instance learning (MIL) is a problem structure in which large sets of instances are grouped into bags (ensembles) and labels are provided only at the bag level. In a common formulation, the probability that a bag is positive is estimated as the maximum probability of being positive across its instances. Accurate bag label prediction therefore depends on effectively identifying the instances with the highest probability of being positive. These instances may be very sparse (i.e., the witness rate may be low). We exemplify our study by designing a data generation process mimicking an archetypal MIL problem: predicting a patient’s disease state based on immune receptor sequences from a large number of adaptive immune cells, where only a few contribute to the etiology of the disease. We thus consider cases where instances are short sequences and explore under which circumstances the sparsity of instance relevance for bag labels (low witness rate) corresponds to a sparsity of feature relevance for the prediction of bag labels. In such circumstances, we explore under which conditions these two forms of sparsity support the detection of positive instances and relevant features, and when such sparsity increases the effectiveness of strongly regularised LASSO models. To this end, we constructed a data-generating process where bags are composed of large sets of short sequences and systematically explored how strongly regularised logistic regression models performed across a range of data simulation parameterisations.
We conclude by quantitatively reporting for which MIL problem characteristics bag label prediction remains accurate even at low witness rates. Our approach and findings provide a robust guide to the limits of multiple instance learning problems with sparse data. In particular, we provide statistical support for experimental campaigns on immune receptors, highlighting the data requirements for achieving a desired accuracy in identifying the etiology of a given disease.
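The max-based bag formulation and the notion of a witness rate described above can be illustrated with a minimal sketch. It is not the paper's data-generating process, which the abstract does not specify. The planted motif `"CASSL"`, the function names, the bag size, and the logistic `weight` and `bias` values are all hypothetical choices made for illustration only.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def make_bag(rng, n_instances, seq_len, motif, witness_rate, positive):
    """Generate one bag of random short sequences.

    In a positive bag, a fraction `witness_rate` of the sequences
    (at least one) has the hypothetical `motif` planted at a random position.
    """
    letters = rng.choice(list(ALPHABET), size=(n_instances, seq_len))
    seqs = ["".join(row) for row in letters]
    if positive:
        n_witness = max(1, int(round(witness_rate * n_instances)))
        witness_idx = rng.choice(n_instances, size=n_witness, replace=False)
        for i in witness_idx:
            start = rng.integers(0, seq_len - len(motif) + 1)
            s = seqs[i]
            seqs[i] = s[:start] + motif + s[start + len(motif):]
    return seqs


def instance_probability(seq, motif, weight=6.0, bias=-3.0):
    """Toy instance model: logistic function of a single motif-presence feature.

    `weight` and `bias` are assumed values, not fitted parameters.
    """
    x = 1.0 if motif in seq else 0.0
    return 1.0 / (1.0 + np.exp(-(bias + weight * x)))


def bag_probability(instance_probs):
    """Bag-level probability as the maximum over instance probabilities."""
    return float(np.max(instance_probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motif = "CASSL"  # hypothetical disease-associated motif
    bag = make_bag(rng, n_instances=500, seq_len=12, motif=motif,
                   witness_rate=0.01, positive=True)
    probs = [instance_probability(s, motif) for s in bag]
    print("witness sequences:", sum(motif in s for s in bag))
    print("bag probability:", bag_probability(probs))
```

Because the bag score is a maximum, a single correctly identified witness is enough to flag a positive bag. This is why low witness rates make the problem hard: the model must find a few relevant sequences among many irrelevant ones.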
Published in: Machine Learning with Applications
Volume 21, Article 100679