Maximal Margin Labeling for Multi-Topic Text Categorization

Kazawa, Hideto; Izumitani, Tomonori; Taira, Hirotoshi; Maeda, Eisaku

Advances in Neural Information Processing Systems 17 (NIPS 2004)

Abstract

In this paper, we address the problem of statistical learning for multi-topic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines.

1 Multi-topic Text Categorization (MTC)

This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We therefore call a set of topics a label, and say that a text is assigned a label, not a topic.

In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic's relevance to the text separately. In this decomposition approach, the features specific to a topic, not to a label, are regarded as the important features. However, the approach may result in inefficient learning, as the following example shows.

Imagine an MTC problem on scientific papers in which quantum computing papers are assigned the multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since some words, such as “qbit”¹, are specific to quantum computing, an efficient MTC learner should use such words to assign the label QP & CS. However, the decomposition approach is likely to ignore these words: they are specific to only a small portion of all QP or CS papers (there are many more QP and CS papers than quantum computing papers), and are therefore not discriminative features for either topic QP or topic CS.

¹ Qbit is a unit of quantum information. It appears frequently in the quantum computing literature but is rarely seen in other literature.
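The QP & CS example can be made concrete with a small sketch. The corpus, its words, and the helper `word_rate` below are hypothetical and chosen only for illustration; they are not data or code from the paper. The sketch compares how often “qbit” occurs among the positive examples of topic QP (the decomposition view) and among the examples of the label QP & CS (the label-as-class view).

```python
# Hypothetical toy corpus: each document is (set of words, label),
# where a label is a frozenset of topics.
corpus = [
    ({"particle", "spin"}, frozenset({"QP"})),
    ({"particle", "field"}, frozenset({"QP"})),
    ({"spin", "field"}, frozenset({"QP"})),
    ({"compiler", "graph"}, frozenset({"CS"})),
    ({"graph", "cache"}, frozenset({"CS"})),
    ({"compiler", "cache"}, frozenset({"CS"})),
    ({"qbit", "spin", "graph"}, frozenset({"QP", "CS"})),
]


def word_rate(word, docs):
    """Return the fraction of documents in `docs` that contain `word`."""
    docs = list(docs)
    return sum(word in words for words in docs) / len(docs)


# Decomposition view: positives for topic QP are all documents
# whose label contains QP, whatever else the label contains.
qp_positives = [words for words, label in corpus if "QP" in label]

# Label-as-class view: positives for the class {QP, CS} are the
# documents whose label is exactly {QP, CS}.
qpcs_positives = [
    words for words, label in corpus if label == frozenset({"QP", "CS"})
]

rate_in_qp = word_rate("qbit", qp_positives)
rate_in_qpcs = word_rate("qbit", qpcs_positives)
print(rate_in_qp, rate_in_qpcs)
```

In this toy setting, “qbit” is present in only a quarter of the QP positives but in every QP & CS document, which is the kind of label-specific signal that a per-topic classifier may discount.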

Symbol                    Meaning
x (∈ R^d)                 A document vector
t_1, t_2, ..., t_l        Topics
T                         The set of all topics
L, λ (⊂ T)                A label
L[j]                      The binary representation of L: 1 if t_j ∈ L and 0 otherwise
Λ (= 2^T)                 The set of all possible labels
{(x_i, L_i)}_{i=1}^m      Training samples
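As a rough illustration of the notation above, the following sketch builds the binary representation L[j] of a label and enumerates the set of all possible labels Λ = 2^T. The topic names and the function names `binary_representation` and `all_labels` are assumptions made for this example and do not come from the paper.

```python
from itertools import combinations

# Hypothetical topic set T = {t_1, ..., t_l} with l = 3.
topics = ["QP", "CS", "Bio"]


def binary_representation(label, topics):
    """Return [L[1], ..., L[l]]: entry j is 1 if t_j is in the label, else 0."""
    return [1 if t in label else 0 for t in topics]


def all_labels(topics):
    """Enumerate Lambda = 2^T, i.e. every subset of the topic set."""
    return [
        frozenset(subset)
        for size in range(len(topics) + 1)
        for subset in combinations(topics, size)
    ]


label = frozenset({"QP", "CS"})          # a label L, a subset of T
bits = binary_representation(label, topics)
Lambda = all_labels(topics)

print(bits)          # binary representation of L
print(len(Lambda))   # number of possible labels, 2 ** len(topics)
```

Because |Λ| = 2^l, the number of classes doubles with every added topic, which is the source of the data sparseness mentioned in the abstract.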
