Fuxin Li ‡ †    Arridhana Ciptadi †    James M. Rehg †

† Georgia Institute of Technology    ‡ Oregon State University

Abstract

This paper revisits the classical multiple hypotheses tracking (MHT) algorithm in a tracking-by-detection framework. The success of MHT largely depends on the ability to maintain a small list of potential hypotheses, which can be facilitated with the accurate object detectors that are currently available. We demonstrate that a classical MHT implementation from the 1990s can come surprisingly close to the performance of state-of-the-art methods on standard benchmark datasets. In order to further utilize the strength of MHT in exploiting higher-order information, we introduce a method for training online appearance models for each track hypothesis. We show that appearance models can be learned efficiently via a regularized least squares framework, requiring only a few extra operations for each hypothesis branch. We obtain state-of-the-art results on popular tracking-by-detection datasets such as PETS and the recent MOT challenge.

1. Introduction

Multiple Hypotheses Tracking (MHT) is one of the earliest successful algorithms for visual tracking. Originally proposed in 1979 by Reid [36], it builds a tree of potential track hypotheses for each candidate target, thereby providing a systematic solution to the data association problem. The likelihood of each track is calculated and the most likely combination of tracks is selected. Importantly, MHT is ideally suited to exploiting higher-order information such as long-term motion and appearance models, since the entire track hypothesis can be considered when computing the likelihood. MHT has been popular in the radar target tracking community [6]. However, in visual tracking problems, it is generally considered to be slow and memory intensive, requiring many pruning tricks to be practical. While there was considerable interest in MHT in the vision community during the 90s, for the past 15 years it has not been a mainstream approach for tracking, and rarely appears as a baseline in tracking evaluations. MHT is in essence a breadth-first search algorithm, hence its performance strongly depends on the ability to prune branches in the search tree quickly and reliably, in order to keep the number of track hypotheses manageable. In the early work on MHT for visual tracking [12], target detectors were unreliable and motion models had limited utility, leading to high combinatoric growth of the search space and the need for efficient pruning methods.

This paper argues that the MHT approach is well-suited to the current visual tracking context. Modern advances in tracking-by-detection and the development of effective feature representations for object appearance have created new opportunities for the MHT method. First, we demonstrate that a modern formulation of a standard motion-based MHT approach gives comparable performance to state-of-the-art methods on popular tracking datasets. Second, and more importantly, we show that MHT can easily exploit high-order appearance information which has been difficult to incorporate into other tracking frameworks based on unary and pairwise energies. We present a novel MHT method which incorporates long-term appearance modeling, using features from deep convolutional neural networks [20, 16]. The appearance models are trained online for each track hypothesis on all detections from the entire history of the track. We utilize online regularized least squares [25] to achieve high efficiency. In our formulation, the computational cost of training the appearance models has little dependency on the number of hypothesis branches, making it extremely suitable for the MHT approach.

Our experimental results demonstrate that our scoring function, which combines motion and appearance, is highly effective in pruning the hypothesis space efficiently and accurately. Using our trained appearance model, we are able to cut the effective number of branches in each frame to about 50% of all branches (Sec. 5.1). This enables us to make less restrictive assumptions on motion and explore a larger space of hypotheses. This also makes MHT less sensitive to parameter choices and heuristics (Fig. 3). Experiments on the PETS and the recent MOT challenge illustrate the state-of-the-art performance of our approach.

‡ This work was conducted while the 2nd author was at Georgia Tech.


2. Related Work

Network flow-based methods [35, 4, 45, 10] have recently become a standard approach to visual multi-target tracking due to their computational efficiency and optimality. In recent years, efficient inference algorithms to find the globally optimal solution [45, 4] or approximate solutions [35] have been introduced. However, the benefits of flow-based approaches come with a costly restriction: the cost function can only contain unary and pairwise terms. Pairwise costs are very restrictive in representing motion and appearance. In particular, it is difficult to represent even a linear motion model with those terms. An alternative is to define pairwise costs between tracklets, i.e. short object tracks that can be computed reliably [26, 3, 18, 8]. Unfortunately, the availability of reliable tracklets cannot be guaranteed, and any mistakes propagate to the final solution. In Brendel et al. [8], data association for tracklets is solved using the Maximum Weighted Independent Set (MWIS) method. We also adopt MWIS, but follow the classical formulation in [34] and focus on the incorporation of appearance modeling.

Collins [11] showed mathematically that the multidimensional assignment problem is a more complete representation of the multi-target tracking problem than the network flow formulation. Unlike network flow, there is no limitation in the form of the cost function, even though finding an exact solution to the multidimensional assignment problem is intractable. Classical solutions to multidimensional assignment are MHT [36, 12, 17, 34] and Markov Chain Monte Carlo (MCMC) data association [19, 32]. While MCMC provides asymptotic guarantees, MHT has the potential to explore the solution space more thoroughly, but has traditionally been hindered by the exponential growth in the number of hypotheses and had to resort to aggressive pruning strategies, such as propagating only the M-best hypotheses [12]. We will show that this limitation can be addressed through discriminative appearance modeling.

Andriyenko [1] proposed a discrete-continuous optimization method to jointly solve trajectory estimation and data association. Trajectory estimation is solved by spline fitting and data association is solved via MRF inference. These two steps are alternated until convergence. Segal [37] proposed a related approach based on a message passing algorithm. These methods are similar to MHT in the sense that they directly optimize a global energy with no guarantees on solution quality. But in practice, MHT is more effective in identifying high quality solutions.

There have been a significant number of prior works that exploit appearance information to solve data association. In the network flow-based method, the pairwise terms can be weighted by offline trained appearance templates [38] or a simple distance metric between appearance features [45]. However, these methods have limited capability to model the complex appearance changes of a target. In [17], a simple fixed appearance model is incorporated into a standard MHT framework. In contrast, we show that MHT can be extended to include online learned discriminative appearance models for each track hypothesis. Online discriminative appearance modeling is a standard method for addressing appearance variation [39]. In tracklet association, several works [2, 42, 21, 22] train discriminative appearance models of tracklets in order to design a better affinity score function. However, these approaches still share the limitations of the tracklet approach. Other works [7, 40] train a classifier for each target and use the classification score for greedy data association or particle filtering. These methods only keep one online learned model for each target, while our method trains multiple online appearance models via multiple track hypotheses, which is more robust to model drift.

3. Multiple Hypotheses Tracking

We adopt a tracking-by-detection framework such that our observations are localized bounding boxes obtained from an object detection algorithm. Let $k$ denote the most recent frame and $M_k$ denote the number of object detections (i.e. observations) in that frame. For a given track, let $i_k$ denote the observation which is selected at frame $k$, where $i_k \in \{0, 1, \dots, M_k\}$. The observation sequence $i_1, i_2, \dots, i_k$ then defines a track hypothesis over $k$ frames. Note that the dummy assignment $i_t = 0$ represents the case of a missing observation (due to occlusion or a false negative).¹ Let the binary variable $z_{i_1 i_2 \dots i_k}$ denote whether or not a track hypothesis is selected in the final solution. A global hypothesis is a set of track hypotheses that are not in conflict, i.e. that do not share any measurements at any time.

A key strategy in MHT is to delay data association decisions by keeping multiple hypotheses active until data association ambiguities are resolved. MHT maintains multiple track trees, and each tree represents all of the hypotheses that originate from a single observation (Fig. 1c). At each frame, the track trees are updated from observations and each track in the tree is scored. The best set of non-conflicting tracks (the best global hypothesis) can then be found by solving a maximum weighted independent set problem (Fig. 2a). Afterwards, branches that deviate too much from the global hypothesis are pruned from the trees, and the algorithm proceeds to the next frame. In the rest of this section, we will describe the approach in more detail.

¹ For notational convenience, observation sequences can be assumed to be padded with zeros so that all track hypotheses can be treated as fixed length sequences, despite their varying starting and ending times.

[Figure 1 panels: (a) Tracks in Video Frames, (b) Gating, (c) Track Trees]

Figure 1. Illustration of MHT. (a) Track hypotheses after the gating test at time $k$. Only a subset of track hypotheses is visualized here for simplicity. (b) Example gating areas for two track hypotheses with different thresholds $d_{th}$. (c) The corresponding track trees. Each tree node is associated with an observation in (a).

[Figure 2 panels: (a) MWIS, (b) N-scan Pruning, (c) Remaining Track Hypotheses]

Figure 2. (a) An undirected graph for the example of Fig. 1 in which each track hypothesis is a node and an edge connects two tracks that are conflicting. The observations for each hypothesis in the last three frames are indicated. An example of the Maximum Weighted Independent Set (MWIS) is highlighted in blue. (b) An N-scan pruning example ($N = 2$). The branches in blue contain the global hypothesis at frame $k$. Pruning at $t = k - 2$ removes all branches that are far from the global hypothesis. (c) Track hypotheses after the pruning. The trajectories in blue represent the finalized measurement associations.

3.1. Track Tree Construction and Updating

A track tree encapsulates multiple hypotheses starting from a single observation. At each frame, a new track tree is constructed for each observation, representing the possibility that this observation corresponds to a new object entering the scene. Previously existing track trees are also updated with observations from the current frame. Each track hypothesis is extended by appending new observations located within its gating area as its children, with each new observation spawning a separate branch. We also always spawn a separate branch with a dummy observation, in order to account for missing detections.
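The tree update described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the `Node` class and the `extend` and `leaves` helpers are hypothetical names introduced here, and the gated observation indices are simply given as input rather than computed by a gating test.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DUMMY = 0  # observation index 0 marks a missing detection


@dataclass
class Node:
    """One node of a track tree: the observation selected at one frame."""
    frame: int
    obs: int  # 1..M_k for detections, 0 for the dummy observation
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def hypothesis(self) -> List[int]:
        """Observation sequence from the root of the tree down to this node."""
        seq, node = [], self
        while node is not None:
            seq.append(node.obs)
            node = node.parent
        return seq[::-1]


def extend(leaf: Node, frame: int, gated_obs: List[int]) -> List[Node]:
    """Spawn one child per gated observation, plus the dummy branch."""
    for obs in list(gated_obs) + [DUMMY]:
        leaf.children.append(Node(frame, obs, parent=leaf))
    return leaf.children


def leaves(root: Node) -> List[Node]:
    """All current leaves, i.e. all track hypotheses stored in this tree."""
    if not root.children:
        return [root]
    return [leaf for child in root.children for leaf in leaves(child)]


# A tree born from observation 2 in frame 1 (illustrative indices only).
root = Node(frame=1, obs=2)
extend(root, frame=2, gated_obs=[1, 3])      # two gated detections + dummy
for leaf in leaves(root):                    # leaves() is evaluated once here
    extend(leaf, frame=3, gated_obs=[4])     # one gated detection + dummy
hypotheses = [leaf.hypothesis() for leaf in leaves(root)]
```

After two updates the tree holds six hypotheses, which shows how quickly the branch count grows even with small gating areas and why the pruning steps of the method matter.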

3.2. Gating

Based on the motion estimates, a gating area is predicted for each track hypothesis which specifies where the next observation of the track is expected to appear. Let $x^l_k$ be the random variable that represents the likely location of the $l$th track at time $k$. The variable $x^l_k$ is assumed to be normally distributed with mean $\hat{x}^l_k$ and covariance $\Sigma^l_k$ determined by Kalman filtering. The decision whether to update a particular trajectory with a new observation $i_k$ is made based on the Mahalanobis distance $d^2$ between the observation location $y_{i_k}$ and the predicted location $\hat{x}^l_k$:

$$d^2 = (\hat{x}^l_k - y_{i_k})^\top (\Sigma^l_k)^{-1} (\hat{x}^l_k - y_{i_k}) \le d_{th}. \tag{1}$$

The distance threshold $d_{th}$ determines the size of the gating area (see Fig. 1b).
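As a small worked example of the gating test in Eq. (1), the sketch below evaluates the squared Mahalanobis distance for 2-D locations and compares two thresholds. The function names, the predicted state, and the detections are made up for illustration; in the method itself the mean and covariance come from the Kalman filter.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Mat2 = Tuple[Tuple[float, float], Tuple[float, float]]


def mahalanobis_sq(pred: Vec2, cov: Mat2, obs: Vec2) -> float:
    """Squared Mahalanobis distance d^2 of Eq. (1) for a 2-D location."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = pred[0] - obs[0], pred[1] - obs[1]
    # Closed-form inverse of the 2x2 covariance applied to the residual.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def gate(pred: Vec2, cov: Mat2, observations: Sequence[Vec2],
         d_th: float) -> List[int]:
    """Return the 1-based indices of observations inside the gating area."""
    return [i for i, y in enumerate(observations, start=1)
            if mahalanobis_sq(pred, cov, y) <= d_th]


# Hypothetical predicted location, covariance, and detections.
pred = (10.0, 5.0)
cov = ((4.0, 0.0), (0.0, 1.0))
detections = [(12.0, 5.0), (10.0, 9.0), (16.0, 6.0)]

small_gate = gate(pred, cov, detections, d_th=6.0)   # d^2 values: 1, 16, 10
large_gate = gate(pred, cov, detections, d_th=12.0)
```

With these numbers the smaller threshold admits only the first detection, while the larger one also admits the third, so doubling $d_{th}$ directly increases the number of branches spawned for this hypothesis.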

3.3. Track Scoring

Each track hypothesis is associated with a track score. The $l$th track's score at frame $k$ is defined as follows:

$$S^l(k) = w_{mot} S^l_{mot}(k) + w_{app} S^l_{app}(k) \tag{2}$$

where $S^l_{mot}(k)$ and $S^l_{app}(k)$ are the motion and appearance scores, and $w_{mot}$ and $w_{app}$ are the weights that control the contribution of the location measurement $y_{i_k}$ and the appearance measurement $X_{i_k}$ to the track score, respectively.

Following the original formulation [6], we use the log likelihood ratio (LLR) between the target hypothesis and the null hypothesis as the motion score. The target hypothesis assumes that the sequence of observations comes from the same target, and the null hypothesis assumes that the sequence of observations comes from the background. Then the $l$th track's motion score at time $k$ is defined as:

$$S^l_{mot}(k) = \ln \frac{p(y_{i_{1:k}} \mid i_{1:k} \subseteq T_l)}{p(y_{i_{1:k}} \mid i_{1:k} \subseteq \phi)} \tag{3}$$

where we use the notation $i_{1:k}$ for the sequence of observations $i_1, i_2, \dots, i_k$. We denote by $i_{1:k} \subseteq T_l$ the target hypothesis that the observation sequence comes from the $l$th track and we denote the null hypothesis by $i_{1:k} \subseteq \phi$. The likelihood factorizes as:

$$\frac{p(y_{i_{1:k}} \mid i_{1:k} \subseteq T_l)}{p(y_{i_{1:k}} \mid i_{1:k} \subseteq \phi)} = \frac{\prod_{t=1}^{k} p(y_{i_t} \mid y_{i_{1:t-1}}, i_{1:t} \subseteq T_l)}{\prod_{t=1}^{k} p(y_{i_t} \mid i_t \subseteq \phi)} \tag{4}$$

where we assume that measurements are conditionally independent under the null hypothesis.

The likelihood for each location measurement at time $t$ under the target hypothesis is assumed to be Gaussian. The mean $\hat{x}^l_t$ and the covariance $\Sigma^l_t$ are estimated by a Kalman filter for the measurements $y_{i_{1:t-1}}$. The likelihood under the null hypothesis is assumed to be uniform. The factored likelihood terms at time $t$ are then written as:

$$p(y_{i_t} \mid y_{i_{1:t-1}}, i_{1:t} \subseteq T_l) = \mathcal{N}(y_{i_t}; \hat{x}^l_t, \Sigma^l_t), \qquad p(y_{i_t} \mid i_t \subseteq \phi) = 1/V \tag{5}$$

where $V$ is the measurement space [6, 12], which is the image area or the area of the ground plane for 2.5D tracking.

The appearance track score is defined as:

$$S^l_{app}(k) = \ln \frac{p(X_{i_{1:k}} \mid i_{1:k} \subseteq T_l)}{p(X_{i_{1:k}} \mid i_{1:k} \subseteq \phi)} = \ln \frac{p(i_{1:k} \subseteq T_l \mid X_{i_{1:k}})}{p(i_{1:k} \subseteq \phi \mid X_{i_{1:k}})} \tag{6}$$

where we obtain the posterior LLR under the assumption of equal priors. The posterior ratio factorizes as:

$$\frac{p(i_{1:k} \subseteq T_l \mid X_{i_{1:k}})}{p(i_{1:k} \subseteq \phi \mid X_{i_{1:k}})} = \frac{\prod_{t=1}^{k} p(i_t \subseteq T_l \mid i_{1:t-1} \subseteq T_l, X_{i_{1:t}})}{\prod_{t=1}^{k} p(i_t \subseteq \phi \mid X_{i_t})} \tag{7}$$

where we utilize $\{i_{1:k} \subseteq T_l\} = \bigcup_{t=1}^{k} \{i_t \subseteq T_l\}$ for the factorization. We assume that $i_t \subseteq T_l$ is conditionally independent of future measurements $X_{i_{t+1:k}}$ and the $i_t \subseteq \phi$ hypotheses are independent given the current measurement $X_{i_t}$.

Each term in the factored posterior comes from the online learned classifier (Sec. 4) at time $t$. Given prior observations $i_{1:t-1}$, we define the posterior of the event that observation $i_t$ is in the $l$th track as:

$$p(i_t \subseteq T_l \mid i_{1:t-1} \subseteq T_l, X_{i_{1:t}}) = \frac{e^{F(X_{i_t})}}{e^{F(X_{i_t})} + e^{-F(X_{i_t})}} \tag{8}$$

where $F(\cdot)$ is the classification score for the appearance features $X_{i_t}$ and the classifier weights are learned from $X_{i_{1:t-1}}$. We utilize the constant probability $c_1$ for the posterior of the background (null) hypothesis:

$$p(i_t \subseteq \phi \mid X_{i_t}) = c_1. \tag{9}$$

The track score expresses whether a track hypothesis is more likely to be a true target ($S^l(k) > 0$) or false alarm ($S^l(k) < 0$). The score can be computed recursively [6]:

$$S^l(k) = S^l(k-1) + \Delta S^l(k), \tag{10}$$

$$\Delta S^l(k) = \begin{cases} \ln \dfrac{1 - P_D}{1 - P_{FA}} \approx \ln(1 - P_D), & \text{if } i_k = 0 \\ w_{mot} \Delta S^l_{mot}(k) + w_{app} \Delta S^l_{app}(k), & \text{otherwise} \end{cases} \tag{11}$$

where $P_D$ and $P_{FA}$ (assumed to be very small) are the probabilities of detection and false alarm, respectively. $\Delta S^l_{mot}(k)$ and $\Delta S^l_{app}(k)$ are the increments of the track motion score and the track appearance score at time $k$ and are calculated using Eqs. (5), (8), and (9) as:

$$\Delta S^l_{mot}(k) = \ln \frac{V}{2\pi} - \frac{1}{2} \ln |\Sigma^l_k| - \frac{d^2}{2}, \qquad \Delta S^l_{app}(k) = -\ln\left(1 + e^{-2F(X_{i_k})}\right) - \ln c_1. \tag{12}$$

The score update continues as long as the track hypothesis is updated with detections. A track hypothesis which is assigned dummy observations for $N_{miss}$ consecutive frames is deleted from the hypothesis space.

3.4. Global Hypothesis Formation

Given the set of trees that contains all trajectory hypotheses for all targets, we want to determine the most likely combination of object tracks at frame $k$. This can be formulated as the following $k$-dimensional assignment problem:

$$\begin{aligned} \max_{z} \quad & \sum_{i_1=0}^{M_1} \sum_{i_2=0}^{M_2} \cdots \sum_{i_k=0}^{M_k} s_{i_1 i_2 \dots i_k} z_{i_1 i_2 \dots i_k} \\ \text{subject to} \quad & \sum_{i_1=0}^{M_1} \cdots \sum_{i_{u-1}=0}^{M_{u-1}} \sum_{i_{u+1}=0}^{M_{u+1}} \cdots \sum_{i_k=0}^{M_k} z_{i_1 i_2 \dots i_u \dots i_k} = 1 \\ & \text{for } i_u = 1, 2, \dots, M_u \text{ and } u = 1, 2, \dots, k \end{aligned} \tag{13}$$

where we have one constraint for each observation $i_u$, which ensures that it is assigned to a unique track. Each track is associated with its binary variable $z_{i_1 i_2 \dots i_k}$ and track score $s_{i_1 i_2 \dots i_k}$ which is calculated by Eq. (2). Thus, the objective function in Eq. (13) represents the total score of the tracks in the global hypothesis. This optimization problem is known to be NP-hard when $k$ is greater than 2.

Following [34], the task of finding the most likely set of tracks can be formulated as a Maximum Weighted Independent Set (MWIS) problem. This problem was shown in [34] to be equivalent to the multidimensional assignment problem (13) in the context of MHT. An undirected graph $G = (V, E)$ is constructed by assigning each track hypothesis $T_l$ to a graph vertex $x_l \in V$ (see Fig. 2a). Note that the number of track hypotheses needs to be controlled by track pruning (Sec. 3.5) at every frame in order to avoid the exponential growth of the graph size. Each vertex has a weight $w_l$ that corresponds to its track score $S^l(k)$. An edge $(l, j) \in E$ connects two vertices $x_l$ and $x_j$ if the two tracks cannot co-exist due to shared observations at any frame. An independent set is a set of vertices with no edges in common. Thus, finding the maximum weight independent set is equivalent to finding the set of compatible tracks that maximizes the total track score. This leads to the following discrete optimization problem:

$$\max_{x} \sum_{l} w_l x_l \quad \text{s.t.} \quad x_l + x_j \le 1, \ \forall (l, j) \in E, \quad x_l \in \{0, 1\}. \tag{14}$$

We utilize either an exact algorithm [33] or an approximate algorithm [9] to solve the MWIS optimization problem, depending on its hardness (as determined by the number of nodes and the graph density).
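To make the MWIS formulation of Eq. (14) concrete, the sketch below builds the conflict graph for a handful of zero-padded track hypotheses and finds the best independent set by exhaustive enumeration. The brute-force search is a stand-in chosen for clarity, not the exact [33] or approximate [9] solvers used in the paper, and the tracks, scores, and function names are invented for the example.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def conflicts(a: Sequence[int], b: Sequence[int]) -> bool:
    """Two tracks conflict if they share a real observation in any frame."""
    return any(x == y and x != 0 for x, y in zip(a, b))


def solve_mwis(tracks: Sequence[Sequence[int]],
               weights: Sequence[float]) -> Tuple[List[int], float]:
    """Exhaustive MWIS over the conflict graph (only viable for tiny graphs)."""
    n = len(tracks)
    edges = {(l, j) for l, j in combinations(range(n), 2)
             if conflicts(tracks[l], tracks[j])}
    best, best_weight = [], 0.0
    for mask in range(1 << n):
        chosen = [l for l in range(n) if (mask >> l) & 1]
        # Skip vertex sets that violate x_l + x_j <= 1 for some edge.
        if any(pair in edges for pair in combinations(chosen, 2)):
            continue
        total = sum(weights[l] for l in chosen)
        if total > best_weight:
            best, best_weight = chosen, total
    return best, best_weight


# Observation sequences over three frames; 0 is the dummy observation.
tracks = [(1, 1, 1), (1, 2, 2), (2, 2, 0), (2, 3, 2), (0, 3, 3)]
scores = [5.0, 4.0, 3.5, 3.0, 2.0]
global_hypothesis, total_score = solve_mwis(tracks, scores)
```

Here tracks 0, 2, and 4 share no detection in any frame and together give the highest total score, so they form the global hypothesis. Note that shared dummy observations do not create a conflict.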

3.5. Track Tree Pruning

Pruning is an essential step for MHT due to the exponential increase in the number of track hypotheses over time. We adopt the standard N-scan pruning approach. First, we identify the tree branches that contain the object tracks within the global hypothesis obtained from Eq. (14). Then for each of the selected branches, we trace back to the node at frame $k - N$ and prune the subtrees that diverge from the selected branch at that node (see Fig. 2b). In other words, we consolidate the data association decisions for old observations up to frame $k - (N - 1)$. The underlying assumption is that the ambiguities in data association for frames 1 to $k - N$ can be resolved after looking ahead for a window of $N$ frames [12]. A larger $N$ implies a larger window, so the solution can be more accurate, but it also makes the running time longer. After pruning, track trees that do not contain any track in the global hypothesis are deleted.

Besides N-scan pruning, we also prune track trees that have grown too large. If at any time the number of branches in a track tree exceeds a threshold $B_{th}$, we prune the track tree to retain only the top $B_{th}$ branches based on their track scores.

When we use MHT-DAM (see Table 1), the appearance model enables us to perform additional branch pruning. This enables us to explore a larger gating area without increasing the number of track hypotheses significantly. Specifically, we set $\Delta S_{app}(t) = -\infty$, preventing the tree from spawning a branch for observation $i_t$, when its appearance score $F(X_{i_t}) < c_2$. These are the only pruning mechanisms in our MHT implementation.
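The three pruning rules act on the score increments of Eqs. (11) and (12), so a compact sketch can show both together. The code below is illustrative only: the function names, the image area, and the example numbers are assumptions, hypotheses are represented as plain observation tuples from a single track tree, and the default parameters follow the MHT-DAM row of Table 1.

```python
import math
from typing import List, Sequence, Tuple


def delta_appearance(f: float, c1: float, c2: float) -> float:
    """Appearance increment of Eq. (12), with the gate F < c2 of Sec. 3.5."""
    if f < c2:
        return -math.inf  # the branch is never spawned
    return -math.log(1.0 + math.exp(-2.0 * f)) - math.log(c1)


def delta_motion(d2: float, det_cov: float, area: float) -> float:
    """Motion increment of Eq. (12) for 2-D location measurements."""
    return math.log(area / (2.0 * math.pi)) - 0.5 * math.log(det_cov) - 0.5 * d2


def delta_score(obs: int, d2: float, det_cov: float, f: float, area: float,
                p_d: float = 0.9, w_mot: float = 0.1, w_app: float = 0.9,
                c1: float = 0.3, c2: float = -0.8) -> float:
    """Score increment of Eq. (11); obs == 0 is the dummy observation."""
    if obs == 0:
        return math.log(1.0 - p_d)
    d_app = delta_appearance(f, c1, c2)
    if d_app == -math.inf:
        return -math.inf
    return w_mot * delta_motion(d2, det_cov, area) + w_app * d_app


def keep_top(branches: Sequence[Tuple[tuple, float]],
             b_th: int) -> List[Tuple[tuple, float]]:
    """Retain the b_th best-scoring branches of one track tree."""
    alive = [b for b in branches if b[1] > -math.inf]
    return sorted(alive, key=lambda b: b[1], reverse=True)[:b_th]


def n_scan_prune(hypotheses: Sequence[tuple], selected: tuple,
                 n: int) -> List[tuple]:
    """Keep hypotheses that agree with the selected track up to frame k-(N-1)."""
    fixed = len(selected) - n + 1
    if fixed <= 0:
        return list(hypotheses)
    return [h for h in hypotheses if tuple(h[:fixed]) == tuple(selected[:fixed])]


area = 640.0 * 480.0  # assumed image area V
hit = delta_score(obs=1, d2=1.0, det_cov=4.0, f=1.0, area=area)
miss = delta_score(obs=0, d2=0.0, det_cov=4.0, f=0.0, area=area)
gated = delta_score(obs=2, d2=1.0, det_cov=4.0, f=-1.0, area=area)

branches = [((1, 2, 2), gated), ((1, 2, 0), miss), ((1, 2, 1), hit)]
survivors = keep_top(branches, b_th=2)
kept = n_scan_prune([(1, 2, 1), (1, 2, 0), (1, 3, 3), (1, 0, 3)],
                    selected=(1, 2, 1), n=2)
```

In this toy case the detection with a low appearance score is rejected outright, the dummy branch receives the fixed penalty $\ln(1 - P_D)$, and N-scan pruning with $N = 2$ keeps only the hypotheses that follow the selected track through frame $k - 1$.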

4. Online Appearance Modeling

Since the data association problem is ill-posed, different sets of kinematically plausible trajectories always exist. Thus, many methods make strong assumptions on the motion model, such as linear motion or constant velocity [37, 44, 10]. However, such motion constraints are frequently invalid and can lead to poor solutions. For example, the camera can move or the target of interest may also suddenly change its direction and velocity. Thus, motion-based constraints are not very robust. When target appearances are distinctive, taking the appearance information into account is essential to improve the accuracy of the tracking algorithm.

We adopt the multi-output regularized least squares framework [25] for learning appearance models of targets in the scene. As an online learning scheme, it is less susceptible to drifting than local appearance matching, because multiple appearances from many frames are taken into account. We first review the Multi-output Regularized Least Squares (MORLS) framework and then explain how this framework fits into MHT.

4.1. Multi-output Regularized Least Squares

Multiple linear regressors are trained and updated simultaneously in multi-output regularized least squares. At frame $k$, the weight vectors for the linear regressors are represented by a $d \times n$ weight matrix $W_k$ where $d$ is the feature dimension and $n$ is the number of regressors being trained. Let $X_k = [X_{k,1} | X_{k,2} | \dots | X_{k,n_k}]^\top$ be an $n_k \times d$ input matrix where $n_k$ is the number of feature vectors (i.e. detections), and $X_{k,i}$ represents the appearance features from the $i$-th training example at time $k$. Let $V_k = [V_{k,1} | V_{k,2} | \dots | V_{k,n}]$ denote an $n_k \times n$ response matrix where $V_{k,i}$ is an $n_k \times 1$ response vector for the $i$th regressor at time $k$. When a new input matrix $X_{k+1}$ is received, the response matrix $\hat{V}_{k+1}$ for the new input can be predicted by $X_{k+1} W_k$.

The weight matrix $W_k$ is learned at time $k$. Given all the training examples $(X_t, V_t)$ for $1 \le t \le k$, the weight matrix can be obtained as:

$$\min_{W_k} \sum_{t=1}^{k} \| X_t W_k - V_t \|_F^2 + \lambda \| W_k \|_F^2 \tag{15}$$

where $\| \cdot \|_F$ is the Frobenius norm. The optimal solution is given by the following system of linear equations:

$$(H_k + \lambda I) W_k = C_k \tag{16}$$

where $H_k = \sum_{t=1}^{k} X_t^\top X_t$ is the covariance matrix, and $C_k = \sum_{t=1}^{k} X_t^\top V_t$ is the correlation matrix.

The model is online because at any given time only $H_k$ and $C_k$ need to be stored and updated. $H_k$ and $C_k$ can be updated recursively via:

$$H_{k+1} = H_k + X_{k+1}^\top X_{k+1}, \tag{17}$$

$$C_{k+1} = C_k + X_{k+1}^\top V_{k+1} \tag{18}$$

which only requires the inputs and responses at time $k + 1$.
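The recursion in Eqs. (16) to (18) can be checked on a toy problem. The following sketch uses plain Python lists and a small Gauss-Jordan solver so that it has no dependencies; the class name `OnlineRLS`, the helper functions, and the data are all hypothetical, and a practical implementation would use a linear algebra library and a matrix factorization instead.

```python
from typing import List

Matrix = List[List[float]]


def matmul_t(a: Matrix, b: Matrix) -> Matrix:
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(a[r][i] * b[r][j] for r in range(len(a)))
             for j in range(len(b[0]))] for i in range(len(a[0]))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class OnlineRLS:
    """Stores only H_k (d x d) and C_k (d x n), as in Sec. 4.1."""

    def __init__(self, d: int, n: int, lam: float) -> None:
        self.lam = lam
        self.H = [[0.0] * d for _ in range(d)]
        self.C = [[0.0] * n for _ in range(d)]

    def update(self, X: Matrix, V: Matrix) -> None:
        self.H = add(self.H, matmul_t(X, X))  # Eq. (17)
        self.C = add(self.C, matmul_t(X, V))  # Eq. (18)

    def weights(self) -> Matrix:
        d = len(self.H)
        reg = [[self.H[i][j] + (self.lam if i == j else 0.0)
                for j in range(d)] for i in range(d)]
        return solve(reg, self.C)  # Eq. (16)


# Two frames of toy data: d = 2 features, n = 2 regressors.
X1, V1 = [[1.0, 0.0], [0.0, 1.0]], [[1.0, -1.0], [-1.0, 1.0]]
X2, V2 = [[1.0, 1.0]], [[1.0, -1.0]]

online = OnlineRLS(d=2, n=2, lam=1.0)
online.update(X1, V1)
online.update(X2, V2)
W_online = online.weights()

batch = OnlineRLS(d=2, n=2, lam=1.0)
batch.update(X1 + X2, V1 + V2)  # all examples stacked into one update
W_batch = batch.weights()
```

Updating frame by frame and updating once with all stacked examples give the same weights, which is the property that lets each track hypothesis carry only its own $C$ while the history of inputs is summarized in $H$.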

4.2. Application of MORLS to MHT

We utilize each detected bounding box as a training example. Appearance features from all detection boxes at time $k$ form the input matrix $X_k$. Each tree branch (track hypothesis) is paired with a regressor which is trained with the detections from the time when the track tree was born to the current time $k$. Detections from the entire history of the track hypothesis serve as positive examples and all other detections serve as negative examples. The response for the positive example is 1, and the responses for the negative examples are set to −1. Note that a classification loss function (e.g. hinge loss) will be more suitable for this problem, but then the benefits of efficient updates and an analytic globally optimal solution would be lost.

The online nature of the least squares framework makes it efficient to update multiple regressors as the track tree is extended over time. Starting from one appearance model at the root node, different appearance models will be generated as the track tree spawns different branches. $H$ and $C$ in the current tree layer (corresponding to the current frame) are copied into the next tree layer (next frame), and then updates according to Eqs. (17) and (18) are performed for all of the tree branches in the next tree layer.

Suppose we have $H_{k-1}$ and $C_{k-1}$ and are branching into $n$ branches at time $k$. Note that the update of $H_k$ only depends on $X_k$ and is done once, no matter how many branches are spawned at time $k$. $C_k$ depends on both $X_k$ and $V_k$. Hence, for each new tree branch $i$, one matrix-vector multiplication $X_k^\top V_{k,i}$ needs to be performed. The total time complexity for computing $X_k^\top V_k = [X_k^\top V_{k,1} | X_k^\top V_{k,2} | \dots | X_k^\top V_{k,n}]$ is then $O(d n n_k)$ which is linear in both the number of tree branches $n$ and the number of detections $n_k$.

The most time-consuming operation in training the model is updating and decomposing $H$ in solving Eq. (16). This operation is shared among all the track trees that start at the same frame and is independent of the branches on the track trees. Thus, one can easily spawn many branches in each track tree with minimal additional computation required for appearance updating. This property is unique to tree-based MHT, where all the branches have the same ancestry. If one is training long-term appearance models using other global methods such as [31] and [32], then such computational benefits disappear, and the appearance model would need to be fully updated for each target separately, which would incur substantial computational cost.

As for the appearance features, we utilize the convolutional neural network features trained on the ImageNet+PASCAL VOC dataset in [16]. We follow the protocol in [16] to extract the 4096-dimensional feature for each detection box. For better time and space complexity, a principal component analysis (PCA) is then performed to reduce the dimensionality of the features. In the experiments we take the first 256 principal components.
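The cost argument of this section, one shared update of $H$ per frame and one matrix-vector product per new branch, can be sketched as follows. The helper names and the toy features are hypothetical, and treating the dummy branch as having only negative examples in the current frame is an assumption made here for illustration rather than a detail stated in the text.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def xt_x(X: Matrix) -> Matrix:
    """Shared term X_k^T X_k of Eq. (17), computed once per frame."""
    d = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]


def branch_responses(n_det: int, chosen: int) -> List[float]:
    """+1 for the detection a branch selects, -1 for all others.

    chosen == 0 (dummy observation) yields all -1; this is an assumption.
    """
    return [1.0 if j == chosen else -1.0 for j in range(1, n_det + 1)]


def xt_v(X: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product X_k^T V_{k,i}, with cost O(d * n_k)."""
    d = len(X[0])
    return [sum(X[r][i] * v[r] for r in range(len(X))) for i in range(d)]


def spawn(parent_c: Sequence[float], X: Matrix,
          choices: Sequence[int]) -> Dict[int, List[float]]:
    """Copy the parent's C column into each child and apply Eq. (18)."""
    children = {}
    for chosen in choices:
        update = xt_v(X, branch_responses(len(X), chosen))
        children[chosen] = [c + u for c, u in zip(parent_c, update)]
    return children


# Three detections with d = 2 features each (toy values).
X_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_update = xt_x(X_k)                       # shared by every branch
child_c = spawn(parent_c=[0.5, -0.5], X=X_k, choices=[1, 3, 0])
```

Each additional branch adds only one call to `xt_v`, which matches the $O(d n n_k)$ total stated above, while `xt_x` is evaluated once regardless of how many branches are spawned.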

5. Experiments

In this section we first present several experiments that show the benefits of online appearance modeling on MHT. We use 11 MOT Challenge [24] training sequences and 5 PETS 2009 [14] sequences for these experiments. These sequences cover different difficulty levels of the tracking problem. In addition to these experimental results, we also report the performance of our method on the MOT Challenge and PETS benchmarks for quantitative comparison with other tracking methods.

For performance evaluation, we follow the current evaluation protocols for visual multi-target tracking. The protocols include the multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP) [5]. MOTA is a score which combines false positives, false negatives and identity switches (IDS) of the output trajectories. MOTP measures how well the trajectories are aligned with the ground truth trajectories in terms of the average distance between them. In addition to these metrics, the number of mostly tracked targets (MT), mostly lost targets (ML), track fragmentations (FM), and IDS are also reported. Detailed descriptions about these metrics can be found in [30].

Table 1 shows the default parameter setting for all of the experiments in this section. In the table, our baseline method that only uses motion information is denoted as MHT. This is a basic version of the MHT method described in Section 3 using only the motion score $S^l_{mot}(k)$. Our novel extension of MHT that incorporates online discriminative appearance modeling is denoted as MHT-DAM.

| Method | N-scan | $B_{th}$ | $N_{miss}$ | $P_D$ | $d_{th}$ | $w_{mot}$, $w_{app}$ | $c_1$, $c_2$ |
|---|---|---|---|---|---|---|---|
| MHT-DAM | 5 | 100 | 15 | 0.9 | 12 | 0.1, 0.9 | 0.3, −0.8 |
| MHT | 5 | 100 | 15 | 0.9 | 6 | 1.0, 0.0 | - |

Table 1. Parameter Setting

5.1. Pruning Effectiveness

As we explained earlier, pruning is central to the success of MHT. It is preferable to have a discriminative score function so that more branches can be pruned early and reliably. A measure to quantify this notion is the entropy:

$$H(B_k) = -\sum_{v} p(B_k = v) \ln p(B_k = v) \tag{19}$$

where $p(B_k = v)$ is the probability of selecting the $v$th tree branch at time $k$ for a given track tree and is defined as:

$$p(B_k = v) = \frac{e^{\Delta S^v(k)}}{\sum_{v} e^{\Delta S^v(k)}}. \tag{20}$$

For the normalization, we take all the branches at time $k$ from the same target tree.

[Figure 3 panels: (a), (b), (c)]

Figure 3. (a) Average effective number of branches per track tree for different pruning mechanisms. MHT-DAM uses a gating threshold $d_{th} = 12$ and MHT uses a gating threshold $d_{th} = 6$. Even with a larger gating area, the appearance model for MHT-DAM is capable of significantly reducing the number of branches. (b) Sensitivity analysis for N-scan parameter $N$. (c) Sensitivity analysis for the maximum number of branches $B_{th}$. The blue lines are the results from MHT-DAM and the green lines are the results from MHT. The first row shows the MOTA score (higher is better) and the second row shows the number of ID switches (averaged per target, lower is better) over different pruning parameters.

Table 2. Results from 2D MOT 2015 Challenge (accessed on 9/25/2015)

| Method | MOTA | MOTP | FAF | MT | ML | FP | FN | IDS | FM | Hz |
|---|---|---|---|---|---|---|---|---|---|---|
| MHT-DAM | 32.4 | 71.8 | 1.6 | 16.0% | 43.8% | 9,064 | 32,060 | 435 | 826 | 0.7 |
| MHT | 29.2 | 71.7 | 1.7 | 12.1% | 53.3% | 9,598 | 33,467 | 476 | 781 | 0.8 |
| LP_SSVM [41] | 25.2 | 71.7 | 1.4 | 5.8% | 53.0% | 8,369 | 36,932 | 646 | 849 | 41.3 |
| ELP [27] | 25.0 | 71.2 | 1.3 | 7.5% | 43.8% | 7,345 | 37,344 | 1,396 | 1,804 | 5.7 |
| MotiCon [23] | 23.1 | 70.9 | 1.8 | 4.7% | 52.0% | 10,404 | 35,844 | 1,018 | 1,061 | 1.4 |
| SegTrack [28] | 22.5 | 71.7 | 1.4 | 5.8% | 63.9% | 7,890 | 39,020 | 697 | 737 | 0.2 |
| CEM [29] | 19.3 | 70.7 | 2.5 | 8.5% | 46.5% | 14,180 | 34,591 | 813 | 1,023 | 1.1 |
| RMOT [43] | 18.6 | 69.6 | 2.2 | 5.3% | 53.3% | 12,473 | 36,835 | 684 | 1,282 | 7.9 |
| SMOT [13] | 18.2 | 71.2 | 1.5 | 2.8% | 54.8% | 8,780 | 40,310 | 1,148 | 2,132 | 2.7 |
| TBD [15] | 15.9 | 70.9 | 2.6 | 6.4% | 47.9% | 14,943 | 34,777 | 1,939 | 1,963 | 0.7 |
| TC_ODAL [2] | 15.1 | 70.5 | 2.2 | 3.2% | 55.8% | 12,970 | 38,538 | 637 | 1,716 | 1.7 |
| DP_NMS [35] | 14.5 | 70.8 | 2.3 | 6.0% | 40.8% | 13,171 | 34,814 | 4,537 | 3,090 | 444.8 |

With the entropy, we can define the effective number of the branches $N_{eff}$ within each track tree as:

$$N_{eff} = e^{H(B_k)}. \tag{21}$$

When all the branches in the target tree have the same probability (i.e. when the features are not discriminative), $N_{eff}$ is equal to the actual number of branches, which means one would need to explore all the possibilities. In the opposite case where a certain branch has the probability of 1, $N_{eff}$ is 1 and it is only necessary to examine a single branch.

Fig. 3a shows the number of effective branches for different pruning mechanisms. For this experiment, we set the default gating threshold $d_{th}$ to 12. The highest bar (dark red) in each PETS sequence in Fig. 3a shows the average number of tree branches generated per frame with the default gating parameter. A smaller gating area ($d_{th} = 6$) (yellow bar) only reduces the number of branches by a small amount but might prune out fast-moving hypotheses. Combined with the Kalman filter motion model, the reduction is more significant (cyan bar), but the algorithm still retains more than half of the effective branches compared to the full set with $d_{th} = 12$.
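The quantities in Eqs. (19) to (21) are easy to evaluate numerically. The sketch below computes $N_{eff}$ from a list of score increments for the branches of one track tree; the function name and the example scores are made up, and subtracting the maximum score before exponentiating is a standard numerical safeguard that does not change the probabilities of Eq. (20).

```python
import math
from typing import List, Sequence


def effective_branches(delta_scores: Sequence[float]) -> float:
    """Effective number of branches N_eff = exp(H(B_k)), Eqs. (19)-(21)."""
    top = max(delta_scores)
    # Shifting by the maximum avoids overflow and leaves Eq. (20) unchanged.
    weights: List[float] = [math.exp(s - top) for s in delta_scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


uniform = effective_branches([1.0, 1.0, 1.0, 1.0])   # no branch stands out
peaked = effective_branches([50.0, 0.0, 0.0])        # one branch dominates
mixed = effective_branches([2.0, 1.0, 0.0, -1.0])    # somewhere in between
```

Equal scores give $N_{eff}$ equal to the number of branches, a single dominant branch gives a value close to 1, and intermediate cases fall in between, which is the behaviour described in the two limiting cases above.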

Incorporating the appearance likelihood significantly reduces the effective number of branches. In both the MOT Challenge and PETS sequences, the average effective number of branches in a tree becomes ∼50% of the total number of branches. And this is achieved without lowering the size of the gating area, thereby retaining fast-moving targets. This shows that long-term appearance modeling significantly reduces the ambiguities in data association, which makes MHT search more effective and efficient.

Analysis of Pruning Parameters. MHT was known to be sensitive to its parameter settings [32]. In this section, we perform a sensitivity analysis of MHT with respect to its pruning parameters and demonstrate that our appearance model helps to alleviate this parameter dependency. In our MHT implementation, there are two MHT pruning parameters. One is the N-scan pruning parameter $N$, the other is the maximum number of tree branches $B_{th}$. We tested MHT using 7 different values for $N$ and 13 different values for $B_{th}$. We assessed the number of errors in terms of the MOTA score and identity switches (IDS).

Fig. 3b shows the results from this analysis over different N-scan parameters. We fix the maximum number of tree branches to 300, a large enough number so that very few branches are pruned when $N$ is large. The results show that motion-based MHT is negatively affected when the N-scan parameter is small, while MHT-DAM is much less sensitive to the parameter change. This demonstrates that appearance features are more effective than motion features in reducing the number of look-ahead frames that are required to resolve data association ambiguities. This is intuitive, since many targets are capable of fast movement over a short time scale, while appearance typically changes more slowly.

Fig. 3c illustrates the change in the MOTA and IDS scores when the maximum number of branches varies from 1 to 120. We fix the N-scan pruning parameter to 5, which is the setting for all other experiments in the paper. Note that appearance modeling is particularly helpful in preventing identity switches.
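To make the roles of the two pruning parameters concrete, the following Python sketch applies N-scan pruning and a branch cap to the leaf branches of a single track tree. It is only a simplified illustration under stated assumptions: the `Branch` record, its `path` and `score` fields, and the function names are hypothetical, and the branch passed as `best` is assumed to be the one selected by the global hypothesis, which is computed elsewhere.

```python
from typing import List, NamedTuple, Tuple


class Branch(NamedTuple):
    """Hypothetical leaf branch of one track tree."""
    path: Tuple[int, ...]  # detection id chosen at each frame, oldest first
    score: float           # accumulated track score of this branch


def n_scan_prune(branches: List[Branch], best: Branch, n: int) -> List[Branch]:
    """Keep only branches that agree with `best` up to N frames back.

    With k frames in the path, the node at frame k - N becomes the new root,
    so every surviving branch must share the first k - N path entries.
    """
    keep_len = len(best.path) - n
    if keep_len <= 0:
        return list(branches)  # tree is not deep enough to prune yet
    prefix = best.path[:keep_len]
    return [b for b in branches if b.path[:keep_len] == prefix]


def cap_branches(branches: List[Branch], b_th: int) -> List[Branch]:
    """Keep at most `b_th` branches, preferring higher track scores."""
    return sorted(branches, key=lambda b: b.score, reverse=True)[:b_th]


if __name__ == "__main__":
    tree = [
        Branch((1, 2, 3), 5.0),
        Branch((1, 2, 4), 4.0),
        Branch((1, 5, 6), 3.5),
        Branch((7, 8, 9), 1.0),
    ]
    best = tree[0]
    survivors = n_scan_prune(tree, best, n=1)
    print([b.path for b in cap_branches(survivors, b_th=300)])
```

A small N commits to an association after only a few look-ahead frames, which is where the text reports motion-based MHT degrading; a small branch cap discards low-scoring branches regardless of depth.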

5.2. Benchmark Comparison

We test our method on the MOT Challenge benchmark and the PETS 2009 sequences. The MOT benchmark contains 11 training and 11 testing sequences. Users tune their algorithms on the training sequences and then submit the results on the testing sequences to the evaluation server. This benchmark is of larger scale and includes more variations than the PETS benchmark. Table 2 shows our results on the benchmark where MHT-DAM outperforms the best previously published method by more than 7% on MOTA. In addition, 16.0% of the tracks are mostly tracked, as compared to the next competitor at 8.5%. We also achieved the lowest number of ID switches by a large margin. This shows the robustness of MHT-DAM over a large variety of videos under different conditions. Also note that because MOT is significantly more difficult than the PETS dataset, the appearance model becomes more important to the performance.

Table 3 demonstrates the performance of MHT and MHT-DAM on the PETS sequences compared to one of the state-of-the-art tracking algorithms [31]. For a fair comparison, the detection inputs, ground truth annotations, and evaluation script provided by [31] were used. Our basic MHT implementation already achieves a better or comparable result in comparison to [31] for most PETS sequences and metrics. Cox’s method is also surprisingly close in performance to [31] with ∼6% lower MOTA on average with the exception of the S2L2 sequence where it is ∼20% lower. However, considering that Cox’s MHT implementation was done almost 20 years ago, and that it can run in real time due to the efficient implementation (40 FPS on average for PETS), the results from Cox’s method are impressive. After adding appearance modeling to MHT, our algorithm MHT-DAM makes fewer ID switches and has higher MOTA and MOTP scores in comparison to previous methods.
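For readers unfamiliar with the MOTA score used in these comparisons, the sketch below computes it following the standard CLEAR MOT definition [5]: one minus the ratio of all errors (missed targets, false positives, and identity switches, summed over all frames) to the total number of ground-truth objects. The function name and the counts in the usage example are made up for illustration and do not correspond to any result reported in this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy as defined by CLEAR MOT.

    All arguments are totals accumulated over every frame of a sequence.
    The score can be negative when the errors outnumber the ground truth.
    """
    if num_gt_objects <= 0:
        raise ValueError("num_gt_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_objects


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    score = mota(false_negatives=10, false_positives=5,
                 id_switches=5, num_gt_objects=100)
    print(f"MOTA = {score:.1%}")
```

Because identity switches enter the numerator directly, a tracker that reduces ID switches while keeping detection errors fixed improves its MOTA accordingly.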

6. Conclusion

Multiple Hypothesis Tracking solves the multidimensional assignment problem through an efficient breadth-first search process centered around the construction and pruning of hypothesis trees. Although it has been a workhorse method for multi-target tracking in general, it has largely fallen out-of-favor for visual tracking. Recent advances in object detection have provided an opportunity to rehabilitate the MHT method. Our results demonstrate that a modern formulation of a standard MHT approach can achieve comparable performance to several state-of-the-art methods on reference datasets. Moreover, an implementation of MHT by Cox [12] from the 1990s comes surprisingly close to state-of-the-art performance on 4 out of 5 PETS sequences. We have further demonstrated that the MHT framework can be extended to include on-line learned appearance models, resulting in substantial performance gains. The software and evaluation results are available from our project website.²

Acknowledgments: This work was supported in part by the Simons Foundation award 288028, NSF Expedition award 1029679 and NSF IIS award 1320348.

Table 3. Tracking Results on the PETS benchmark

Sequence  Method          MOTA   MOTP   MT  ML  FM   IDS
S2L1      MHT-DAM         92.6%  79.1%  18   0   12   13
S2L1      MHT             92.3%  78.8%  18   0   15   17
S2L1      Cox’s MHT [12]  84.1%  77.5%  17   0   65   45
S2L1      Milan [31]      90.3%  74.3%  18   0   15   22
S2L2      MHT-DAM         59.2%  61.4%  10   2  162  120
S2L2      MHT             57.2%  58.7%   7   1  150  134
S2L2      Cox’s MHT [12]  38.0%  58.8%   3   8  273  154
S2L2      Milan [31]      58.1%  59.8%  11   1  153  167
S2L3      MHT-DAM         38.5%  70.8%   9  22    9    8
S2L3      MHT             40.8%  67.3%  10  21   19   18
S2L3      Cox’s MHT [12]  34.8%  66.1%   6  22   65   35
S2L3      Milan [31]      39.8%  65.0%   8  19   22   27
S1L1-2    MHT-A+M         62.1%  70.3%  21   9   14   11
S1L1-2    MHT-M           61.6%  68.0%  22  12   23   31
S1L1-2    Cox’s MHT [12]  52.0%  66.5%  17  14   52   41
S1L1-2    Milan [31]      60.0%  61.9%  21  11   19   22
S1L2-1    MHT-DAM         25.4%  62.2%   3  24   30   25
S1L2-1    MHT             24.0%  58.4%   5  23   29   33
S1L2-1    Cox’s MHT [12]  22.6%  57.4%   2  23   57   34
S1L2-1    Milan [31]      29.6%  58.8%   2  21   34   42

² http://cpl.cc.gatech.edu/projects/MHT/

References

[1] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In CVPR, 2012.
[2] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In CVPR, 2014.
[3] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Multi-commodity network flow for tracking multiple people. PAMI, 2014.
[4] J. Berclaz, E. Turetken, F. Fleuret, and P. Fua. Multiple object tracking using K-shortest paths optimization. PAMI, 2011.
[5] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Image and Video Processing, 2008.
[6] S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, 1999.
[7] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. PAMI, 2011.
[8] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[9] S. Busygin. A new trust region technique for the maximum weight clique problem. Discrete Appl. Math., 2006.
[10] A. Butt and R. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, 2013.
[11] R. T. Collins. Multitarget data association with higher-order motion models. In CVPR, 2012.
[12] I. J. Cox and S. L. Hingorani. An efficient implementation of Reid’s multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. PAMI, 1996.
[13] C. Dicle, O. Camps, and M. Sznaier. The way they move: Tracking targets with similar appearance. In ICCV, 2013.
[14] J. Ferryman and A. Ellis. PETS2010: Dataset and challenge. In AVSS, 2010.
[15] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. PAMI, 2014.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[17] M. Han, W. Xu, H. Tao, and Y. Gong. An algorithm for multiple object trajectory tracking. In CVPR, 2004.
[18] C. Huang, Y. Li, and R. Nevatia. Multiple target tracking by learning-based hierarchical association of detection responses. PAMI, 2013.
[19] Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. PAMI, 2005.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] C.-H. Kuo, C. Huang, and R. Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In CVPR, 2010.
[22] C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? In CVPR, 2011.
[23] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. In CVPR, 2014.
[24] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942 [cs], 2015.
[25] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.
[26] J. Liu, P. Carr, R. T. Collins, and Y. Liu. Tracking sports players with context-conditioned motion models. In CVPR, 2013.
[27] N. McLaughlin, J. Martinez Del Rincon, and P. Miller. Enhancing linear programming with motion modeling for multi-target tracking. In WACV, 2015.
[28] A. Milan, L. Leal-Taixé, I. Reid, and K. Schindler. Joint tracking and segmentation of multiple targets. In CVPR, 2015.
[29] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. PAMI, 2014.
[30] A. Milan, K. Schindler, and S. Roth. Challenges of ground truth evaluation of multi-target tracking. In CVPR Workshop, 2013.
[31] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
[32] S. Oh, S. Russell, and S. Sastry. Markov Chain Monte Carlo data association for multi-target tracking. IEEE Transactions on Automatic Control, 2009.
[33] P. R. Ostergard. A new algorithm for the maximum-weight clique problem. Nordic Journal of Computing, 2001.
[34] D. J. Papageorgiou and M. R. Salpukas. The maximum weight independent set problem for data association in multiple hypothesis tracking. Optimization and Cooperative Control Strategies, 2009.
[35] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[36] D. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 1979.
[37] A. Segal and I. Reid. Latent data association: Bayesian model selection for multi-target tracking. In ICCV, 2013.
[38] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Tracking multiple people under global appearance constraints. In ICCV, 2011.
[39] A. Smeulder, D. Chu, R. Cucchiara, S. Calderara, A. Deghan, and M. Shah. Visual tracking: An experimental survey. PAMI, 2014.
[40] X. Song, J. Cui, H. Zha, and H. Zhao. Vision-based multiple interacting targets tracking via on-line supervised learning. In ECCV, 2008.
[41] S. Wang and F. C. Learning optimal parameters for multitarget tracking. In BMVC, 2015.
[42] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
[43] J. Yoon, H. Yang, J. Lim, and K. Yoon. Bayesian multi-object tracking using motion context from multiple objects. In WACV, 2015.
[44] A. R. Zamir, A. Dehghan, and M. Shah. GMCP-tracker: Global multi-object tracking using generalized minimum clique graphs. In ECCV, 2012.
[45] L. Zhang and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.