The boffins claim FairMOT outperforms state-of-the-art models on public data sets at 30 frames per second. If it ever ends up in a product, it would likely be in areas such as elderly care and security, and it could be used to track illnesses like COVID-19.
Most existing methods employ multiple models to track objects: (1) a detection model that localises objects of interest and (2) an association model that extracts features used to reidentify briefly obscured objects.
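To make the association step concrete, here is a minimal sketch of matching existing tracks to new detections by comparing feature vectors. It is not taken from any of the systems discussed here: the function names, the greedy matching strategy, and the `min_similarity` threshold are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(tracks, detections, min_similarity=0.5):
    """Greedily match track ids to detection indices.

    tracks: dict mapping a track id to its last known feature vector.
    detections: list of feature vectors found in the current frame.
    min_similarity: hypothetical cut-off below which no match is made.
    """
    candidates = []
    for track_id, track_feat in tracks.items():
        for det_idx, det_feat in enumerate(detections):
            sim = cosine_similarity(track_feat, det_feat)
            if sim >= min_similarity:
                candidates.append((sim, track_id, det_idx))

    # Most similar pairs are claimed first.
    candidates.sort(key=lambda item: item[0], reverse=True)

    matches = {}
    used_detections = set()
    for _, track_id, det_idx in candidates:
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches
```

A track that finds no sufficiently similar detection is simply left out of the result, which is roughly how a briefly obscured object would be carried over until it reappears.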
However, FairMOT adopts an anchor-free approach to estimate object centres on a high-resolution feature map, which allows the reidentification features to better align with the centres.
A parallel branch estimates the features used to predict the objects’ identities, while a “backbone” module fuses together the features to deal with objects of different scales.
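The idea of reading identity features off at object centres can be sketched in a few lines. This is an illustration only, not the researchers' code: it assumes a centre "heatmap" stored as a 2D list of scores, an embedding map of the same height and width, and a hypothetical `threshold` of 0.5.

```python
def find_centres(heatmap, threshold=0.5):
    """Return (row, col) cells that score at least `threshold` and are
    strictly higher than every neighbouring cell."""
    rows, cols = len(heatmap), len(heatmap[0])
    centres = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [
                heatmap[nr][nc]
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            ]
            if all(score > n for n in neighbours):
                centres.append((r, c))
    return centres


def features_at_centres(embedding_map, centres):
    """Pick the feature vector stored at each centre location."""
    return [embedding_map[r][c] for r, c in centres]
```

Because each object contributes a single centre cell, each object yields exactly one feature vector in this sketch, rather than several competing ones from overlapping regions.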
The researchers trained FairMOT on a data set compiled from six public corpora for human detection and search: ETH, CityPerson, CalTech, MOT17, CUHK-SYSU, and PRW.
Training took 30 hours on two Nvidia RTX 2080 graphics cards. After removing duplicate clips, they tested the trained model against benchmarks that included 2DMOT15, MOT16, and MOT17. All came from the MOT Challenge, a framework for validating people-tracking algorithms that ships with data sets, an evaluation tool providing several metrics, and tests for tasks like surveillance and sports analysis.
The team reports that FairMOT outperformed both rival methods it was compared with on the MOT16 data set, with an inference speed “near video rate.”
The report on the experiments said that, while there has been remarkable progress on object detection and re-identification in recent years, little attention has been paid to accomplishing the two tasks in a single network to improve inference speed.
“The initial attempts along this path ended up with degraded results mainly because the re-identification branch is not appropriately learned,” concluded the researchers in a paper describing FairMOT. “We find that the use of anchors in object detection and identity embedding is the main reason for the degraded results. In particular, multiple nearby anchors, which correspond to different parts of an object, may be responsible for estimating the same identity, which causes ambiguities for network training.”