
[MED] Summary of Methods Used in the MED Task (TRECVID 2015)

2016-04-24
Sixteen teams participated in the TRECVID MED task last year, but I only focus on the methods of the 8 teams that produced better results than ours.

BCMI-SJTU[1]:

Given a video, they first extract key frames (detect video shot boundaries and use the middle frame of each shot to represent it). A CNN (VGG-16 [2]; input: 10 patches extracted at 10 pre-defined regions of each key frame, resized to 224x224; output: the last convolutional layer conv5, with LCD [3] used to reduce its dimensionality [4]) is then used to extract frame-level features. Next, the frame-level features are pooled (VLAD / Fisher Vector [5,6]) into a video-level feature, and finally an SVM classifier is used to score each video.
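A minimal sketch of this frame-to-video pipeline, with plain average pooling standing in for the LCD + VLAD/Fisher-Vector step; `extract_keyframes` and `cnn_conv5_features` are hypothetical helpers for the shot-based keyframe extractor and the VGG-16 conv5 encoder, and the training/test lists are assumed given:

```python
# Hedged sketch: average pooling replaces the VLAD/FV pooling of the paper;
# extract_keyframes() and cnn_conv5_features() are hypothetical helpers.
import numpy as np
from sklearn.svm import LinearSVC

def video_descriptor(video_path):
    frames = extract_keyframes(video_path)                      # shot-middle keyframes
    feats = np.stack([cnn_conv5_features(f) for f in frames])   # (n_frames, d)
    return feats.mean(axis=0)                                   # video-level feature

X_train = np.stack([video_descriptor(v) for v in train_videos])  # train_videos/labels assumed given
clf = LinearSVC(C=1.0).fit(X_train, train_labels)
test_scores = clf.decision_function(
    np.stack([video_descriptor(v) for v in test_videos]))        # per-video event scores
```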

NTT-Fudan[7]

feature

traditional features:

IDT(MBH, HOG, HOF)[8]

Along with the densely extracted trajectories, three descriptors are computed: HOG, HOF, and MBH. They first reduce the dimensionality of the HOG, HOF and MBH descriptors by a factor of two using Principal Component Analysis (PCA). The three descriptors are then quantized separately using the Fisher Vector (FV) representation with a vocabulary size of 256.
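A simplified sketch of this PCA + Fisher Vector step for one descriptor channel; `load_idt_descriptors` is a hypothetical helper, the GMM vocabulary would in practice be trained on descriptors pooled from many training videos, and the usual power/L2 normalisation is omitted:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

desc = load_idt_descriptors("video.avi", channel="hog")    # (n_traj, d), hypothetical loader
pca = PCA(n_components=desc.shape[1] // 2).fit(desc)       # halve the dimension
x = pca.transform(desc)

# Vocabulary of 256 Gaussians; normally fit on a large pool of training descriptors.
gmm = GaussianMixture(n_components=256, covariance_type="diag").fit(x)

def fisher_vector(x, gmm):
    q = gmm.predict_proba(x)                               # (n, K) soft assignments
    n = x.shape[0]
    fv = []
    for k in range(gmm.n_components):
        diff = (x - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        w = q[:, k:k + 1]
        fv.append((w * diff).sum(0) / (n * np.sqrt(gmm.weights_[k])))               # 1st-order stats
        fv.append((w * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * gmm.weights_[k]))) # 2nd-order stats
    return np.concatenate(fv)                              # 2 * K * d dimensions

video_fv = fisher_vector(x, gmm)
```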

MFCC

MFCC descriptors are first computed over 32 ms time windows (with 16 ms overlap) of the soundtrack, and all descriptors are then quantized into a single BoW feature representation.
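A sketch of the MFCC bag-of-words computation with librosa and k-means; the codebook size (1000 here) and the `training_mfccs` pool are assumptions, since they are not stated above:

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("video_audio.wav", sr=None)
n_fft = int(0.032 * sr)                          # 32 ms window
hop = int(0.016 * sr)                            # 16 ms step (50% overlap)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop).T   # (n_frames, 13)

# training_mfccs: hypothetical pool of MFCCs from the training soundtracks.
codebook = KMeans(n_clusters=1000).fit(training_mfccs)
words = codebook.predict(mfcc)
bow = np.bincount(words, minlength=codebook.n_clusters).astype(float)
bow /= bow.sum()                                 # L1-normalised BoW histogram
```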

Deep Features

VGG19

They fine-tune the VGG19 model on the full ImageNet dataset to increase its generalization ability.

Given a video clip, they extract the outputs of the two fully-connected layers (VGG19-fc6, VGG19-fc7) for each frame, then average the frame-level features into a video-level representation.
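A sketch of the fc6/fc7 extraction and average pooling using the stock torchvision VGG19 (not the ImageNet-finetuned model the team used); `frames` is assumed to be a list of RGB uint8 frames, and the `weights=` argument assumes a recent torchvision version:

```python
import torch
import numpy as np
from torchvision import models, transforms

vgg = models.vgg19(weights="IMAGENET1K_V1").eval()            # torchvision >= 0.13 API
fc6 = torch.nn.Sequential(*list(vgg.classifier.children())[:2])   # Linear(25088, 4096) + ReLU
fc7 = torch.nn.Sequential(*list(vgg.classifier.children())[:5])   # up to the second ReLU (4096-d)

prep = transforms.Compose([
    transforms.ToPILImage(), transforms.Resize((224, 224)), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def video_feature(frames):
    feats6, feats7 = [], []
    with torch.no_grad():
        for f in frames:                                      # f: HxWx3 uint8 RGB frame
            x = prep(f).unsqueeze(0)
            flat = vgg.avgpool(vgg.features(x)).flatten(1)    # input to the classifier
            feats6.append(fc6(flat))
            feats7.append(fc7(flat))
    return (torch.cat(feats6).mean(0).numpy(),                # 4096-d fc6 average
            torch.cat(feats7).mean(0).numpy())                # 4096-d fc7 average
```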

LSTM

They directly adopt an LSTM model trained on another video dataset (UCF-101) and use the average output over all time steps of the last LSTM layer as the feature (512-d).

Concept Feature(VGG19 CNN model):

C-20K

For each key frame in a given video, they obtain a 20,574-d concept score vector with the trained VGG19 model; the frame-level scores are then averaged to produce a video-level concept feature vector.

C-233

A video-level concept representation is obtained by average-pooling the scores of all frames.

classification

Linear SVMs

The appearance and motion features are concatenated into a single long vector.

χ² SVMs

The MFCC audio features, deep features, and concept scores are each mapped through a χ² kernel separately and then trained with independent classifiers.
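A sketch of one such χ²-kernel classifier via a precomputed kernel in scikit-learn; features are assumed to be non-negative (e.g. L1-normalised histograms or scores):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_chi2_svm(X_train, y_train, gamma=1.0, C=1.0):
    K = chi2_kernel(X_train, gamma=gamma)           # (n_train, n_train) chi-square kernel
    return SVC(kernel="precomputed", C=C).fit(K, y_train)

def score_chi2_svm(clf, X_test, X_train, gamma=1.0):
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)   # rows: test, cols: train
    return clf.decision_function(K_test)
```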

fusion

For each video, the output scores of the multiple classifiers are fused to compute the final prediction (how is not specified).

FUDAN[9]

The feature and classification parts are the same as in the NTT-Fudan system described above.



Fusion

They use adaptive multi-stream fusion [10] to learn the optimal fusion weights for each class. The learning process is regularized by automatically estimated class relationships: in detail, they build a correlation matrix V_m of the classes for the m-th stream from the corresponding prediction scores, where each entry V_ij indicates the percentage of samples with ground-truth class i that are wrongly classified into class j.
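A sketch of how such a class-relationship matrix can be computed for one stream from its predictions (the adaptive weight learning itself is not shown); class labels are assumed to be integers 0..n_classes-1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def class_correlation_matrix(y_true, y_pred, n_classes):
    # cm[i, j]: number of samples of true class i predicted as class j
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    np.fill_diagonal(cm, 0.0)                         # keep only the misclassifications
    per_class = np.bincount(y_true, minlength=n_classes).astype(float)
    return cm / np.maximum(per_class[:, None], 1.0)   # fraction of class-i samples sent to class j
```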

NII-HITACHI-UIT[11]

Preprocessing

NII-UIT

video:

size: 320x240

frame rate: 50 fps or below -> keep; above 50 fps -> resample to 25 fps

keyframes: one keyframe is extracted from the resized video every two seconds (a sketch follows this block)

audio

extracted from the original videos
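A minimal sketch of the NII-UIT keyframe step (resize to 320x240, one keyframe every two seconds) with OpenCV; the re-encoding of >50 fps videos to 25 fps is not handled here:

```python
import cv2

def extract_keyframes(path, step_sec=2.0, size=(320, 240)):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    frames, t = [], 0.0
    while t < duration:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)    # seek to t seconds
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))        # 320x240 keyframe
        t += step_sec
    cap.release()
    return frames
```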

Hitachi

audio

downsampled to 16000Hz

training and validation: both audio channels (if dual-channel)

testing: the right channel only (if dual-channel)

Feature Extraction

NII-UIT

image: SIFT with Hessian Laplace detector

motion: Improved Dense Trajectories with HOG, HOF, MBH

audio: MFCC

deep features

Hitachi

audio: MFCC (used with an LSTM classifier and a feed-forward neural network classifier)

Feature Representation

NII-UIT

GMM + PCA + Fisher Vector

Hitachi

Bag-of-Words(k-means)

Event Detection

NII-UIT

SVM

Hitachi

a traditional feed-forward neural network is trained as the event detector using the scaled conjugate gradient method

LSTM layer with 10 time steps

Fusion Methods

the late fusion approach -> all features are combined with equal weights

zero-shot event detection

Concept detection, bag-of-words with tf-idf weights, a concept expansion strategy, and cosine similarity are used.

First, concepts are detected on sampled frames of the video and averaged over the video. Second, the event description is represented as a bag-of-words vector (the dictionary is obtained from the English Wikipedia corpus). Next, the concept expansion strategy [12] adds the 10 most similar concepts obtained from a word2vec model to expand each concept, so the video is likewise represented as a bag-of-words vector. Finally, the cosine similarity between each video and the concept-based event representation is calculated.
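A sketch of this zero-shot matching under some assumptions: `w2v` is a gensim `KeyedVectors` model, concept names are single in-vocabulary words, `event_words` is the tokenised event description, and `video_concept_scores` holds the frame-averaged concept detector outputs (n_videos x n_concepts):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def event_vector(event_words, concepts, w2v, topn=10):
    # BoW-style event vector over the concept vocabulary with word2vec expansion.
    vec = np.zeros(len(concepts))
    for w in event_words:
        if w not in w2v:
            continue
        expanded = [w] + [n for n, _ in w2v.most_similar(w, topn=topn)]
        for i, c in enumerate(concepts):
            if c in expanded:
                vec[i] += 1.0
    return vec

def rank_videos(video_concept_scores, ev_vec):
    sims = cosine_similarity(video_concept_scores, ev_vec[None, :]).ravel()
    return np.argsort(-sims)              # video indices, most relevant first
```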

VIREO-TNO[13]

features

visual

keyframe-level: one frame per two seconds

shot-level: the time duration is set to be five seconds

6 high-level concept features (ImageNet_1000, SIN_346, RC_497, Places_205, FCVID_239, Sports_487; all DCNN-based) & 1 low-level motion feature (IDT HOG/HOF/MBH -> PCA -> FV -> SVM)

textual

Tesseract OCR [14] is used to extract text from key frames (roughly one per second), and three rules check whether the text is meaningful. The results are fed into Lucene for indexing and searching. A manually defined Boolean query based on the event description and Wikipedia is used, in combination with term frequency, to retrieve positive videos.

speech

A manually defined Boolean query in combination with a PhraseQuery is used to search for relevant speech information. The standard Dirichlet Language Model in Lucene is used to calculate the similarity between this manually defined query and the audio track of a video, and this similarity is used to rank the videos.

frameworks for every subtask

0-example

automatic SQG: (a large concept bank + a strategy for wisely picking the right concepts for each event) when a concept detector already exists for the whole event, it is generally wise to include only concepts that are distinctive to this event (for automatic SQG they simply ignore the other concepts if a whole-event detector is found); if no detector for the whole event is found, it is beneficial to use a few accurate concepts to represent the event.

manual SQG: (based on the results of the automatic SQG)

matching (direct match, indirect match, top-3) and similarity (used as a weight) between the event name and each of the concept classifier labels.

fusion of SQG + ASR + OCR (to revisit later)

10-example

visual classifiers

1. Concept-bank features are concatenated into one feature vector, and an event classifier is trained with a Chi-Square SVM.

2. IDT features + linear SVM [8]

fusion:

visual system: Concept-bank + IDT -> average fusion

visual-zero system: visual system + 0-ex -> average fusion

visual-zero system + OCR/ASR -> (same fusion as in the 0-example case)

100-example

fusion: (visual system) using the joint probability; only videos that receive a low score from both classifiers are put at the bottom of the list.

MediaMill[15]

They mainly use Google's Inception network [16], trained on a large personalized selection of ImageNet concepts [17].

0-example

They employ a semantic embedding space to translate video-level concept probabilities into event-specific concepts (see the sketch after the items below).

probabilities: computed by averaging frame-level scores from the probability layer of the deep network

event-specific concepts: the top-ranking terms from the event kit, based on tf-idf

embedding space: word2vec model[18]
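A sketch of the embedding-space matching under the assumption that concept names and event terms are single words covered by a gensim `KeyedVectors` model `w2v` (300-d vectors assumed); the aggregation here is a weighted average of word vectors followed by cosine similarity:

```python
import numpy as np

def embed(words, weights, w2v, dim=300):
    # Weighted, L2-normalised average of the word vectors; dim=300 is an assumption.
    vecs = [w * w2v[t] for t, w in zip(words, weights) if t in w2v]
    if not vecs:
        return np.zeros(dim)
    v = np.sum(vecs, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def zero_shot_score(concept_names, concept_probs, event_terms, event_tfidf, w2v):
    video_vec = embed(concept_names, concept_probs, w2v)   # probability-weighted concepts
    event_vec = embed(event_terms, event_tfidf, w2v)       # tf-idf-weighted event terms
    return float(np.dot(video_vec, event_vec))             # cosine similarity (unit vectors)
```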

10-example

deep learning features: frame representations are extracted twice per second at both the pool5 layer and the probability layer. For both layers, the features are averaged per video and then normalized. A histogram intersection kernel SVM is trained on the representations from both layers (a sketch follows this list).

modality 1: IDT (MBH, HOG) -> FV -> linear SVM

modality 2: MFCC -> FV -> histogram intersection kernel SVM
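A sketch of the histogram intersection kernel SVM mentioned above, implemented as a precomputed kernel since scikit-learn has no built-in HIK; inputs are assumed to be non-negative, histogram-like features:

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection_kernel(X, Y):
    # K[i, j] = sum_k min(X[i, k], Y[j, k])
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def train_hik_svm(X_train, y_train, C=1.0):
    K = hist_intersection_kernel(X_train, X_train)
    return SVC(kernel="precomputed", C=C).fit(K, y_train)

def score_hik_svm(clf, X_test, X_train):
    return clf.decision_function(hist_intersection_kernel(X_test, X_train))
```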

100-example

same as 10-example + bag-of-fragments model[19](re-uses the pool5 layer for the frame representations)

for each event, the most discriminative video fragments are discovered from the hundred training examples, and these fragments are max-pooled over a video to obtain the fragment-based video representation.
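A sketch of only the max-pooling step of this representation; discovering the discriminative fragments is not shown, and `fragment_scorers` is a hypothetical list of per-fragment scoring functions over pool5 features:

```python
import numpy as np

def bag_of_fragments(frame_pool5, fragment_scorers):
    # frame_pool5: (n_frames, d) pool5 features of one video;
    # one max-pooled value per discovered fragment model.
    return np.array([max(scorer(f) for f in frame_pool5)
                     for scorer in fragment_scorers])
```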

ITI-CERTH[20]

0-example[21]



Given the textual description of the event class, the framework first identifies N words or phrases that most closely relate to the event class; this word set is called the Event Language Model (ELM). In parallel, for each of the N_c concepts of the concept pool, it similarly identifies M words or phrases; this set is called the Concept Language Model (CLM) of the corresponding concept.

Subsequently, for each word in the ELM and each word in each of the CLMs, the Explicit Semantic Analysis (ESA) distance between them is calculated. For each CLM, the resulting N × M distance matrix expresses the relation between the given event class and the corresponding concept. To compute a single score expressing this relation, different operators are applied to this matrix, such as various matrix norms or distance measures. Consequently, a score is computed for each pair of ELM and CLM. The N_c considered concepts are ordered by these scores (in descending order), and the K top concepts along with their scores constitute the event detector. To perform event detection in a video collection, this event detector is compared with the output scores of concept detectors applied to each video, using different similarity measures; the final output is a ranked list of the most relevant videos. Alternatively, multiple event detectors can be generated using different algorithm variations in each of the above steps, and these can be used as pseudo-positive samples for training an SVM, which is then applied to the videos to generate a ranked list of those depicting the target event.
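A sketch of the ELM/CLM scoring idea, with word-embedding cosine similarity (gensim `KeyedVectors` model `w2v`) standing in for the ESA relatedness measure and a Frobenius norm as the matrix operator:

```python
import numpy as np

def elm_clm_score(elm_words, clm_words, w2v, op=np.linalg.norm):
    # N x M relation matrix between the event language model and one concept language model.
    M = np.zeros((len(elm_words), len(clm_words)))
    for i, e in enumerate(elm_words):
        for j, c in enumerate(clm_words):
            if e in w2v and c in w2v:
                M[i, j] = w2v.similarity(e, c)
    return op(M)       # e.g. Frobenius norm as the single relation score

# Concepts are then ranked by this score (descending) and the top K form the event detector.
```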

10,100-example

motion features: IDT (HOG, HOF, MBH) -> FV (GMM vocabulary)

local descriptors (SIFT, OpponentSIFT, RGB-SIFT, RGB-SURF): VLAD encoding, averaged over all keyframes of the video; the 4 feature vectors are then concatenated.

DCNN-based features (fc7, fc8), averaged over all keyframe representations.

label:

+1: positive

-1: negative

0: “near miss”: the video does not exactly fulfill the requirements to be characterized as a true positive example, but is nevertheless closely related to the corresponding event class.

The feature vectors derived for each visual modality (static local, motion, model vectors) are concatenated.

1. KSDA (kernel subclass discriminant analysis, for dimensionality reduction) + LSVM

2. Relevance Degree SVM (handles “near-miss” video samples as weighted negative or weighted positive ones, using the automatic weighting selection scheme in [21])

CMU[22]

0-EX

SEMANTIC QUERY GENERATION

event kit description -> a set of multimodal system queries

First, since the semantic vocabulary is usually limited, how should the out-of-vocabulary words in the event-kit description be handled?

Second, given a query word, how should its modality and the weight associated with that modality be determined?

For the first challenge, they use WordNet similarity [23], pointwise mutual information on Wikipedia, and word2vec [24] to generate a preliminary mapping from the event-kit description to the concepts in their vocabulary. The mapping is then examined by human experts to produce the final system query. The second challenge is tackled with prior knowledge provided by human experts.

Event Search component

The event search component retrieves multiple ranked lists for a given system query. Their system incorporates various retrieval methods such as the Vector Space Model, tf-idf, BM25, and language models; they found that different retrieval algorithms work well for different modalities.

After retrieving the ranked lists for all modalities, they apply a normalized fusion to combine the different ranked lists according to the weights specified in the SQG.

PRF component

The PRF component refines the retrieved ranked lists by reranking videos. Their system incorporates MMPRF [25] and SPaR [26] for reranking, where MMPRF assigns the starting values and SPaR is the core reranking algorithm.

10/100EX

low-level features -> FV/Bag-of-words -> high-level feature -> SVMs+Linear regression models

fusion: multistage hybrid late fusion method[27]

References

[1] BCMI-SJTU participation to TRECVID 2015: SED and MED

[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[3] Xu Z, Yang Y, Hauptmann A G. A discriminative CNN video representation for event detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1798-1807.

[4] Zha S, Luisier F, Andrews W, et al. Exploiting image-trained cnn architectures for unconstrained video classification[J]. arXiv preprint arXiv:1503.04144, 2015.

[5] Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[M]//Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010: 143-156.

[6] Jégou H, Perronnin F, Douze M, et al. Aggregating local image descriptors into compact codes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(9): 1704-1716.


[7] NTT-Fudan Team @ TRECVID 2015: Multimedia Event Detection

[8] Wang H, Schmid C. Action recognition with improved trajectories[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 3551-3558.

[9] Fudan at TRECVID 2015: Adaptive Feature Fusion For Multimedia Event Detection in Videos

[10] Wu Z, Jiang Y G, Wang X, et al. Fusing Multi-Stream Deep Networks for Video Classification[J]. arXiv preprint arXiv:1509.06086, 2015.

[11]

[12] Chen J, Cui Y, Ye G, et al. Event-driven semantic concept discovery by exploiting weakly tagged internet images[C]//Proceedings of International Conference on Multimedia Retrieval. ACM, 2014: 1. (The method in this paper of collecting large numbers of Internet images to extract tags is quite interesting!)

[13]

[14] Smith R. An overview of the Tesseract OCR engine[C]//ICDAR. IEEE, 2007: 629-633.

[15]

[16] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.

[17] Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.

[18] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.

[19] Mettes P, van Gemert J C, Cappallo S, et al. Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting[C]//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015: 427-434.

[20]

[21] Tzelepis C, Galanopoulos D, Mezaris V, et al. Learning to detect video events from zero or very few video examples[J]. Image and Vision Computing, 2015.

[22]

[23] “WordNet Similarity for Java,” https://code.google.com/p/ws4j/.

[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NIPS, 2013.

[25] L. Jiang, T. Mitamura, S.-I. Yu and A. Hauptmann, “Zero-Example Event Search using MultiModal Pseudo Relevance Feedback,” in ICMR, 2014.

[26] L. Jiang, D. Meng, T. Mitamura and A. Hauptmann, “Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search,” in ACM MM, 2014.

[27] Z.-z. Lan, L. Bao, S.-I. Yu, W. Liu and A. G. Hauptmann, “Multimedia classification and event detection using double fusion,” in Multimedia Tools and Applications, 2013.