[MED]Summary of Methods used in MED task(TRECVID 2015)
2016-04-24 17:53
Sixteen teams participated in the TRECVID MED task last year, but I focus only on the methods of the 8 teams that produced better results than our team.
BCMI-SJTU[1]:
Given a video, we first extract key frames (detect the video shot boundaries and use the middle frame to represent each shot). A CNN (VGG-16[2]) is then used to extract frame-level features. Input: 10 patches, extracted at 10 pre-defined regions of each key frame and resized to 224x224. Output: the last convolutional layer (conv5), using LCD[3] to reduce dimensions[4]. Next, the frame-level features are pooled (VLAD, Fisher Vector[5,6]) into a video-level feature, and finally an SVM classifier is used to score each video.
NTT-Fudan[7]
feature
traditional features:
IDT(MBH, HOG, HOF)[8]
Along with the densely extracted trajectories, three features are computed: HOG, HOF and MBH. The dimension of the HOG, HOF and MBH descriptors is first reduced by a factor of two using Principal Component Analysis (PCA). The three features are then each quantized using the FV representation with a vocabulary size of 256.
MFCC
MFCCs are first computed over each 32 ms time window (with 16 ms overlap) of the soundtrack, and all the descriptors are then quantized into a single BoW feature representation.
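The windowing above can be sketched as follows. This only shows how 32 ms windows with 16 ms overlap (so a 16 ms hop) map to sample indices; the 16 kHz sampling rate and the function name `frame_bounds` are assumptions for illustration, and the MFCC computation itself is not shown.

```python
def frame_bounds(num_samples, sample_rate, window_ms=32, hop_ms=16):
    """Return (start, end) sample indices of every full analysis window."""
    window = sample_rate * window_ms // 1000  # samples per window
    hop = sample_rate * hop_ms // 1000        # samples between window starts
    if num_samples < window:
        return []
    count = 1 + (num_samples - window) // hop
    return [(i * hop, i * hop + window) for i in range(count)]


# One second of audio at an assumed 16 kHz sampling rate.
bounds = frame_bounds(num_samples=16000, sample_rate=16000)
print(len(bounds), bounds[0], bounds[1])
```

Each window would yield one MFCC descriptor, and the descriptors of a whole soundtrack would then be quantized into one BoW histogram.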
Deep Features
VGG19
They fine-tune the VGG19 model using the full ImageNet dataset to increase the generalization ability of the model.
Given a video clip, they extract the outputs of the two fully-connected layers (VGG19-fc6, VGG19-fc7) for each frame, then average the frame-level features into video-level representations.
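A minimal sketch of the frame-to-video averaging step, assuming each frame feature is a plain list of floats of equal length. The name `average_pool` and the toy 3-d vectors are illustrative; real fc6/fc7 features would be 4096-d.

```python
def average_pool(frame_features):
    """Average a list of equal-length frame vectors into one video vector."""
    if not frame_features:
        raise ValueError("need at least one frame feature")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same length")
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Two toy frame-level features standing in for fc6 or fc7 outputs.
frames = [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]]
video_feature = average_pool(frames)
print(video_feature)
```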
LSTM
They directly adopt an LSTM model trained on another video dataset (UCF-101) and use the average output over all time steps of the last LSTM layer as the feature (512-d).
Concept Feature(VGG19 CNN model):
C-20K
For each key frame in a given video, they obtain a 20,574-d concept score vector with the trained VGG19 model; the frame-level scores are then averaged to generate a video-level concept feature vector.
C-233
A video-level concept representation is obtained by average pooling the scores of all frames.
classification
Linear SVMs
Concatenate the appearance features and the motion features into one long vector.
$\chi^2$ SVMs
The MFCC audio feature, the deep features and the concept scores are each mapped with a $\chi^2$ kernel separately and then used to train independent classifiers.
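For reference, one common form of the $\chi^2$ kernel is the exponential variant $k(x, y) = \exp(-\gamma \sum_i (x_i - y_i)^2 / (x_i + y_i))$. The sketch below assumes that form and non-negative features; the notes do not say which variant or which $\gamma$ the team used.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel for non-negative feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dist = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both vectors
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * dist)


same = chi2_kernel([0.5, 0.5], [0.5, 0.5])
disjoint = chi2_kernel([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

A precomputed kernel matrix built this way could then be passed to any SVM implementation that accepts precomputed kernels.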
fusion
For each video, the output scores of the multiple classifiers are fused to compute the final prediction. (How to?)
FUDAN[9]
The feature part and the classification part are exactly the same as in NTT-Fudan above.
Fusion
They use adaptive multi-stream fusion[10] to learn the optimal fusion weights adaptively for each class. The learning process is regularized by automatically estimated class relationships: for the $m$-th stream they compute a correlation matrix $V_m$ of the classes from the corresponding prediction scores, where each entry $V_{ij}$ indicates the percentage of samples with ground-truth label $i$ that are wrongly classified into class $j$.
NII-HITACHI-UIT[11]
Preprocessing
NII-UIT
video:
size: 320x240
frame rate: below 50 fps -> keep; above 50 fps -> 25 fps
keyframe: extract one keyframe from the resized video every two seconds
audio
extracted from the original videos
Hitachi
audio
downsampled to 16000Hz
training and validation: both the channels(if dual channel) of audio
testing: right channel(if dual channel)
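The NII-UIT video rules above can be sketched as two small helpers. Keeping a source rate of exactly 50 fps and starting the keyframes at t = 0 are assumptions, since the notes do not cover either case; the function names are illustrative.

```python
def target_fps(source_fps, threshold=50.0, reduced_fps=25.0):
    """Keep the source frame rate unless it exceeds the threshold."""
    return reduced_fps if source_fps > threshold else source_fps


def keyframe_times(duration_s, interval_s=2.0):
    """Timestamps (in seconds) of one keyframe per interval, from t = 0."""
    times = []
    index = 0
    while index * interval_s < duration_s:
        times.append(index * interval_s)
        index += 1
    return times


print(target_fps(30.0), target_fps(60.0))
print(keyframe_times(7.0))
```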
Feature Extraction
NII-UIT
image: SIFT with Hessian Laplace detector
motion: Improved Dense Trajectory with HOGHOF, MBH
audio: MFCC
deep features
Hitachi
audio: MFCC (LSTM classifier, feed-forward neural network classifier)
Feature Representation
NII-UIT
GMM + PCA + Fisher Vector
Hitachi
Bag-of-Words (k-means)
Event Detection
NII-UIT
SVM
Hitachi
traditional feed-forward neural network, with the event detector trained using the scaled conjugate gradient method
LSTM layer with 10 time steps
Fusion Methods
the late fusion approach -> all features are combined with equal weights
zero-shot event detection
Use concept detection, Bag-of-Words with tf-idf weights, a concept expansion strategy and cosine similarity.
First, detect concepts at sampled frames of the video and aggregate them by averaging. Second, use bag-of-words (the dictionary is obtained from the English Wikipedia corpus) to represent the event description. Next, the concept expansion strategy[12] is used to add the 10 most similar concepts obtained from a word2vec model to expand this category, so that the video is represented by a bag-of-words vector. Finally, the cosine similarity is calculated between each video and the concept-event representation.
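The final matching step can be sketched as a cosine similarity between a video's averaged concept scores and the event's concept vector. The 3-concept vocabulary, the video ids and the vector values below are made up for illustration; concept detection, tf-idf weighting and word2vec expansion are assumed to have already produced these vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_videos(video_concepts, event_vector):
    """Sort (video id, similarity) pairs from most to least similar."""
    scores = {vid: cosine(vec, event_vector) for vid, vec in video_concepts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical concept vocabulary: ["dog", "ball", "kitchen"].
event_vector = [1.0, 1.0, 0.0]
videos = {"v1": [0.9, 0.8, 0.1], "v2": [0.1, 0.0, 0.9]}
ranking = rank_videos(videos, event_vector)
print(ranking)
```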
VIREO-TNO[13]
features
visual
keyframe-level: one frame per two seconds
shot-level: the time duration is set to be five seconds
6 high-level concept features (ImageNet_1000, SIN_346, RC_497, Places_205, FCVID_239, Sports_487 <- all DCNN) & 1 low-level motion feature (IDT-HOG,HOF,MBH-PCA,FV -> SVM)
textual
Use Tesseract OCR[14] to extract text from key frames (roughly one per second), then check whether there is meaningful text using 3 rules. The results are fed into Lucene for indexing and text search. A manually defined Boolean query based on the event description and Wikipedia is used in combination with the term frequency to retrieve the positive videos.
speech
A manually defined Boolean query in combination with a PhraseQuery is used to search for relevant speech information. The standard Dirichlet Language Model in Lucene is used to calculate the similarity between this manually defined query and the audio file of a video, and this similarity is used to rank the videos.
frameworks for every subtask
0-example
automatic SQG: (a large concept bank + a strategy for wisely picking the right concepts for each event) When a concept detector for the whole event already exists, it is generally wise to include only concepts that are distinctive to this event (they simply ignore other concepts if a detector for the whole event is found for automatic SQG); if no detector for the whole event is found, it is beneficial to use a few accurate concepts to represent the event.
manual SQG: (based on the results of automatic SQG)
match (direct match, indirect match, top3) similarity (used as weight) between the event name and each of the concept classifier labels.
fusion SQG+ASR+OCR (to revisit later)
10-example
visual classifier
1. concept-bank features are concatenated into one feature vector, and an event classifier is then trained using a Chi-Square SVM
2.IDT feature+linear SVM[8]
fusion:
visual-system: Concept-bank+IDT ->average fusion
visual-zero system: visual-system+0-ex ->average fusion
visual-zero system+OCR/ASR -> (same fusion as in 0-ex)
100-example
fusion: (visual-system) using joint probability; only videos that receive a low score from both classifiers are put at the bottom of the list.
MediaMill[15]
They mainly use Google's Inception network[16], trained on a large personalized selection of ImageNet concepts[17]
0-example
They employ a semantic embedding space to translate video-level concept probabilities into event-specific concepts
probabilities: computed by averaging frame-level scores from the probability layer of the deep network
event-specific concepts: the top-ranking terms from the event kit, based on tf-idf
embedding space: word2vec model[18]
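As an illustration of picking the top-ranking event-kit terms by tf-idf, here is a minimal sketch. The background documents, the tokenization and the plain `tf * log(N / df)` weighting are assumptions; the notes do not specify which tf-idf variant or reference corpus was used.

```python
import math
from collections import Counter


def top_tfidf_terms(event_kit_tokens, background_docs, k=3):
    """Return the k event-kit terms with the highest tf-idf weight."""
    docs = [set(d) for d in background_docs] + [set(event_kit_tokens)]
    n_docs = len(docs)
    tf = Counter(event_kit_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)  # df >= 1: the kit itself counts
        scores[term] = count * math.log(n_docs / df)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy event kit and background corpus (hypothetical).
event_kit = ["dog", "show", "dog", "the", "ring"]
background = [["the", "cat"], ["the", "show"]]
top_terms = top_tfidf_terms(event_kit, background, k=2)
print(top_terms)
```

The selected terms would then be embedded (for example with word2vec) and compared against the embedded concept names.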
10-example
deep learning features: frame representations are extracted twice per second at both the pool5 layer and the probability layer. For both layers, the features are averaged per video and then normalized. A histogram intersection kernel SVM model is trained on the representations from both layers.
modality1: IDT(MBH,HOG)-FV-linear SVM
modality2: MFCC-FV-histogram intersection kernel SVM model
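The histogram intersection kernel used by these SVMs is simply the sum of element-wise minima of two non-negative vectors. The sketch below pairs it with L1 normalization, which is an assumption: the notes only say the features are "normalized".

```python
def l1_normalize(v):
    """Scale a vector so its absolute values sum to 1 (no-op for all zeros)."""
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total else list(v)


def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(min(a, b) for a, b in zip(x, y))


video_a = l1_normalize([1.0, 3.0])
video_b = l1_normalize([2.0, 2.0])
similarity = histogram_intersection(video_a, video_b)
print(video_a, video_b, similarity)
```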
100-example
same as 10-example + bag-of-fragments model[19] (re-uses the pool5 layer for the frame representations)
For each event, the most discriminative video fragments are discovered from the hundred training examples, and these fragments are max-pooled over a video to obtain the fragment-based video representation.
ITI-CERTH[20]
0-example[21]
Given the textual description of the event class, our framework first identifies N words or phrases that most closely relate to the event class; we call this word-set the Event Language Model (ELM). In parallel, for each of the $N_c$ concepts of our concept pool, it similarly identifies M words or phrases; we call this set the Concept Language Model (CLM) of the corresponding concept.
Subsequently, for each word in ELM and each word in each one of CLMs we calculate the Explicit Semantic Analysis (ESA) distance [34] between them. For each CLM, the resulting N × M distance matrix expresses the relation between the given event class and the corresponding concepts. In order to compute a single score expressing this relation, we apply to this matrix different operators, such as various matrix norms or distance measures. Consequently, a score is computed for each pair of ELM and CLM. The $N_c$ considered concepts are ordered according to these scores (in descending order) and the K top concepts along with their scores constitute our event detector. In order to perform event detection in a video collection, we compare this event detector with the output scores of concept detectors applied on each video, using different similarity measures (Fig. 2b). Thus, the final output is a ranked list of the most relevant videos. Alternatively, multiple event detectors can be generated using more than one algorithm variation in each of the shaded blocks of Fig. 2a and b, and these can be used as pseudo-positive samples for training an SVM, which can then be applied to the videos so as to generate a ranked list of those depicting the target event (Fig. 2c).
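A rough sketch of the pipeline described above, under several assumptions: the N × M matrices hold relatedness values (higher means more related, consistent with the descending ordering), the matrix operator is the maximum entry, and the video comparison is a weighted sum. The quoted text mentions several operators and similarity measures without fixing one, so these are illustrative choices, as are all names and numbers.

```python
def concept_score(relatedness):
    """Collapse an N x M ELM-vs-CLM matrix into one score (max entry)."""
    return max(max(row) for row in relatedness)


def build_event_detector(relatedness_by_concept, k):
    """Keep the K concepts with the highest scores, with their scores."""
    scored = [(name, concept_score(m)) for name, m in relatedness_by_concept.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def score_video(detector, video_concept_scores):
    """Weighted sum of the video's concept detector outputs."""
    return sum(w * video_concept_scores.get(name, 0.0) for name, w in detector)


# Hypothetical 2 x 2 relatedness matrices for three concepts.
relatedness = {
    "dog": [[0.9, 0.2], [0.4, 0.1]],
    "car": [[0.1, 0.3], [0.2, 0.0]],
    "leash": [[0.6, 0.5], [0.3, 0.2]],
}
detector = build_event_detector(relatedness, k=2)
video_score = score_video(detector, {"dog": 1.0, "car": 0.5, "leash": 0.5})
print(detector, video_score)
```

Scoring every video this way and sorting by score would give the ranked list mentioned in the text.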
10,100-example
motion features: IDT(HOG,HOF,MBH)-FV-GMM
local descriptors (SIFT, OpponentSIFT, RGB-SIFT, RGB-SURF): VLAD; average over all keyframes of the video; the 4 feature vectors are then concatenated.
DCNN-based features: fc7, fc8; average over all keyframe representations.
label:
+1: positive
-1: negative
0 : “near miss”. the video does not exactly fulfill the requirements to be characterized as true positive example, but nevertheless is closely related to the corresponding event class.
concatenate the feature vectors derived for each visual modality(static local, motion, model vectors)
1. KSDA (kernel subclass discriminant analysis, for dimensionality reduction) + LSVM
2. Relevance Degree SVM (handles “near-miss” video samples as weighted negative or weighted positive ones, using the automatic weighting selection scheme in [21])
CMU[22]
0-EX
SEMANTIC QUERY GENERATION
event kit description -> a set of multimodal system queries
First, since the semantic vocabulary is usually limited, how to address the out-of-vocabulary issue in the event-kit description.
Second, given a query word, how to determine its modality as well as the weight associated with that modality.
For the first challenge, we use WordNet similarity[23], Point-wise Mutual Information on Wikipedia, and word2vec[23,24] to generate a preliminary mapping from the event-kit description to the concepts in our vocabulary. The mapping is then examined by human experts to determine the final system query. The second challenge is tackled by prior knowledge provided by human experts.
Event Search component
retrieves multiple ranked lists for a given system query. Our system incorporates various retrieval methods such as Vector Space Model, tf-idf, BM25, language model, etc. We found that different retrieval algorithms are good at different modalities.
After retrieving the ranked lists for all modalities, we apply a normalized fusion to fuse the different ranked lists according to the weights specified in SQG.
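One way to read "normalized fusion" is to rescale each modality's scores to a common range and then take a weighted sum. The sketch below assumes min-max normalization; the modality names, weights and scores are hypothetical, and the notes do not state which normalization CMU actually used.

```python
def min_max(scores):
    """Rescale a {video id: score} dict to [0, 1]; constant lists map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {vid: 0.0 for vid in scores}
    return {vid: (s - lo) / (hi - lo) for vid, s in scores.items()}


def fuse(ranked_lists, weights):
    """Weighted sum of per-modality normalized scores, best first."""
    fused = {}
    for modality, scores in ranked_lists.items():
        for vid, s in min_max(scores).items():
            fused[vid] = fused.get(vid, 0.0) + weights[modality] * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-modality retrieval scores and SQG weights.
lists = {
    "visual": {"v1": 10.0, "v2": 5.0, "v3": 0.0},
    "asr": {"v1": 0.2, "v2": 0.8, "v3": 0.2},
}
fused_ranking = fuse(lists, {"visual": 0.7, "asr": 0.3})
print(fused_ranking)
```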
PRF component
refines the retrieved ranked lists by reranking videos. Our system incorporates MMPRF [25] and SPaR [26] to conduct the reranking, in which MMPRF is used to assign the starting values, and SPaR is used as the core reranking algorithm.
10/100EX
low-level features -> FV/Bag-of-words -> high-level feature -> SVMs+Linear regression models
fusion: multistage hybrid late fusion method[27]
Reference
[1] BCMI-SJTU participation to TRECVID 2015: SED and MED
[2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[3] Xu Z, Yang Y, Hauptmann A G. A discriminative CNN video representation for event detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1798-1807.
[4] Zha S, Luisier F, Andrews W, et al. Exploiting image-trained cnn architectures for unconstrained video classification[J]. arXiv preprint arXiv:1503.04144, 2015.
[5] Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[M]//Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010: 143-156.
[6] Jégou H, Perronnin F, Douze M, et al. Aggregating local image descriptors into compact codes[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2012, 34(9): 1704-1716.
[7] NTT-Fudan Team@TRECVID 2015: Multimedia Event Detection
[8] Wang H, Schmid C. Action recognition with improved trajectories[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 3551-3558.
[9] Fudan at TRECVID 2015: Adaptive Feature Fusion For Multimedia Event Detection in Videos
[10] Wu Z, Jiang Y G, Wang X, et al. Fusing Multi-Stream Deep Networks for Video Classification[J]. arXiv preprint arXiv:1509.06086, 2015.
[11]
[12] Chen J, Cui Y, Ye G, et al. Event-driven semantic concept discovery by exploiting weakly tagged internet images[C]//Proceedings of International Conference on Multimedia Retrieval. ACM, 2014: 1. (The method in this paper of collecting large numbers of images from the internet to extract tags is quite interesting!)
[13]
[14] Smith R. An overview of the Tesseract OCR engine[C]//icdar. IEEE, 2007: 629-633.
[15]
[16] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.
[17] Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[18] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.
[19] Mettes P, van Gemert J C, Cappallo S, et al. Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting[C]//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015: 427-434.
[20]
[21] Tzelepis C, Galanopoulos D, Mezaris V, et al. Learning to detect video events from zero or very few video examples[J]. Image and Vision Computing, 2015.
[22]
[23] “WordNet Similarity for Java, https://code.google.com/p/ws4j/”.
[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NIPS, 2013.
[25] L. Jiang, T. Mitamura, S.-I. Yu and A. Hauptmann, “Zero-Example Event Search using MultiModal Pseudo Relevance Feedback,” in ICMR, 2014.
[26] L. Jiang, D. Meng, T. Mitamura and A. Hauptmann, “Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search,” in ACM MM, 2014.
[27] Z.-z. Lan, L. Bao, S.-I. Yu, W. Liu and A. G. Hauptmann, “Multimedia classification and event detection using double fusion,” in Multimedia Tools and Applications, 2013.