Presented herein are systems, methods, and datasets for automatically and precisely generating highlight or summary videos of content. For example, in one or more embodiments, videos of sporting events may be digested or condensed into highlights, which can dramatically benefit sports media, broadcasters, commentators, and other short-video creators in terms of cost reduction, fast and mass production, and savings in tedious engineering hours. Embodiments of the framework may also be used, or adapted for use, to better promote sports teams, players, and/or games, and to produce stories that glorify the spirit of sports and its players. While presented in the context of sports, it shall be noted that the methodology may be used for videos comprising other content and events.
Generating Highlight Video From Video And Text Inputs
Inventors: - Sunnyvale CA, US; Le KANG - Dublin CA, US; Zhiyu CHENG - Sunnyvale CA, US; Hao TIAN - Cupertino CA, US; Daming LU - Dublin CA, US; Dapeng LI - Los Altos CA, US; Jingya XUN - San Jose CA, US; Jeff WANG - San Jose CA, US; Xi CHEN - San Jose CA, US; Xing LI - Santa Clara CA, US
Presented herein are systems, methods, and datasets for automatically and precisely generating highlight or summary videos of content. In one or more embodiments, the inputs comprise a text (e.g., an article) describing the key event(s) (e.g., a goal, a player action, etc.) in an activity (e.g., a game, a concert, etc.) and a video or videos of the activity. In one or more embodiments, the output is a short video of an event or events described in the text, in which the video may include commentary about the highlighted events and/or other audio (e.g., music), which may also be automatically synthesized.
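For illustration only, the following is a minimal sketch of the input/output interface described above; the function and field names are hypothetical assumptions and are not specified in the filing.

```python
# Hypothetical interface sketch for the text-plus-video highlight pipeline.
# Names and fields are illustrative assumptions, not part of the filing.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HighlightRequest:
    article_text: str                       # text describing the key event(s), e.g., a game report
    video_paths: List[str]                  # one or more videos of the activity
    synthesize_commentary: bool = True      # optionally add commentary about the highlighted events
    background_music: Optional[str] = None  # optional audio track to mix in


@dataclass
class HighlightResult:
    output_path: str                                               # generated short highlight video
    event_timestamps: List[float] = field(default_factory=list)   # located events (seconds into source)


def generate_highlight(request: HighlightRequest) -> HighlightResult:
    """Placeholder for the end-to-end pipeline: parse events from the text,
    locate them in the video(s), cut and assemble clips, and optionally
    synthesize commentary and/or mix in music."""
    raise NotImplementedError
```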
Inventors: - Sunnyvale CA, US; Le KANG - Dublin CA, US; Xin ZHOU - Mountain View CA, US; Hao TIAN - Cupertino CA, US; Xing LI - Santa Clara CA, US; Bo HE - Sunnyvale CA, US; Jingyu XIN - Tucson AZ, US
Assignee:
Baidu USA LLC - Sunnyvale CA
International Classification:
G06V 20/40 G06N 3/08 G06V 10/42
Abstract:
With rapidly evolving technologies and emerging tools, the volume of sports-related videos generated online is growing quickly. To automate the sports video editing/highlight generation process, a key task is to precisely recognize and locate events of interest in videos. Embodiments herein comprise a two-stage paradigm to detect the categories of events and when these events happen in videos. In one or more embodiments, multiple action recognition models extract high-level semantic features, and a transformer-based temporal detection module locates the target events. These approaches achieved state-of-the-art performance in both action spotting and replay grounding. While presented in the context of sports, it shall be noted that the systems and methods herein may be used for videos comprising other content and events.
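For illustration, a minimal sketch of the two-stage paradigm is given below, assuming a PyTorch implementation; the layer sizes, class count, and module names are illustrative assumptions rather than details from the abstract. Stage one produces per-snippet semantic features from one or more action recognition backbones, and stage two applies a transformer-based temporal detection head that classifies each snippet and regresses a within-snippet temporal offset.

```python
# Illustrative sketch (PyTorch) of the two-stage paradigm; sizes and names are assumptions.
import torch
import torch.nn as nn


class TemporalEventDetector(nn.Module):
    """Stage 2: transformer encoder over precomputed per-snippet features,
    predicting an event class and a within-snippet temporal offset."""

    def __init__(self, feat_dim: int = 2048, d_model: int = 512,
                 num_layers: int = 4, num_classes: int = 17):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.cls_head = nn.Linear(d_model, num_classes + 1)  # +1 for a background class (assumed)
        self.loc_head = nn.Linear(d_model, 1)                # temporal offset within the snippet

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, feat_dim) -- stage-1 features from one or more
        # action recognition backbones, fused per snippet.
        x = self.encoder(self.proj(feats))
        return self.cls_head(x), self.loc_head(x).squeeze(-1)


# Example: a batch of 2 clips, each with 60 one-second snippets of 2048-d stage-1 features.
feats = torch.randn(2, 60, 2048)
class_logits, offsets = TemporalEventDetector()(feats)
print(class_logits.shape, offsets.shape)  # torch.Size([2, 60, 18]) torch.Size([2, 60])
```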