https://pytorch.org/vision/0.11/auto_examples/plot_video_api.html

This example illustrates some of the APIs that torchvision offers for videos, together with examples of how to build datasets and more.

1. Introduction: building a new video object and examining the properties

First we select a video to test the object out. For the sake of argument we're using one from the kinetics400 dataset. To create it, we need to define the path and the stream we want to use.


import torch
import torchvision
from torchvision.datasets.utils import download_url

# Download the sample video
download_url(
    "https://github.com/pytorch/vision/blob/main/test/assets/videos/WUzgd7C1pWA.mp4?raw=true",
    ".",
    "WUzgd7C1pWA.mp4"
)
video_path = "./WUzgd7C1pWA.mp4"

Out:

Downloading https://raw.githubusercontent.com/pytorch/vision/main/test/assets/videos/WUzgd7C1pWA.mp4 to ./WUzgd7C1pWA.mp4

100.0%

Streams are defined in a similar fashion as torch devices. We encode them as strings in the form stream_type:stream_id, where stream_type is a string and stream_id is a long int. The constructor also accepts passing a stream_type alone, in which case the stream is auto-discovered. First, let's get the metadata for our particular video:

stream = "video"
video = torchvision.io.VideoReader(video_path, stream)
video.get_metadata()

Out:

{'video': {'duration': [10.9109], 'fps': [29.97002997002997]}, 'audio': {'duration': [10.9], 'framerate': [48000.0]}, 'subtitles': {'duration': []}, 'cc': {'duration': []}}

Here we can see that the video has two streams - a video and an audio stream. Currently available stream types include ['video', 'audio']. Each descriptor consists of two parts: the stream type (e.g. 'video') and a unique stream id (determined by the video encoding). In this way, if the video container contains multiple streams of the same type, users can access the one they want. If only the stream type is passed, the decoder auto-detects the first stream of that type and returns it.
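As an illustration of the descriptor format described above (a plain-Python sketch, not part of the torchvision API), a stream string can be split into its type and optional id like this:

```python
# Illustrative helper (not part of torchvision): split a stream
# descriptor of the form "stream_type:stream_id" into its parts.
def parse_stream(spec):
    stream_type, _, stream_id = spec.partition(":")
    # When only the type is given, the id is left to auto-discovery.
    return stream_type, int(stream_id) if stream_id else None

print(parse_stream("video:0"))  # ('video', 0)
print(parse_stream("audio"))    # ('audio', None)
```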

Let's read all the frames from the audio stream. By default, the return value of next(video_reader) is a dict containing the following fields:

data: a tensor holding the frame content, and pts: the presentation timestamp of the frame in seconds (float).

metadata = video.get_metadata()
video.set_current_stream("audio")

frames = []  # we are going to save the frames here
ptss = []  # pts is the presentation timestamp (in seconds, float) of each frame
for frame in video:
    frames.append(frame['data'])
    ptss.append(frame['pts'])

print("PTS for first five frames ", ptss[:5])
print("Total number of frames: ", len(frames))
approx_nf = metadata['audio']['duration'][0] * metadata['audio']['framerate'][0]
print("Approx total number of datapoints we can expect: ", approx_nf)
print("Read data size: ", frames[0].size(0) * len(frames))

Out:

PTS for first five frames  [0.0, 0.021332999999999998, 0.042667, 0.064, 0.08533299999999999]
Total number of frames:  511
Approx total number of datapoints we can expect:  523200.0
Read data size:  523264
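The small mismatch between the approximate and actual counts comes from audio frames each carrying a fixed number of samples. A quick back-of-the-envelope check (plain arithmetic, no torchvision needed) reproduces the numbers above:

```python
# Metadata values reported above for the audio stream.
duration = 10.9
framerate = 48000.0

# Expected number of audio samples from the metadata.
approx_nf = duration * framerate
print(approx_nf)  # 523200.0

# 511 frames were read; dividing the read size by the frame count
# (523264 / 511) shows each audio frame held 1024 samples, so the
# actual read size slightly overshoots the metadata estimate.
num_frames = 511
samples_per_frame = 523264 // num_frames
read_size = num_frames * samples_per_frame
print(samples_per_frame)  # 1024
print(read_size)          # 523264
```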