2019-05-14 22:26:17 -04:00

160 lines
5.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "Automating Youtube Archival"
date: 2019-05-11
draft: false
tags: ["youtube-dl", "automation"]
author: "simon987"
---
Google has been known to terminate entire Youtube channels without
notice in the hope of staying advertiser-friendly. **58 million "problematic" videos
were deleted from the platform in Q3 2018** <sup>[1](#sources)</sup>.
This week we are exploring various Youtube archival solutions that utilizes
[youtube-dl](https://github.com/ytdl-org/youtube-dl/),
a command-line program that extracts and download videos from web pages.
{{< figure src="/ytdl/1.png" title="Channels removed, by removal reason">}}
# Installation
Install youtube-dl via pip to ensure that you have the latest version. Google often
pushes updates that breaks youtube-dl so you always want to stay up to date. You
might be able to install it via your distibutions package manager but its often several
versions behind.
```
pip install --upgrade youtube-dl
youtube-dl --version
```
# Basic usage
You can now use the youtube-dl command to download a single video, a playlist or a channel:
```
youtube-dl https://www.youtube.com/watch?v=XXXXXXXXXX
youtube-dl https://www.youtube.com/channel/XXXXXXXXXX
```
You can use the same command to download videos from various websites.Youtube-dl
is also a Python library that you can use in scripts:
{{<highlight python "linenos=table">}}
#!/usr/bin/python
import youtube_dl
yt_dl = youtube_dl.YoutubeDL()
yt_dl.download("https://www.youtube.com/watch?v=XXXXXXXXXXXX")
{{</highlight>}}
## Command line arguments & Scripting
This document is not a replacement for youtube-dls documentation, you can find the
updated list of command-line arguments on its [Github page](https://github.com/ytdl-org/youtube-dl/).
Below is a bash script that will download everything specified in `list.txt`, a text file
with a youtube channel or video on each line. The script will save the URLs of the videos
in archive.txt as it downloads them to speedup the subsequent executions.
The `--write-info-json` and `--write-thumbnail` options ensures that we also download
the video metadata such as the description and the title.
{{<highlight Bash "linenos=table">}}
youtube-dl -a list.txt -o '%(uploader)s/%(title)s-%(id)s.%(ext)s' \
--write-thumbnail\
-f "bestvideo[ext=webm]+bestaudio[ext=webm]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/bestvideo+bestaudio/best"\
--write-info-json\
--geo-bypass\
--ignore-errors\
--download-archive archive.txt
{{</highlight>}}
You can decide to run the script periodically or to schedule it with cron:
```
crontab -e
```
{{<highlight Bash>}}
# Download new videos every 2 hours
0 */2 * * *
/mnt/Archive/Youtube/archive_yt.sh
{{</highlight>}}
## Live streams
While a cron job downloads all videos uploaded by a channel (if the uploader does
not delete the video between executions), it does not handle live streams. Youtube-dl
allows you to download live streams with the same command but you obviously have to
start the execution during the stream.
Below is a different approach that takes advantage of the Youtube email notification
feature. This simple Python script reads your last 3 emails and searches for a Youtube
link in the email body. It will immediately start downloading the video using the youtube-dl
Python library. You can use this method to download uploaded videos as well as live
streams. If you are using a Gmail account, you will need to generate an [App Password](https://security.google.com/settings/security/apppasswords)
to allow the script to login.
{{<highlight python "linenos=table">}}
#!/usr/bin/python
import imaplib
import re
import youtube_dl
# Initialize the youtube-dl downloader, nooverwrites param will
# skip videos that are already downloaded or
# currently being downloaded
yt_dl = youtube_dl.YoutubeDL(params={
"nooverwrites": True,
"nopart": True,
"format": "bestvideo[ext=webm]+bestaudio[ext=webm]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/bestvideo+bestaudio/best"
})
# This regex pattern matches Youtube video links
YT_LINK = re.compile("Fv%3D([^%]*)%")
mail = imaplib.IMAP4_SSL("imap.gmail.com")
mail.login("username@gmail.com", "password")
mail.list()
mail.select("INBOX")
# Fetch the last 3 emails
_, data = mail.search(None, "ALL")
last_emails = list(reversed(data[0].split()))[:3]
for num in last_emails:
_, data = mail.fetch(num, "(RFC822)")
body = data[0][1].decode()
# Check for pattern match
match = YT_LINK.search(body)
if match:
url = "https://youtube.com/watch?v=" + match.group(1)
yt_dl.download([url, ]) # immediately download
{{</highlight>}}
## Automatically upload to rclone remote
To take advantage of cloud storage, you can setup [ytdlrc](http://github.com/bardisty/ytdlrc) to automatically move
videos to an rclone remote as they are downloaded. This simple script is completely interchangeable with
youtube-dl and can be set up on a machine with low disk space.
The script uses your existing youtube-dl and rclone configuration and is ideal for
setting up automatic Youtube archival on a cheap VPS.
## Archiving Metadata
If you wish to save a videos metadata without downloading the actual video, there are command-line utilities dedicated to this task.
* [Youtube-MA](https://github.com/CorentinB/YouTube-MA)
* [yt-mango](https://github.com/terorie/yt-mango)
----
# Sources
* https://transparencyreport.google.com/youtube-policy/removals