using bash and CSS selectors for web-scraping
I do a lot of web scraping.
I also love bash.
I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".
I've been trying to find some nice tools for querying HTML on the terminal, and I've found html-xml-utils, a set of small programs that let you pipe HTML into hxselect and use CSS selectors to find content.
Here are some annotated examples with this website:
setup
# install packages
sudo apt install html-xml-utils jq
# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to variable, normalised, otherwise hxselect complains about invalid HTML
# might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)
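If curl or hxnormalize fails, site will usually end up empty; a quick sanity check (a sketch, nothing here is specific to this site):

# check the variable is non-empty and peek at the start of the normalised HTML
[ -n "${site}" ] || echo "site is empty - check the URL or the normalise step"
echo "${site}" | head -n 5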
hxselect
hxselect is the main tool here. Run man hxselect or visit https://manpages.org/hxselect to read about it. You provide a CSS selector like main > p .class, and you can provide -c to output only the content of the match, and -s <sep> to put <sep> between matches when there are several (usually a newline \n).
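As a tiny illustration of -c and -s before touching the real page (the snippet here is made up):

# a hand-written snippet, normalised then queried
echo '<html><body><ul><li>one</li><li>two</li></ul></body></html>' \
  | hxnormalize -x \
  | hxselect -s '\n' -c 'li'
# should print "one" and "two" on separate lines;
# without -c you would get the whole <li> elements instead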
Getting a single thing
# get the site header
$ echo "${site}" | hxselect -c 'header h1'
notes by alifeee <img alt="profile picture" src="/notes/og-image.png"></img> <a class="source" href="/notes/feed.xml">rss</a>
# get the links of all images on the page
$ echo "${site}" | hxselect -s '\n' -c 'img::attr(src)'
/notes/og-image.png
# get a list of all URLs on the page
$ echo "${site}" | hxselect -s '\n' -c 'a::attr(href)' | head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting
# get a specific element, here the site description
$ echo "${site}" | hxselect -c 'header p:nth-child(3)'
here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.
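These combine, too: you can select a specific element and then one of its attributes. Going by the header markup shown above, something like this should give the rss feed URL:

# get the href of the "source" link inside the header h1
$ echo "${site}" | hxselect -c 'header h1 a.source::attr(href)'
/notes/feed.xml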
Getting multiple things
# save all matches to a variable results
# first, use sed to replace any existing @ (you can pick any character)
# so that every @ in the output is definitely our separator,
# then match our selector and separate the matches with @
results=$(
curl -s "${url}" \
| hxnormalize -x -l 240 \
| sed 's/@/(at)/g' \
| hxselect -s '@' 'main article'
)
# this separates the variable results into an array "a", using the separator @
mapfile -d@ a <<< "${results}"; unset 'a[-1]'
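# (without -t, mapfile keeps the trailing "@" on each element; the unset drops
#  the final element, which is just whatever is left over after the last separator)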
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
# select the h2 element, strip the leading spaces and the trailing links
title=$(
echo "${item}" \
| hxselect -c 'h2' \
| sed 's/^ *//' \
| sed 's/ *<a.*//g'
)
# select all "a" elements inside ".tags" and separate with a comma, then remove final comma
tags=$(
echo "${item}" \
| hxselect -s ', ' -c '.tags a' \
| sed 's/, $//'
)
# separate the tags by ",", and print them in between square brackets so it looks like json
tags_json=$(
echo "${tags}" \
| awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}'
)
# select the href of the 3rd "a" element inside the "h2" element
url=$(
echo "${item}" \
| hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)'
)
# put it all together so it looks like JSON
# must be very wary of all the quotes - pretty horrible tbh
echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done
The output of the above looks like
{"title":"converting HTML entities to 'normal' UTF-8 in bash","tags":["bash","html"],"url":"#updating-a-file-in-a-github-repository-with-a-workflow"}
{"title":"updating a file in a GitHub repository with a workflow","tags":["github-actions","github","scripting"],"url":"/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}
{"title":"you should do Advent of Code using bash","tags":["advent-of-code","bash"],"url":"/notes/you-should-do-advent-of-code-using-bash/#note"}
{"title":"linting markdown from inside Obsidian","tags":["obsidian","scripting","markdown"],"url":"/notes/linting-markdown-from-inside-obsidian/#note"}
{"title":"installing nvm globally so automated scripts can use node and npm","tags":["node","scripting"],"url":"/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}
{"title":"copying all the files that are ignored by .gitignore in multiple projects","tags":["bash","git"],"url":"/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}
{"title":"cron jobs are hard to make","tags":["cron"],"url":"/notes/cron-jobs-are-hard-to-make/#note"}
{"title":"ActivityPub posts and the ACCEPT header","tags":["ActivityPub"],"url":"/notes/activity-pub-posts-and-the-accept-header/#note"}
Other useful commands
A couple of things that come up a lot when you do this kind of web scraping (combined in a short sketch after the list) are:
- removing leading/trailing spaces - pipe into | sed 's/^ *//' | sed 's/ *$//'
- removing new lines - pipe into | tr -d '\n'
- replacing new lines with spaces - pipe into | tr '\n' ' '
- decoding HTML entities - pipe into php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }' or replace them manually with sed (see https://blog.alifeee.co.uk/notes/converting-html-entities-to-normal-utf-8-in-bash/#note)
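For example, to get the site description from earlier as one clean line of text, you could chain a few of these together (a sketch reusing the selector from above; the last step needs php installed):

echo "${site}" \
  | hxselect -c 'header p:nth-child(3)' \
  | tr '\n' ' ' \
  | sed 's/^ *//' | sed 's/ *$//' \
  | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'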