I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".
I've been trying to find some nice tools for querying HTML on the terminal, and I found html-xml-utils, which lets you pipe HTML into hxselect and use CSS selectors to find content.
Here are some annotated examples with this website:
setup
# install packages
sudo apt install html-xml-utils jq
# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to variable, normalised, otherwise hxselect complains about invalid HTML
# might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)
hxselect
hxselect is the main tool here. Run man hxselect or visit https://manpages.org/hxselect to read about it. You provide a CSS selector like main > p .class, and you can provide -c to output only the content of the match, and -s <sep> to put <sep> in between each match if there are multiple matches (usually a newline \n).
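For a quick sketch of those two flags, here is a made-up HTML snippet (not this website) piped through hxnormalize and hxselect; note that hxselect prints the separator after every match, including the last, which is why later examples strip the trailing separator with sed:
$ echo '<ul><li>one</li><li>two</li></ul>' | hxnormalize -x | hxselect -s ', ' -c 'li'
one, two,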
Getting a single thing
# get the site header
$ echo "${site}" | hxselect -c 'header h1'
notes by alifeee <img alt="profile picture" src="/notes/og-image.png"></img> <a class="source" href="/notes/feed.xml">rss</a>
# get the links of all images on the page
$ echo "${site}" | hxselect -s '\n' -c 'img::attr(src)'
/notes/og-image.png
# get a list of all URLs on the page
$ echo "${site}" | hxselect -s '\n' -c 'a::attr(href)' | head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting
# get a specific element, here the site description
$ echo "${site}" | hxselect -c 'header p:nth-child(3)'
here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.
Getting multiple things
# save all matches to a variable results
# first with sed we replace all "@" with "(at)" (you can choose any character to be the separator)
# then we match our selector, separated by @
# we remove "@" first so we know they are definitely our separators
results=$(
curl -s "${url}" \
| hxnormalize -x -l 240 \
| sed 's/@/(at)/g' \
| hxselect -s '@' 'main article'
)
# this separates the variable results into an array "a", using the separator @
mapfile -d@ a <<< "${results}"; unset 'a[-1]'
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
# select the h2 element, remove leading spaces and the trailing links
title=$(
echo "${item}" \
| hxselect -c 'h2' \
| sed 's/^ *//' \
| sed 's/ *<a.*//g'
)
# select all "a" elements inside ".tags" and separate with a comma, then remove final comma
tags=$(
echo "${item}" \
| hxselect -s ', ' -c '.tags a' \
| sed 's/, $//'
)
# separate the tags by ",", and print them in between square brackets so it looks like json
tags_json=$(
echo "${tags}" \
| awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}'
)
# select the href of the 3rd "a" element inside the "h2" (the one with class "source")
url=$(
echo "${item}" \
| hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)'
)
# put it all together so it looks like JSON
# must be very wary of all the quotes - pretty horrible tbh
echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done
The output of the above looks like
{"title":"converting HTML entities to 'normal' UTF-8 in bash","tags":["bash","html"],"url":"#updating-a-file-in-a-github-repository-with-a-workflow"}
{"title":"updating a file in a GitHub repository with a workflow","tags":["github-actions","github","scripting"],"url":"/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}
{"title":"you should do Advent of Code using bash","tags":["advent-of-code","bash"],"url":"/notes/you-should-do-advent-of-code-using-bash/#note"}
{"title":"linting markdown from inside Obsidian","tags":["obsidian","scripting","markdown"],"url":"/notes/linting-markdown-from-inside-obsidian/#note"}
{"title":"installing nvm globally so automated scripts can use node and npm","tags":["node","scripting"],"url":"/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}
{"title":"copying all the files that are ignored by .gitignore in multiple projects","tags":["bash","git"],"url":"/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}
{"title":"cron jobs are hard to make","tags":["cron"],"url":"/notes/cron-jobs-are-hard-to-make/#note"}
{"title":"ActivityPub posts and the ACCEPT header","tags":["ActivityPub"],"url":"/notes/activity-pub-posts-and-the-accept-header/#note"}
Other useful commands
A couple of things that come up a lot when you do this kind of web scraping are:
removing leading/trailing spaces - pipe into | sed 's/^ *//' | sed 's/ *$//'
removing new lines - pipe into | tr -d '\n'
replacing new lines with spaces - pipe into | tr '\n' ' '
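For example, to tidy up some scraped text in one pipe (a trivial sketch):
$ printf '  hello  \n  world  \n' | sed 's/^ *//' | sed 's/ *$//' | tr '\n' ' '
hello world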
Naturally, you usually want to turn HTML entities from scraped pages back into normal characters, usually UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as file.txt:
Children&#039;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
I wanted to find a way to convert the entities (i.e., decode the encoded ', &, and º on the first three lines, but NOT touch the already-decoded ’, –, non-breaking space, and bare & on the last four) with a single command I could put in a bash pipe. I tried several contenders:
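For example, one contender that works on a file like this is Python's html module, called from inside a bash pipe (just a sketch of one approach, not necessarily the best one):
# html.unescape decodes the entities; already-decoded characters and the bare "&" pass through untouched
cat file.txt | python3 -c 'import sys, html; sys.stdout.write(html.unescape(sys.stdin.read()))'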
People often attempt Advent of Code in a programming language which is new to them, so they can have a nice time learning how to use it. I think this language should be "the terminal".
I learnt a lot about sed and awk, and more general things about bash like pipes, redirection, stdout and stderr, and reading from and editing files.
Ever since, I've learnt more and more about the terminal and become more comfortable there, to the point where I'm happy writing long loops and pipelines and processing a lot of data purely on the terminal, like some of the maps I've been making recently.
My solutions are mostly in bash, and they mostly run quite quickly. I stopped after day 12 because things got too confusing and too "dynamic programming", and my method of programming ("not having done a computer science degree") doesn't do well with recursion and complicated stuff.
I used a lot of awk, which I found very fun, and now I probably use awk almost every day.
For fun, here is how many times each of awk's string functions appears in my solutions:
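A sketch of how counts like that could be made, assuming the solutions live in a folder like ./solutions/ (the folder name is made up):
# find every use of an awk string function, then count and sort the occurrences
grep -rohE 'gsub|sub|substr|split|match|index|sprintf|length|tolower|toupper' solutions/ | sort | uniq -c | sort -rn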
I clone all git projects into the same folder (~/git on my personal Linux computers, and something like Documents/GitHub or Documents/AGitFolder on Windows).
I want to move my Windows laptop to Linux, but I don't want to keep the Windows hard drive. The laptop has only one drive slot, and I can't be bothered to take it apart and salvage the drive - I'd rather just wipe the whole drive. However, I'd like to keep some things to move to Linux, like my SSH keys, and any secrets or passwords I use in apps.
One source of secrets is the many files ignored by git, like .env or secrets.py. Thankfully, these are usually excluded from git tracking by writing them in .gitignore files. So... I should be able to find all the files by searching for .gitignore files in my git folder!
Here is a command I crafted to do that (I truncated the output to only show a few examples):
while read file; do # this runs the loop for each input line of AA below
  folder=$(echo "${file}" | sed 's/[^\/]*$//') # turn "./folder/.gitignore" into "./folder/" by stripping all content after the final "/"
  # output the contents of the .gitignore file,
  # skip all lines which start with "#", are blank lines, or contain "*",
  # then select only lines that end with ".something" (i.e., are files not directories) and print them;
  # some files are "file.txt", some are "/file.txt", so remove leading slashes, then print folder and file
  cat "${file}" \
    | awk '/^#/ {next} /^ *$/ {next} /\*/ {next} /.*\..+[^\/]$/ {print}' \
    | awk '{sub(/^\//, ""); printf "'"${folder}"'%s\n", $0}' \
    > /tmp/files13131.txt # in order to check each file exists we need another loop, so put the list of files in a temporary file for now
  while read f; do # run the loop for each input line of BB below
    [ -f "${f}" ] && echo "${f}" # before the && is a test to see if the file "${f}" exists; it is echo'd only if the file does exist
  done < /tmp/files13131.txt # this is BB, "<" sends the contents of the temporary file to the loop
done <<< $( # this is AA, "<<< $(" sends the output of the following command to the loop
  find . -maxdepth 4 -name ".gitignore" # this finds all files named .gitignore in the current directory, to a max depth of 4 directories
)
If we send the output of the command to files.txt (or copy and paste it in there), then we can copy all these files to a folder _git, preserving the folder structure, with:
rootf="_git"; while read file; do relfile=$(echo "${file}" | sed 's/^\.\///'); relfolder=$(echo "${relfile}" | sed 's/[^\/]*$//'); mkdir -p "${rootf}/${relfolder}"; cp "${file}" "${rootf}/${relfile}"; done < files.txt
and explained:
rootf="_git"; # set a variable to the folder to copy to so we can change it easily
while read file; do # loop over each line of the input file given after "done" (files.txt) and name the loop variable "file"
# turn "./folder/sub/file.txt" into "folder/sub/file.txt" by replacing an initial "./" with nothing
relfile=$(echo "${file}" | sed 's/^\.\///');
# turn "folder/sub/file.txt" into "folder/sub/" by replacing all content after the final "/" with nothing
relfolder=$(echo "${relfile}" | sed 's/[^\/]*$//');
# create directories for the relative folder e.g., "folder/sub", "-p" means create all required sub-directories
mkdir -p "${rootf}/${relfolder}";
# copy the file to the folder we just created
cp "${file}" "${rootf}/${relfile}";
done < files.txt # use files.txt as the input to the loop
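As an aside, if you have GNU coreutils, cp --parents can recreate the directory structure by itself, which shortens the loop (a sketch, with the same files.txt):
mkdir -p _git
# --parents appends the full source path under the target directory, creating folders as needed
while read file; do cp --parents "${file}" "_git/"; done < files.txt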
Now, I can put the files on a memory stick, and I have a very sensitive memory stick that I shouldn't lose. I can put the memory stick into Linux when I boot it, so I don't lose any of my secrets.
To be honest, I don't think many of them were really important; losing them would probably just mean going searching for API keys again whenever I wanted to edit a project.