I create a lot of websites. I'm a big fan of very performant, very accessible websites, and of the phrase "the important thing is first for a webpage to exist, and after that for it to look nice".
With that in mind, this is the HTML template I usually start with when making a new webpage. It's designed for manually replacing the things in curly brackets, but has the bonus that you could also use it with a templating language like https://handlebarsjs.com/ (my favourite).
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- browser metas -->
  <title>{{title}}</title>
  <meta name="description" content="{{description}}" />
  <!-- allow unicode characters -->
  <meta charset="utf-8" />
  <!-- 'zoom' on mobile -->
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <!-- embed metas - https://ogp.me/ - visible when sharing on social media -->
  <meta property="og:title" content="{{title}}" />
  <meta property="og:type" content="website" />
  <meta property="og:site_name" content="{{title}}" />
  <meta property="og:url" content="{{base_url}}/{{page}}" />
  <meta property="og:image" content="{{base_url}}/{{image}}" />
  <meta property="og:description" content="{{description}}" />
  <meta property="og:locale" content="en_GB" />
  <!-- styling -->
  <!-- favicon - can be any image (.png, .jpg, .ico) -->
  <link rel="icon" type="image/png" href="/og-image.png" />
</head>
<body>
  <header></header>
  <main>
    {{{content}}}
  </main>
  <footer></footer>
</body>
</html>
```
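Since the template is meant for manual replacement, you can even fill the placeholders from the shell. Here's a sketch with `sed` (the filenames and values are my own inventions, and note that `sed` would choke if a value contained a `/` or `&`):

```bash
# a tiny stand-in for the template above (hypothetical filename template.html)
printf '<title>{{title}}</title>\n<meta name="description" content="{{description}}" />\n' > template.html

# fill in the curly-bracket placeholders with sed
title="my cool website"
description="short notes about programming"
sed -e "s/{{title}}/${title}/g" \
    -e "s/{{description}}/${description}/g" \
    template.html > index.html

cat index.html
```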
Put it somewhere, and put something in there! Make a personal website! Make a blog! I will love you forever.
I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".
I've been trying to find some nice tools for querying HTML on the terminal, and I found html-xml-utils, which lets you pipe HTML into hxselect and use CSS selectors to find content.
Here are some annotated examples with this website:
## setup
```bash
# install packages
sudo apt install html-xml-utils jq

# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to a variable, normalised, otherwise hxselect complains about invalid HTML
# might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)
```
## hxselect
hxselect is the main tool here. Run `man hxselect` or visit https://manpages.org/hxselect to read about it. You provide a CSS selector like `main > p .class`; you can provide `-c` to output only the content of the match, and `-s <sep>` to put `<sep>` in between each match if there are multiple matches (usually a newline, `\n`).
## Getting a single thing
```bash
# get the site header
$ echo "${site}" | hxselect -c 'header h1'
notes by alifeee <img alt="profile picture" src="/notes/og-image.png"></img> <a class="source" href="/notes/feed.xml">rss</a>

# get the links of all images on the page
$ echo "${site}" | hxselect -s '\n' -c 'img::attr(src)'
/notes/og-image.png

# get a list of all URLs on the page
$ echo "${site}" | hxselect -s '\n' -c 'a::attr(href)' | head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting

# get a specific element, here the site description
$ echo "${site}" | hxselect -c 'header p:nth-child(3)'
here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.
```
## Getting multiple things
```bash
# save all matches to a variable "results"
# first, with sed, we replace all @ (choose any character)
# then we match our selector, separated by @
# we remove "@" first so we know they are definitely our separators
results=$(curl -s "${url}" \
  | hxnormalize -x -l 240 \
  | sed 's/@/(at)/g' \
  | hxselect -s '@' 'main article')
# this separates the variable "results" into an array "a", using the separator @
mapfile -d '@' a <<< "${results}"; unset 'a[-1]'
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
  # select all h2 elements, remove spaces before and links afterwards
  title=$(echo "${item}" \
    | hxselect -c 'h2' \
    | sed 's/^ *//' \
    | sed 's/ *<a.*//g')
  # select all "a" elements inside ".tags" and separate with a comma, then remove the final comma
  tags=$(echo "${item}" \
    | hxselect -s ', ' -c '.tags a' \
    | sed 's/, $//')
  # separate the tags by ",", and print them in between square brackets so it looks like JSON
  tags_json=$(echo "${tags}" \
    | awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}')
  # select the href of the 3rd "a" element inside the "h2" element
  url=$(echo "${item}" \
    | hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)')
  # put it all together so it looks like JSON
  # must be very wary of all the quotes - pretty horrible tbh
  echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done
```
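A small addition of my own: if you would rather end up with one JSON array than a stream of separate objects, jq's slurp flag (`-s`) collects everything on its input into an array, so you could pipe the whole loop into it. Simulated here with two hand-written objects:

```bash
# jq -s ("slurp") reads a whole stream of JSON values and wraps them in an array
printf '%s\n' '{"title": "a"}' '{"title": "b"}' | jq -s '.'
```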
The output of the above looks like
```json
{"title": "converting HTML entities to 'normal' UTF-8 in bash", "tags": ["bash", "html"], "url": "#updating-a-file-in-a-github-repository-with-a-workflow"}
{"title": "updating a file in a GitHub repository with a workflow", "tags": ["github-actions", "github", "scripting"], "url": "/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}
{"title": "you should do Advent of Code using bash", "tags": ["advent-of-code", "bash"], "url": "/notes/you-should-do-advent-of-code-using-bash/#note"}
{"title": "linting markdown from inside Obsidian", "tags": ["obsidian", "scripting", "markdown"], "url": "/notes/linting-markdown-from-inside-obsidian/#note"}
{"title": "installing nvm globally so automated scripts can use node and npm", "tags": ["node", "scripting"], "url": "/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}
{"title": "copying all the files that are ignored by .gitignore in multiple projects", "tags": ["bash", "git"], "url": "/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}
{"title": "cron jobs are hard to make", "tags": ["cron"], "url": "/notes/cron-jobs-are-hard-to-make/#note"}
{"title": "ActivityPub posts and the ACCEPT header", "tags": ["ActivityPub"], "url": "/notes/activity-pub-posts-and-the-accept-header/#note"}
```
## Other useful commands
A few things that come up a lot when you do this kind of web scraping are:

- removing leading/trailing spaces: pipe into `| sed 's/^ *//' | sed 's/ *$//'`
- removing new lines: pipe into `| tr -d '\n'`
- replacing new lines with spaces: pipe into `| tr '\n' ' '`
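For example (with a toy input of my own), chaining the first and third of these cleans up a scraped fragment in one pipe:

```bash
# a scraped fragment with leading/trailing spaces and a stray newline
printf '   hello\nworld   ' \
  | sed 's/^ *//' | sed 's/ *$//' \
  | tr '\n' ' '
# prints "hello world"
```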
When you scrape HTML like this, you often end up with HTML entities (like `&amp;`) instead of characters, and naturally you usually want to turn them back into characters, usually UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as `file.txt`:
```text
Children&#39;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
```
I wanted to find a way to convert the entities (i.e., decode `&#39;`, `&amp;`, and `&#186;`, but NOT touch the already-literal `’`, `–`, non-breaking space, and bare `&`) with a single command I could put in a bash pipe. I tried several contenders; the one I settled on was a PHP one-liner.
The only thing I don't know how to do now is make a bash function or alias so that I could write

```bash
cat file.txt | decodeHTML
```

instead of the massive `php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'`.
## Python
(2025-02-10 edit) I have also found a nice way to do this using the Python `html` library's `escape` and `unescape` (because that's what I had installed in my workflow, and I couldn't be bothered to install PHP).
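The edit above doesn't show the command itself, but a sketch of what it might look like (my reconstruction, not necessarily the exact invocation), which also makes the `decodeHTML` function wished for earlier easy:

```bash
# decode HTML entities from stdin with Python's html.unescape
# (decodes &#39;, &amp;, etc., but leaves already-literal characters like ’ and & alone)
decodeHTML() {
  python3 -c 'import html, sys; sys.stdout.write(html.unescape(sys.stdin.read()))'
}

echo 'Wildlife &amp; Nature' | decodeHTML
# prints "Wildlife & Nature"
```

Defined in `~/.bashrc`, this would give exactly the `cat file.txt | decodeHTML` pipe from the earlier note.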