notes by alifeee, tagged html (3)


here I may post some short, text-only notes, mostly about programming. source code.


basic HTML template with all the gubbins

tags: html • 235 'words', 71 secs @ 200wpm

I create a lot of websites. I'm a big fan of very performant, very accessible websites, and also of the phrase "the important thing is first for a webpage to exist, and only then to look nice".

With that in mind, this is the HTML template I usually start with when making a new webpage. It's meant for manually replacing the things in curly brackets, but it has the bonus that you could also use it with a templating language like https://handlebarsjs.com/ (my favourite).

<!DOCTYPE html>
<html lang="en">

<head>
    <!-- browser metas -->
    <title>{{title}}</title>
    <meta name="description" content="{{description}}" />
    <!-- allow unicode characters -->
    <meta charset="utf-8" />
    <!-- 'zoom' on mobile -->
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />

    <!-- embed metas - https://ogp.me/ - visible when sharing on social media -->
    <meta property="og:title" content="{{title}}" />
    <meta property="og:type" content="website" />
    <meta property="og:site_name" content="{{title}}" />
    <meta property="og:url" content="{{base_url}}/{{page}}" />
    <meta property="og:image" content="{{base_url}}/{{image}}" />
    <meta property="og:description" content="{{description}}" />
    <meta property="og:locale" content="en_GB" />

    <!-- styling -->
    <!-- favicon - can be any image (.png, .jpg, .ico) -->
    <link rel="icon" type="image/png" href="/og-image.png" />
</head>

<body>
    <header></header>
    <main>
        {{{content}}}
    </main>
    <footer></footer>
</body>

</html>

Put it somewhere, and put something in there! Make a personal website! Make a blog! I will love you forever.
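If you don't want a full templating language, the curly-bracket placeholders can also be filled in with a quick sed pipe. A minimal sketch (the file names and values here are my own examples, not part of the template):

```shell
# hypothetical mini-template using the same {{placeholder}} style
cat > template.html <<'EOF'
<title>{{title}}</title>
<meta name="description" content="{{description}}" />
EOF

# fill in the placeholders with sed; | is used as the delimiter so
# values containing / (like URLs) don't break the expression
sed -e 's|{{title}}|My Site|g' \
    -e 's|{{description}}|A small personal website|g' \
    template.html > index.html

cat index.html
```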


using bash and CSS selectors for web-scraping

tags: bash, web-scraping, html • 885 'words', 266 secs @ 200wpm

I do a lot of web scraping.

I also love bash.

I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".

I've been trying to find nice tools for querying HTML in the terminal, and I've found some in html-xml-utils: they let you pipe HTML into hxselect and use CSS selectors to find content.

Here are some annotated examples with this website:

setup

# install packages
sudo apt install html-xml-utils jq
# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to variable, normalised, otherwise hxselect complains about invalid HTML
#   might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)

hxselect

hxselect is the main tool here. Run man hxselect or visit https://manpages.org/hxselect to read about it. You provide a CSS selector like main > p .class, and you can provide -c to output only the content of the match, and -s <sep> to put <sep> in between each match if there are multiple matches (usually a newline \n).

Getting a single thing

# get the site header
$ echo "${site}" | hxselect -c 'header h1'
 notes by alifeee <img alt="profile picture" src="/notes/og-image.png"></img> <a class="source" href="/notes/feed.xml">rss</a>

# get the links of all images on the page
$ echo "${site}" | hxselect -s '\n' -c 'img::attr(src)'
/notes/og-image.png

# get a list of all URLs on the page
$ echo "${site}" | hxselect -s '\n' -c 'a::attr(href)' | head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting

# get a specific element, here the site description
$ echo "${site}" | hxselect -c 'header p:nth-child(3)'
 here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.

Getting multiple things

# save all matches to a variable "results"
#   first, sed replaces any existing @ with "(at)" (@ could be any character)
#   then hxselect outputs each match, separated by @
#   replacing "@" first means the separators are definitely ours
results=$(
  curl -s "${url}" \
    | hxnormalize -x -l 240 \
    | sed 's/@/(at)/g' \
    | hxselect -s '@' 'main article'
  )
# this separates the variable results into an array "a", using the separator @
mapfile -d@ a <<< "${results}"; unset 'a[-1]'
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
  # select all h2 elements, remove spaces before and links afterwards
  title=$(
    echo "${item}" \
      | hxselect -c 'h2' \
      | sed 's/^ *//' \
      | sed 's/ *<a.*//g'
  )
  # select all "a" elements inside ".tags" and separate with a comma, then remove final comma
  tags=$(
    echo "${item}" \
      | hxselect -s ', ' -c '.tags a' \
      | sed 's/, $//'
  )
  # separate the tags by ",", and print them in between square brackets so it looks like json
  tags_json=$(
    echo "${tags}" \
      | awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}'
  )
  # select the href of the 3rd "a" element inside the "h2" element
  url=$(
    echo "${item}" \
      | hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)'
  )
  # put it all together so it looks like JSON
  #   must be very wary of all the quotes - pretty horrible tbh
  echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done

The output of the above looks like this (compacted to one line per item):

{"title":"converting HTML entities to 'normal' UTF-8 in bash","tags":["bash","html"],"url":"#updating-a-file-in-a-github-repository-with-a-workflow"}
{"title":"updating a file in a GitHub repository with a workflow","tags":["github-actions","github","scripting"],"url":"/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}
{"title":"you should do Advent of Code using bash","tags":["advent-of-code","bash"],"url":"/notes/you-should-do-advent-of-code-using-bash/#note"}
{"title":"linting markdown from inside Obsidian","tags":["obsidian","scripting","markdown"],"url":"/notes/linting-markdown-from-inside-obsidian/#note"}
{"title":"installing nvm globally so automated scripts can use node and npm","tags":["node","scripting"],"url":"/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}
{"title":"copying all the files that are ignored by .gitignore in multiple projects","tags":["bash","git"],"url":"/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}
{"title":"cron jobs are hard to make","tags":["cron"],"url":"/notes/cron-jobs-are-hard-to-make/#note"}
{"title":"ActivityPub posts and the ACCEPT header","tags":["ActivityPub"],"url":"/notes/activity-pub-posts-and-the-accept-header/#note"}
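On the quote-wrangling in that final echo: if a scraped title ever contains a double quote, the hand-built JSON breaks. A safer sketch is to let jq build the object with --arg/--argjson, which escapes values itself (the variable values below are made-up examples standing in for what the loop extracts):

```shell
# example values standing in for what the loop extracts
title="converting HTML entities to 'normal' UTF-8 in bash"
tags_json='["bash", "html"]'
url="/notes/example/#note"

# jq escapes $title and $url itself; --argjson parses the tags array as JSON
jq -n --arg title "$title" --argjson tags "$tags_json" --arg url "$url" \
  '{title: $title, tags: $tags, url: $url}'
```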

Other useful commands

A couple of things that come up a lot when you do this kind of web scraping are trimming stray whitespace from matches (as with the sed calls above) and converting HTML entities back into normal characters; the next note covers the latter.


converting HTML entities to 'normal' UTF-8 in bash

tags: bash, html • 476 'words', 143 secs @ 200wpm

I do a fair amount of web scraping, and you come across a lot of HTML entities. They're the things that look like &gt;, &nbsp;, &amp;.

See https://developer.mozilla.org/en-US/docs/Glossary/Character_reference

Naturally, you usually want to turn them back into characters, usually UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as file.txt:

Children&#39;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

I wanted to find a way to convert the entities (i.e., decode &#39; &amp; &#186;, but NOT decode  (nbsp) &) with a single command I could put in a bash pipe. I tried several contenders:

perl

I'd used this one before in a script scraping an RSS feed: https://github.com/alifeee/openbenches-train-sign/blob/a29cc24df919c67809f84586f9e0a90aed6ea3cf/transformer/full.cgi#L49. On this input, however, it fails, as it doesn't correctly decode the symbol º (U+00BA : MASCULINE ORDINAL INDICATOR, from https://babelstone.co.uk/Unicode/whatisit.html).

$ cat file.txt | perl -MHTML::Entities -pe 'decode_entities($_);'
Children's event,
Wildlife & Nature,
peddler-market-n�-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

recode

I found recode after some searching, but it failed: it mangled characters that were already decoded, and swallowed the bare &.

$ sudo apt install recode
$ cat file.txt | recode html
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artistsâ Circle,
surface â Breaking
woodland walk. (nbsp)
Justin Adams  Mauro Durante

php

At first I tried htmlspecialchars_decode, which didn't work (it only decodes a few entities, like &amp; and &lt;), but then I found html_entity_decode, which does the job perfectly. Thanks, PHP.

$ cat file.txt | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

The only thing I don't know how to do now is to make a bash function or alias so that I could write

cat file.txt | decodeHTML

instead of the massive php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'.
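For what it's worth, a plain function in ~/.bashrc should do it (a sketch; decodeHTML is my own name for it, and the first variant assumes the PHP CLI is installed):

```shell
# put this in ~/.bashrc (or a sourced script); it reads stdin, so it
# slots into pipes like `cat file.txt | decodeHTML`
decodeHTML() {
  php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
}

# the same idea with a Python one-liner, if PHP isn't around
decodeHTMLpy() {
  python3 -c 'import sys; from html import unescape; print(unescape(sys.stdin.read()), end="")'
}
```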

Python

(2025-02-10 edit) I have also found a nice way to do this using unescape from the Python html library (because that's what I had installed in my workflow and I couldn't be bothered to install PHP):

$ cat file.txt | python3 -c 'import sys;from html import unescape;print(unescape(sys.stdin.read()),end="")'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante