However, further down, it lists how to access it, which says:
You can search for and view open records on our partner site Findmypast.co.uk (charges apply). A version of the 1939 Register is also available at Ancestry.co.uk (charges apply), and transcriptions without images are on MyHeritage.com (charges apply). It is free to search for these records, but there is a charge to view full transcriptions and download images of documents. Please note that you can view these records online free of charge in the reading rooms at The National Archives in Kew.
So… charges apply.
Anyway, for a while in April 2025 (until May 8th), FindMyPast is giving free access to the 1939 data.
Of course, family history is hard, and what's much easier is "who lived in my house in 1939". For that you can use:
I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".
I've been trying to find some nice tools for querying HTML on the terminal, and I've found some in html-xml-utils, which let you pipe HTML into hxselect and use CSS selectors to find content.
Here are some annotated examples with this website:
setup
# install packages
sudo apt install html-xml-utils jq
# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to variable, normalised, otherwise hxselect complains about invalid HTML
# might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)
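Before selecting anything, it's worth a quick sanity check that the fetch and normalise worked (using the variables from above):

# quick sanity check: peek at the first few lines of the normalised HTML
echo "${site}" | head -n 5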
hxselect
hxselect is the main tool here. Run man hxselect or visit https://manpages.org/hxselect to read about it. You give it a CSS selector like main > p .class; the -c flag outputs only the content of each match, and -s <sep> puts <sep> in between matches if there are several (usually a newline, \n).
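To get a feel for the flags before touching a real page, here's a minimal sketch using a made-up HTML snippet (hxselect reads from standard input):

# made-up snippet: two list items with the same class
echo '<ul><li class="a">one</li><li class="a">two</li></ul>' \
  | hxselect -s '\n' -c 'li.a'
# prints:
# one
# two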
Getting a single thing
# get the site header
$ echo"${site}"| hxselect -c'header h1'
notes by alifeee <img alt="profile picture"src="/notes/og-image.png"></img><a class="source"href="/notes/feed.xml">rss</a># get the links of all images on the page
$ echo"${site}"| hxselect -s'\n'-c'img::attr(src)'
/notes/og-image.png
# get a list of all URLs on the page
$ echo"${site}"| hxselect -s'\n'-c'a::attr(href)'|head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting
# get a specific element, here the site description
$ echo"${site}"| hxselect -c'header p:nth-child(3)'
here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.
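Note that -c keeps any tags nested inside the match (like the <a> in that description). If you just want the text, a crude but handy trick is stripping tags with sed (this assumes the text contains no literal "<"):

# strip any remaining tags from the match
$ echo "${site}" | hxselect -c 'header p:nth-child(3)' | sed 's/<[^>]*>//g'
here I may post some short, text-only notes, mostly about programming. source code.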
Getting multiple things
# save all matches to a variable results
# first with sed we replace all @ (choose any character)
# then we match our selector, separated by @
# we remove "@" first so we know they are definitely our separators
results=$(curl -s "${url}" \
  | hxnormalize -x -l 240 \
  | sed 's/@/(at)/g' \
  | hxselect -s '@' 'main article')
# this separates the variable results into an array "a", using the separator @
mapfile -d@ a <<< "${results}"; unset 'a[-1]'
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
  # select all h2 elements, remove spaces before and links afterwards
  title=$(echo "${item}" \
    | hxselect -c 'h2' \
    | sed 's/^ *//' \
    | sed 's/ *<a.*//g')
  # select all "a" elements inside ".tags" and separate with a comma, then remove final comma
  tags=$(echo "${item}" \
    | hxselect -s ', ' -c '.tags a' \
    | sed 's/, $//')
  # separate the tags by ",", and print them in between square brackets so it looks like json
  tags_json=$(echo "${tags}" \
    | awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}')
  # select the href of the 3rd "a" element inside the "h2" element
  url=$(echo "${item}" \
    | hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)')
  # put it all together so it looks like JSON
  # must be very wary of all the quotes - pretty horrible tbh
  echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done
The output of the above looks like
{"title":"converting HTML entities to 'normal' UTF-8 in bash","tags":["bash","html"],"url":"#updating-a-file-in-a-github-repository-with-a-workflow"}{"title":"updating a file in a GitHub repository with a workflow","tags":["github-actions","github","scripting"],"url":"/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}{"title":"you should do Advent of Code using bash","tags":["advent-of-code","bash"],"url":"/notes/you-should-do-advent-of-code-using-bash/#note"}{"title":"linting markdown from inside Obsidian","tags":["obsidian","scripting","markdown"],"url":"/notes/linting-markdown-from-inside-obsidian/#note"}{"title":"installing nvm globally so automated scripts can use node and npm","tags":["node","scripting"],"url":"/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}{"title":"copying all the files that are ignored by .gitignore in multiple projects","tags":["bash","git"],"url":"/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}{"title":"cron jobs are hard to make","tags":["cron"],"url":"/notes/cron-jobs-are-hard-to-make/#note"}{"title":"ActivityPub posts and the ACCEPT header","tags":["ActivityPub"],"url":"/notes/activity-pub-posts-and-the-accept-header/#note"}
Other useful commands
A couple of things that come up a lot when you do this kind of web scraping are:
removing leading/trailing spaces - pipe into | sed 's/^ *//' | sed 's/ *$//'
removing new lines - pipe into | tr -d '\n'
replacing new lines with spaces - pipe into | tr '\n' ' ' (combined with the trimming tip in the example below)
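As a toy example combining the last two tips, this squashes multi-line output into one trimmed line:

# squash two lines into one, then trim the trailing space left by the final newline
$ printf 'hello\nworld\n' | tr '\n' ' ' | sed 's/ *$//'
hello world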