notes by alifeee, tagged bash (4)


here I may post some short, text-only notes, mostly about programming.


using bash and CSS selectors for web-scraping

tags: bash, web-scraping • 885 'words', 266 secs @ 200wpm

I do a lot of web scraping.

I also love bash.

I have used things like Python's BeautifulSoup before, but it feels like overkill when most of the time I only want to "get the value of a specific item on a webpage".

I've been trying to find some nice tools for querying HTML on the terminal, and I've found them in html-xml-utils, which lets you pipe HTML into hxselect and use CSS selectors to find content.

Here are some annotated examples with this website:

setup

# install packages
sudo apt install html-xml-utils jq
# set URL
url="https://blog.alifeee.co.uk/notes/"
# save site to variable, normalised, otherwise hxselect complains about invalid HTML
#   might have to manually change the HTML so it normalises nicely
site=$(curl -s "${url}" | hxnormalize -x -l 240)

hxselect

hxselect is the main tool here. Run man hxselect or visit https://manpages.org/hxselect to read about it. You provide a CSS selector like main > p .class; the -c flag outputs only the content of each match, and -s <sep> puts <sep> between matches when there are several (usually a newline \n).

Getting a single thing

# get the site header
$ echo "${site}" | hxselect -c 'header h1'
 notes by alifeee <img alt="profile picture" src="/notes/og-image.png"></img> <a class="source" href="/notes/feed.xml">rss</a>

# get the links of all images on the page
$ echo "${site}" | hxselect -s '\n' -c 'img::attr(src)'
/notes/og-image.png

# get a list of all URLs on the page
$ echo "${site}" | hxselect -s '\n' -c 'a::attr(href)' | head
/notes/feed.xml
/notes/
https://blog.alifeee.co.uk
https://alifeee.co.uk
https://weeknotes.alifeee.co.uk
https://linktr.ee/alifeee
https://github.com/alifeee/blog/tree/main/notes
/notes/
/notes/tagged/bash
/notes/tagged/scripting

# get a specific element, here the site description
$ echo "${site}" | hxselect -c 'header p:nth-child(3)'
 here I may post some short, text-only notes, mostly about programming. <a href="https://github.com/alifeee/blog/tree/main/notes">source code</a>.

Getting multiple things

# save all matches to a variable "results"
#   first, with sed, we replace any existing "@" (you can choose any character)
#   then we match our selector, separating matches with "@"
#   replacing "@" first means any "@" remaining is definitely our separator
results=$(
  curl -s "${url}" \
    | hxnormalize -x -l 240 \
    | sed 's/@/(at)/g' \
    | hxselect -s '@' 'main article'
  )
# this separates the variable "results" into an array "a", using "@" as the separator
mapfile -d@ a <<< "${results}"; unset 'a[-1]'
# to test, run `item="${a[1]}"` and copy the commands inside the loop
for item in "${a[@]}"; do
  # select all h2 elements, remove spaces before and links afterwards
  title=$(
    echo "${item}" \
      | hxselect -c 'h2' \
      | sed 's/^ *//' \
      | sed 's/ *<a.*//g'
  )
  # select all "a" elements inside ".tags" and separate with a comma, then remove final comma
  tags=$(
    echo "${item}" \
      | hxselect -s ', ' -c '.tags a' \
      | sed 's/, $//'
  )
  # separate the tags by ",", and print them in between square brackets so it looks like json
  tags_json=$(
    echo "${tags}" \
      | awk -F', ' '{printf "["; fs=""; for (i=1; i<=NF; i++) {printf "%s\"%s\"", fs, $i; fs=", "}; printf "]"}'
  )
  # select the href of the 3rd "a" element inside the "h2" element
  url=$(
    echo "${item}" \
      | hxselect -c 'h2 a.source:nth-of-type(3)::attr(href)'
  )
  # put it all together so it looks like JSON
  #   must be very wary of all the quotes - pretty horrible tbh
  echo '{"title": "'"${title}"'", "tags": '"${tags_json}"', "url": "'"${url}"'"}' | jq
done

The output of the above looks like

{"title":"converting HTML entities to &#x27;normal&#x27; UTF-8 in bash","tags":["bash","html"],"url":"#updating-a-file-in-a-github-repository-with-a-workflow"}
{"title":"updating a file in a GitHub repository with a workflow","tags":["github-actions","github","scripting"],"url":"/notes/updating-a-file-in-a-git-hub-repository-with-a-workflow/#note"}
{"title":"you should do Advent of Code using bash","tags":["advent-of-code","bash"],"url":"/notes/you-should-do-advent-of-code-using-bash/#note"}
{"title":"linting markdown from inside Obsidian","tags":["obsidian","scripting","markdown"],"url":"/notes/linting-markdown-from-inside-obsidian/#note"}
{"title":"installing nvm globally so automated scripts can use node and npm","tags":["node","scripting"],"url":"/notes/installing-nvm-globally-so-automated-scripts-can-use-node-and-npm/#note"}
{"title":"copying all the files that are ignored by .gitignore in multiple projects","tags":["bash","git"],"url":"/notes/copying-all-the-files-that-are-ignored-by-gitignore-in-multiple-projects/#note"}
{"title":"cron jobs are hard to make","tags":["cron"],"url":"/notes/cron-jobs-are-hard-to-make/#note"}
{"title":"ActivityPub posts and the ACCEPT header","tags":["ActivityPub"],"url":"/notes/activity-pub-posts-and-the-accept-header/#note"}
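As an aside (not in the original note): if you want those objects combined into a single JSON array, the jq installed in the setup step can do it with its -s ("slurp") flag, which reads a whole stream of JSON values and wraps them in one array:

```shell
# jq -s ("slurp") collects every JSON value on stdin into a single array;
# the two printf'd objects here are stand-ins for the loop's real output
printf '%s\n' '{"title":"a"}' '{"title":"b"}' | jq -s '.'
```

Piping the loop's output through jq -s '.' would do the same for the real data.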

Other useful commands

A couple of things that come up a lot when you do this kind of web scraping are covered in the notes below.


converting HTML entities to 'normal' UTF-8 in bash

tags: bash, html • 384 'words', 115 secs @ 200wpm

I do a fair amount of web scraping, and come across a lot of HTML entities. They're the things that look like &gt;, &nbsp;, &amp;.

See https://developer.mozilla.org/en-US/docs/Glossary/Character_reference

Naturally, you usually want to turn them back into characters, usually UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as file.txt:

Children&#39;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

I wanted to find a way to convert the entities (i.e., decode &#39; &amp; &#186;, but NOT decode  (nbsp) &) with a single command I could put in a bash pipe. I tried several contenders:

perl

I'd used this one before in a script scraping an RSS feed (https://github.com/alifeee/openbenches-train-sign/blob/a29cc24df919c67809f84586f9e0a90aed6ea3cf/transformer/full.cgi#L49), but on this input it fails: it doesn't decode the symbol º (U+00BA MASCULINE ORDINAL INDICATOR, identified with https://babelstone.co.uk/Unicode/whatisit.html).

$ cat file.txt | perl -MHTML::Entities -pe 'decode_entities($_);'
Children's event,
Wildlife & Nature,
peddler-market-n�-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

recode

I found recode after some searching, but it failed: it tried to decode characters that were already decoded, mangling them.

$ sudo apt install recode
$ cat file.txt | recode html
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artistsâ Circle,
surface â Breaking
woodland walk. (nbsp)
Justin Adams  Mauro Durante

php

At first I used htmlspecialchars_decode, which didn't work, but then I found html_entity_decode, which does the job perfectly. Thanks, PHP.

$ cat file.txt | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

The only thing I don't know how to do now is to make a bash function or alias so that I could write

cat file.txt | decodeHTML

instead of the massive php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'.
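In case it helps, here is one way to get exactly that (a sketch, not from the note): a shell function rather than an alias, since functions also work inside scripts and pipes. Put it in ~/.bashrc and it behaves like any other command.

```shell
# a shell function wrapping the php one-liner from above;
# define it in ~/.bashrc so it is available in every interactive shell
decodeHTML() {
  php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
}

# usage: cat file.txt | decodeHTML
```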


you should do Advent of Code using bash

tags: advent-of-code, bash • 319 'words', 96 secs @ 200wpm

last year I did Advent of Code in bash.

you should do it too.

people often attempt it in a programming language which is new to them, so they can have a nice time learning how to use it. I think this language should be "the terminal".

I learnt a lot about sed and awk, and more general things about bash like pipes, redirection, stdout and stderr, and reading from and editing files.

Ever since, I've learnt more and more about the terminal and become more comfortable there; now I happily write loops and commands that process a lot of data purely on the terminal, like some of the maps I've been making recently.

My Advent of Code 2023 solutions are in this GitHub repository: https://github.com/alifeee/advent-of-code-2023

They're mostly in bash, and they mostly solve quite quickly. I stopped after day 12 because things got too confusing and "dynamic programming", and my method of programming ("not having done a computer science degree") doesn't do well with recursion and complicated stuff.

I used a lot of awk, which I found very fun, and now I probably use awk almost every day.

For fun, here is how many times each of awk's string functions appears in my solutions:

$ while read cmd; do printf "${cmd}: "; egrep --exclude=awkstrings.md -r "${cmd}" | wc -l; done <<< $(cat awkstrings.md | pcregrep -o1 '^- `([^\(`]*)')
asort: 0
gensub: 7
gsub: 10
index: 49
length: 54
match: 21
patsplit: 4
split: 24
sprintf: 1
strtonum: 0
sub: 47
substr: 20
tolower: 0
toupper: 0

awk and bash were probably terrible ways to solve complex problems, but they were certainly fun in the early days.

you should give it a go.


copying all the files that are ignored by .gitignore in multiple projects

tags: bash, git • 790 'words', 237 secs @ 200wpm

I clone all git projects into the same folder (~/git on my personal Linux computers, something like Documents/GitHub or Documents/AGitFolder on Windows).

I want to move my Windows laptop to Linux, but I don't want to keep the Windows hard drive. The laptop has only one drive slot, and I can't be bothered to take it apart and salvage the drive - I'd rather just wipe the whole drive. However, I'd like to keep some things to move to Linux, like my SSH keys, and any secrets or passwords I use in apps.

One source of secrets is the many files ignored by git, like .env or secrets.py. Thankfully, these are usually excluded from git tracking by writing them in .gitignore files. So... I should be able to find all the files by searching for .gitignore files in my git folder!

Here is a command I crafted to do that (I truncated the output to only show a few examples):

$ while read file; do folder=$(echo "${file}" | sed s'/[^\/]*$//'); cat "${file}" | awk '/^#/ {next} /^ *$/ {next} /\*/ {next} /.*\..+[^\/]$/ {print}' | awk '{sub(/^\//, ""); printf "'"${folder}"'%s\n", $0}' > /tmp/files13131.txt; while read f; do [ -f "${f}" ] && echo "${f}"; done < /tmp/files13131.txt; done <<< $(find . -maxdepth 4 -name ".gitignore")
./blog/.env
./gspread/tests/creds.json
./invoice_template/invoice.toml
./polycule-visualiser/polycule.json
./polycule-visualiser/_data/URIs.json
./summon2scale-scoreboard/scoreboard.sqlite
./website-differ/.env
./website-differ/sites.csv

(fyi those repositories are: blog, gspread, invoice_template, polycule-visualiser, summon2scale-scoreboard, website-differ)

For fun, I'll describe each bit of the command:

while read file; do # this runs the loop for each input line of AA below
  folder=$(echo "${file}" | sed s'/[^\/]*$//'); # turn "./folder/.gitignore" into "./folder" by stripping all content after final "/"
  cat "${file}" # output contents of file
    # skip all lines of .gitignores which start with "#", are blank lines, or contain "*".
    #   then select only lines that end with ".something" (i.e., are files not directories) and print them
    | awk '/^#/ {next} /^ *$/ {next} /\*/ {next} /.*\..+[^\/]$/ {print}'
    # some files are "file.txt", some are "/file.txt", so first remove leading slashes, then print folder and file
    | awk '{sub(/^\//, ""); printf "'"${folder}"'%s\n", $0}'
    > /tmp/files13131.txt; # in order to check each file exists we need another loop, so put the list of files in a temporary file for now
  while read f; do # run the loop for each input line of BB below
      [ -f "${f}" ] && echo "${f}"; # before the && is a test to see if the file "${f}" exists, and it is echo'd only if the file does exist
    done < /tmp/files13131.txt; # this is BB, "<" feeds the contents of the file to the loop
done <<< $( # this is AA, "<<< $(" sends the output of the following command to the loop
  find . -maxdepth 4 -name ".gitignore" # this finds all files named .gitignore in the current directory to a max depth of 4 directories
)

if we send the output of the command to files.txt (or copy and paste it in there), then we can copy all these files to a folder _git, preserving the folder structure, with:

rootf="_git"; while read file; do relfile=$(echo "${file}" | sed 's/^\.\///'); relfolder=$(echo "${relfile}" | sed 's/[^\/]*$//'); mkdir -p "${rootf}/${relfolder}"; cp "${file}" "${rootf}/${relfile}"; done < files.txt

and explained:

rootf="_git"; # set a variable to the folder to copy to so we can change it easily
while read file; do # loop over the file given in AA and name the loop variable "file"
  # turn "./folder/sub/file.txt" into "folder/sub/file.txt" by replacing an initial "./" with nothing
  relfile=$(echo "${file}" | sed 's/^\.\///');
  # turn "folder/sub/file.txt" into "folder/sub/" by replacing all content after the final "/" with nothing
  relfolder=$(echo "${relfile}" | sed 's/[^\/]*$//');
  # create directories for the relative folder e.g., "folder/sub", "-p" means create all required sub-directories
  mkdir -p "${rootf}/${relfolder}";
  # copy the file to the folder we just created
  cp "${file}" "${rootf}/${relfile}";
done < files.txt # use files.txt as the input to the loop
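As an aside (not in the original note): GNU cp can do the mkdir-then-cp dance by itself with its --parents flag, which recreates each source path's directory components under the destination. A minimal sketch, with a stand-in files.txt:

```shell
# demo setup in a temporary directory (a stand-in for the real
# files.txt produced by the previous command)
cd "$(mktemp -d)"
mkdir -p blog
echo 'SECRET=1' > blog/.env
printf '%s\n' './blog/.env' > files.txt

# GNU cp's --parents flag recreates each source path's directories
# under the destination, so no separate mkdir per file is needed
rootf="_git"
mkdir -p "${rootf}"
while read -r file; do
  relfile="${file#./}"   # strip the leading "./"
  cp --parents "${relfile}" "${rootf}/"
done < files.txt
```

This is GNU-specific, so it won't work with BSD/macOS cp, but on Linux it replaces the relfolder/mkdir steps above.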

Now, I can put the files on a memory stick, and I have a very volatile memory stick that I shouldn't lose.

and I can put the memory stick into Linux when I boot it, so I don't lose any of my secrets.

To be honest, I don't think many of them were really important; losing them would probably just mean searching for API keys again whenever I wanted to edit a project.

But, making these two commands was pretty fun.
