notes by alifeee profile picture rss

return to notes / blog / website / weeknotes / linktree

here I may post some short, text-only notes, mostly about programming. source code.

tags: all (19), scripting (5), bash (4), linux (3), html (2), markdown (2), obsidian (2), shortcuts (2), ActivityPub (1), advent-of-code (1) ............ see all (+15)

viewing a single note

converting HTML entities to 'normal' UTF-8 in bash # source

tags: bash, html • 384 'words', 115 secs @ 200wpm

I do a fair amount of web scraping, and you come across a lot of HTML entities. They're the things that look like >,  , &.

See https://developer.mozilla.org/en-US/docs/Glossary/Character_reference

Naturally, you usually want to turn them back into characters, usually UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made file to test HTML entity decoders on. Here it is, as file.txt:

Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

I wanted to find a way to convert the entities (i.e., decode ' & º, but NOT decode  (nbsp) &) with a single command I could put in a bash pipe. I tried several contenders:

perl

I'd used this one before in a script scraping an RSS feed: https://github.com/alifeee/openbenches-train-sign/blob/a29cc24df919c67809f84586f9e0a90aed6ea3cf/transformer/full.cgi#L49, but on this input, it fails as it doesn't decode the symbol º (U+00BA : MASCULINE ORDINAL INDICATOR from https://babelstone.co.uk/Unicode/whatisit.html).

$ cat file.txt | perl -MHTML::Entities -pe 'decode_entities($_);'
Children's event,
Wildlife & Nature,
peddler-market-n�-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

recode

I found recode after some searching, but it failed as it tried to unencode things that were already unencoded.

$ sudo apt install recode
$ cat file.txt | recode html
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artistsâ Circle,
surface â Breaking
woodland walk. (nbsp)
Justin Adams  Mauro Durante

php

At first I used html_specialchars_decode which didn't work, but then I found html_entity_decode, which does the job perfectly. Thanks PHP.

$ cat file.txt | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

The only thing I don't know how to do now is to make a bash function or alias so that I could write

cat file.txt | decodeHTML

instead of the massive php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'.

back to top back to main page