notes by alifeee, tagged html (2)

here I may post some short, text-only notes, mostly about programming.

basic HTML template with all the gubbins

tags: html • 235 'words', 71 secs @ 200wpm

I create a lot of websites. I'm also a big fan of very performant, very accessible websites. I'm also a big fan of the phrase "the important thing first is for a webpage to exist, then after that to look nice".

With that in mind, this is the HTML template I usually start with to make a new webpage. It's designed so that you can replace the things in curly brackets by hand, with the bonus that it also works with a templating language like https://handlebarsjs.com/ (my favourite).

<!DOCTYPE html>
<html lang="en">

<head>
    <!-- browser metas -->
    <title>{{title}}</title>
    <meta name="description" content="{{description}}" />
    <!-- allow unicode characters -->
    <meta charset="utf-8" />
    <!-- 'zoom' on mobile -->
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />

    <!-- embed metas - https://ogp.me/ - visible when sharing on social media -->
    <meta property="og:title" content="{{title}}" />
    <meta property="og:type" content="website" />
    <meta property="og:site_name" content="{{title}}" />
    <meta property="og:url" content="{{base_url}}/{{page}}" />
    <meta property="og:image" content="{{base_url}}/{{image}}" />
    <meta property="og:description" content="{{description}}" />
    <meta property="og:locale" content="en_GB" />

    <!-- styling -->
    <!-- favicon - can be any image (.png, .jpg, .ico) -->
    <link rel="icon" type="image/png" href="/og-image.png" />
</head>

<body>
    <header></header>
    <main>
        {{{content}}}
    </main>
    <footer></footer>
</body>

</html>
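
If you don't want to reach for a templating language, the curly-bracket placeholders can also be filled in from the command line. What follows is only a rough sketch: the function name fill_template, the file names template.html and index.html, and the example values are all made up for illustration, and it only handles three of the placeholders.

```shell
# fill_template TITLE DESCRIPTION BASE_URL
# Reads a template on stdin and writes the filled-in page to stdout.
# Hypothetical helper, not part of the original template.
# Limitation: the values must not contain "|", "&" or "\",
# because sed treats those specially in the replacement.
fill_template() {
    sed \
        -e "s|{{title}}|$1|g" \
        -e "s|{{description}}|$2|g" \
        -e "s|{{base_url}}|$3|g"
}

# usage, assuming the template is saved as template.html:
# fill_template "My site" "A small personal site" "https://example.com" < template.html > index.html
```

Placeholders it doesn't know about (like {{page}} or {{{content}}}) are left untouched, so they can still be replaced by hand afterwards.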

Put it somewhere, and put something in there! Make a personal website! Make a blog! I will love you forever.

converting HTML entities to 'normal' UTF-8 in bash

tags: bash, html • 384 'words', 115 secs @ 200wpm

I do a fair amount of web scraping, and I come across a lot of HTML entities. They're the things that look like &gt;, &nbsp;, &amp;.

See https://developer.mozilla.org/en-US/docs/Glossary/Character_reference

Naturally, you usually want to turn them back into characters, typically as UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as file.txt:

Children&#39;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

I wanted to find a way to convert the entities (i.e., decode &#39; &amp; &#186;, but NOT decode  (nbsp) &) with a single command I could put in a bash pipe. I tried several contenders:

perl

I'd used this one before in a script scraping an RSS feed: https://github.com/alifeee/openbenches-train-sign/blob/a29cc24df919c67809f84586f9e0a90aed6ea3cf/transformer/full.cgi#L49, but on this input it fails: the symbol º (U+00BA : MASCULINE ORDINAL INDICATOR from https://babelstone.co.uk/Unicode/whatisit.html) comes out as an invalid character.

$ cat file.txt | perl -MHTML::Entities -pe 'decode_entities($_);'
Children's event,
Wildlife & Nature,
peddler-market-n�-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
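
The � looks like an output-encoding problem rather than a decoding one: by default perl writes a character like º as a single Latin-1 byte, which is not valid UTF-8. A possible fix, sketched here on the assumption that the input is UTF-8, is perl's -CS switch, which treats stdin and stdout as UTF-8. The function name decode_entities_perl is made up for this example.

```shell
# Hypothetical wrapper around the perl one-liner above.
# -CS tells perl to read stdin and write stdout as UTF-8, so decoded
# characters such as º are printed as UTF-8 rather than as a lone byte.
# Requires the HTML::Entities module (part of HTML-Parser).
decode_entities_perl() {
    perl -CS -MHTML::Entities -pe 'decode_entities($_);'
}

# usage:
# cat file.txt | decode_entities_perl
```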

recode

I found recode after some searching, but it failed because it also tried to decode characters that weren't encoded in the first place, mangling the ’ and – and dropping the bare &.

$ sudo apt install recode
$ cat file.txt | recode html
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artistsâ Circle,
surface â Breaking
woodland walk. (nbsp)
Justin Adams  Mauro Durante

php

At first I used htmlspecialchars_decode, which didn't work (it only decodes a handful of entities such as &amp; and &lt;), but then I found html_entity_decode, which does the job perfectly. Thanks PHP.

$ cat file.txt | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante

The only thing I don't know how to do now is to make a bash function or alias so that I could write

cat file.txt | decodeHTML

instead of the massive php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'.
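
One way this could probably be done is with a shell function in ~/.bashrc, since a function inherits stdin from the pipe and passes it straight on to php. This is a sketch rather than a tested setup: the explicit ENT_QUOTES | ENT_HTML5 flags and the "UTF-8" argument are an addition (not in the command above) so that &#39; is also decoded on PHP versions older than 8.1, and the !== false check stops a line containing only "0" from ending the loop early.

```shell
# decodeHTML: decode HTML entities from stdin and write UTF-8 to stdout.
# Sketch of a helper for ~/.bashrc; needs the php command-line binary.
decodeHTML() {
    php -r 'while (($line = fgets(STDIN)) !== false) { echo html_entity_decode($line, ENT_QUOTES | ENT_HTML5, "UTF-8"); }'
}

# usage:
# cat file.txt | decodeHTML
```

An alias would work too, but a function is easier to quote correctly with all the single quotes in the PHP snippet.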
