notes by alifeee


here I may post some short, text-only notes, mostly about programming. source code.

tags: all (33), scripting (11), bash (4), geojson (3), jq (3), linux (3), obsidian (3), ActivityPub (2), github (2), html (2), …

a list of lists of fonts to use on your website

tags: personal-websites, fonts • 252 'words', 76 secs @ 200wpm

Somebody asked me for a font. I don't know much about fonts. I do know a whole lot about making fonts. I also know a whole lot about surfing the web.

Sometimes, you choose to buy a pastry or a loaf from your local bakery instead of your nearly-local supermarket. In the same way, instead of loading up Google Fonts, you could consider picking a font from one of the lists on this page.

Here are some sites that I knew of already:

I also searched the web for "fonts for the web" - which didn't turn up anything that good, even with Kagi's "small web" filter toggled on - and then the same again with "site:neocities.org" added. From that, I found these lists, which look pretty great:

Never forget, you could also spend way too much time… making your own font.


uploading files to a GitHub repository with a bash script

tags: obsidian, github, scripting • 364 'words', 109 secs @ 200wpm

I write these notes in Obsidian. To upload them, I could visit https://github.com/alifeee/blog/tree/main/notes, click "add file", and copy and paste the file contents. I probably should do that.

But, instead, I wrote a shell script to upload them. Now, I can press "CTRL+P" to open the Obsidian command palette, type "lint" (to lint the note), then open it again, type "upload", and upload the note. At this point, I could walk away and assume everything went fine, but what I normally do is open the GitHub Actions tab to check that it worked properly.

The process the script undertakes is (see the sketch after this list):

  1. check user inputs are good (all variables exist, file is declared)
  2. check if file exists or not already in GitHub with a curl request
  3. generate a JSON payload for the upload request, including:
    1. commit message
    2. commit author & email
    3. file contents as a base64 encoded string
    4. (if file exists already) sha1 hash of existing file
  4. make a curl request to upload/update the file!
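
The actual script is in the gist linked at the end of this note; as a rough sketch of steps 2 to 4 (reusing the environment variable names from the configuration further down, with a made-up commit message, so an illustration rather than the real thing), the idea is something like:

# sketch: create or update a file via GitHub's contents API
api="https://api.github.com/repos/${org}/${repo}/contents/${fpath}$(basename "${1}")"

# 2. ask GitHub whether the file already exists, grabbing its blob sha if it does
sha=$(curl -s -H "Authorization: Bearer ${GITHUB_TOKEN}" "${api}" | jq -r '.sha // empty')

# 3. build the JSON payload (file contents must be base64 encoded)
payload=$(jq -n \
  --arg message "upload $(basename "${1}")" \
  --arg name "${git_name}" --arg email "${git_email}" \
  --arg content "$(base64 -w0 "${1}")" \
  --arg sha "${sha}" \
  '{message: $message, committer: {name: $name, email: $email}, content: $content}
    + (if $sha != "" then {sha: $sha} else {} end)')

# 4. PUT it to create or update the file
curl -s -X PUT -H "Authorization: Bearer ${GITHUB_TOKEN}" -d "${payload}" "${api}"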

As I use it from inside Obsidian, I use an extension called Obsidian shellcommands, which lets you specify several commands. For this, I specify:

export org="alifeee"
export repo="blog"
export fpath="notes/"
export git_name="alifeee"
export git_email="alifeee@alifeee.net"
export GITHUB_TOKEN="github_pat_3890qwug8f989wu89gu98w43ujg98j8wjgj4wjg9j83wjq9gfj38w90jg903wj"
{{vault_path}}/scripts/upload_to_github.sh {{file_path:absolute}}

…and when run with a file open, it will upload/update that file to my notes folder on GitHub.

This is maybe a strange way of doing it, as the "source of truth" is now "my Obsidian", and GitHub is really just a place for the files to live. However, I enjoy it.

I've made the script quite generic: you supply most of the information via environment variables. You can use it to upload an arbitrary file to a specific folder in a specific GitHub repository. Or… you can modify it and do what you want with it!

It's here: https://gist.github.com/alifeee/d711370698f18851f1927f284fb8eaa8


how to manually encrypt and decrypt a file (or folder)

tags: encryption • 473 'words', 142 secs @ 200wpm

I've wondered about the answer to this question for a while.

Today, I figured I'd find out a nice way, as I wanted to store an SSH private key on a server (so I can access it from different computers in different locations). I could also store it on my phone, as I (mostly) have that on me.

The idea could be that you would have a local file, encrypt it, then store it on a server (or any file-sharing service like Dropbox/Google Drive/etc). Then, from a different device, you could download it and decrypt it.

set aliases

I found this answer on the internet, and set some aliases so I can easily password-encrypt/decrypt arbitrary files. I set the aliases with atuin, so they sync across all my devices, but you could also stick them in ~/.bashrc or elsewhere. The aliases are:

alias decrypt='openssl enc -d -aes-256-cbc -pbkdf2 -in'
alias encrypt='openssl enc -aes-256-cbc -pbkdf2 -in'

encrypt

Then, I can use them by supplying a file, and I get a bunch of jumbled characters, which I certainly couldn't crack.

$ echo 'this is totally an SSH key' > non_encrypted_file.txt

$ encrypt non_encrypted_file.txt | tee encrypted_file.txt
enter AES-256-CBC encryption password: ********
Verifying - enter AES-256-CBC encryption password: ********
=���?���C�9���_Z���>E

decrypt

To decrypt, I put in the same password I used to encrypt:

$ decrypt encrypted_file.txt
enter AES-256-CBC decryption password: ********
this is totally an SSH key

using pipes

I can also use pipes!

$ cat non_encrypted_file.txt | encrypt - > encrypted_file.txt
enter AES-256-CBC encryption password: ********
Verifying - enter AES-256-CBC encryption password: ********

$ cat encrypted_file.txt | decrypt -
enter AES-256-CBC decryption password: ********
this is totally an SSH key

notes

how good is the encryption

I'm not sure how "good" aes-256-cbc is as an encryption scheme. I'll ignore this for now.

how to expand aliases

In future, I may want to know what type of encryption I use. I could go and look at my aliases file, but I discovered that you can also type ALT+CTRL+E (or ESC+CTRL+E) to expand aliases inline, turning line 1 into line 2:

encrypt
openssl enc -aes-256-cbc -pbkdf2 -in

how to encrypt a folder/multiple files

to encrypt a folder, you could use tar to turn the folder into a single .tar file, and later use tar again to turn it back into a folder. A bit like this:

# create archive
tar -cf non_encrypted_folder.tar non_encrypted_folder/
# encrypt
encrypt non_encrypted_folder.tar > encrypted_folder.tar
# remove original folder
rm -rf non_encrypted_folder/
# decrypt
decrypt encrypted_folder.tar > decrypted_folder.tar
# extract archive
tar -xf decrypted_folder.tar
# check exists
cat non_encrypted_folder/non_encrypted_file.txt 
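
Since openssl asks for the passphrase on the terminal rather than on stdin, you could also skip the intermediate .tar file and pipe tar straight through the aliases. A sketch (with the same made-up names as above):

# archive and encrypt in one go
tar -cf - non_encrypted_folder/ | encrypt - > encrypted_folder.tar.enc
# decrypt and extract in one go
decrypt encrypted_folder.tar.enc | tar -xf -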

finding the account information of a Mastodon account manually via curl requests

tags: ActivityPub • 458 'words', 137 secs @ 200wpm

Then Try This, a non-profit research group, recently changed their mastodon handle from

@thentrythis@mastodon.thentrythis.org

to

@thentrythis@thentrythis.org

To understand how this works (because I like understanding the ActivityPub protocol), I tried to find out how my Mastodon client would find the new account.

When you open the original account profile, it opens on social.thentrythis.org, so there must be some path from thentrythis.org to there.

First, I tried accessing several URLs off the top of my head that I thought were used.

https://thentrythis.org/.well-known/webfinger
https://social.thentrythis.org/.well-known/webfinger

They were all blank.

Then, I was pointed in the right direction, and, using Mastodon's documentation, I could manually make the same requests that my Mastodon client would make.

The process is:

Given a username, i.e., @thentrythis@thentrythis.org, find the format of the "webfinger request" (which allows you to request data about a user), which should be on /.well-known/host-meta. The key here is that the original site (thentrythis.org) redirects to the "social site" (social.thentrythis.org).

$ curl -i "https://thentrythis.org/.well-known/host-meta" | head -n4
HTTP/1.1 301 Moved Permanently
Date: Sun, 12 Jan 2025 18:56:53 GMT
Server: Apache/2.4.62 (Debian)
Location: https://social.thentrythis.org/.well-known/host-meta

$ curl "https://social.thentrythis.org/.well-known/host-meta"
<?xml version="1.0" encoding="UTF-8"?>
<XRD xmlns="http://docs.oasis-open.org/ns/xri/xrd-1.0">
  <Link rel="lrdd" template="https://social.thentrythis.org/.well-known/webfinger?resource={uri}"/>
</XRD>

Then, we can use the template returned to query a user by placing acct:<user>@<server> into the template, replacing {uri}, i.e.,

curl -s "https://social.thentrythis.org/.well-known/webfinger?resource=acct:thentrythis@thentrythis.org" | jq
{
  "subject": "acct:thentrythis@thentrythis.org",
  "aliases": [
    "https://social.thentrythis.org/@thentrythis",
    "https://social.thentrythis.org/users/thentrythis"
  ],
  "links": [
    {
      "rel": "http://webfinger.net/rel/profile-page",
      "type": "text/html",
      "href": "https://social.thentrythis.org/@thentrythis"
    },
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://social.thentrythis.org/users/thentrythis"
    },
    {
      "rel": "http://ostatus.org/schema/1.0/subscribe",
      "template": "https://social.thentrythis.org/authorize_interaction?uri={uri}"
    },
    {
      "rel": "http://webfinger.net/rel/avatar",
      "type": "image/png",
      "href": "https://social.thentrythis.org/system/accounts/avatars/113/755/294/674/928/838/original/640741180e302572.png"
    }
  ]
}

neat :]

It's always nice to know that I could use Mastodon by reaaaallllyyy slowly issuing my own curl requests (or, what this really means, build my own client).
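
For example (a sketch of a possible next step, not part of the lookup above), the "self" link can be fetched as an ActivityPub actor document by asking for activity+json; the exact fields depend on the server:

curl -s -H "Accept: application/activity+json" \
  "https://social.thentrythis.org/users/thentrythis" \
  | jq '{id, preferredUsername, inbox, outbox}'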


my ad-hoc definition of hacking

tags: hacking, definitions • 278 'words', 83 secs @ 200wpm

a very broad-strokes definition of the word "hacking" I spurted out in a text conversation.

when people say hacking they mean one of several things
the (positive) sense is that used by hackspaces, to hack is to make something do something beyond its initial purposes
technologically, a lot of the time, that means taking apart an old TV and reusing parts of it to make a lightning rod, or replacing a phone battery by yourself (the phone companies do not desire this), or adding a circuitboard to your cat flap that uses the chip inside the cat to detect if it's your cat and if not lock the flap
more "software based", it can be like scraping a government website to collect documents into a more readable format, turning trains back on via software that were disabled by their manufacturer as a money-grabbing gambit, getting access to academic papers that are unreasonably locked behind expensive paywalls

If someone says 'my facebook got hacked' what does that mean

usually what they mean is that someone has logged into it without their permission
and most (all) of the time, that person has guessed their password because they said it out loud, they watched them put it in, they guessed it randomly (probs rare), or (rarest) they found the password in a password leak for a different website and tried it on Facebook (because the person uses the same password on multiple accounts)
I'd call that a second thing people say hacking for, and a third is the money-extorting hackers, who hack into [the British Library] and lock all their documents unless they pay [a ransom]


installing my own VPN on my server was much easier than I thought

tags: vpn, server, open-access • 408 'words', 122 secs @ 200wpm

I've thought about installing a VPN on my server for a few months. It wouldn't allow the perhaps-more-common VPN use of getting past region-locked content (as I can't change the region of my server), but as an academic exercise, and for other reasons, I gave it a try.

It was super easy! All I ran was:

wget https://git.io/vpn -O openvpn-install.sh
sudo bash openvpn-install.sh

I accepted all the default settings (IP / UDP / port / DNS servers) apart from the username (alifeee), and allowed the port through my firewall (which uses Uncomplicated FireWall (ufw)) with sudo ufw allow 1194 (the default port). A file alifeee.ovpn was created; it was pretty simple, basically just a few keys, and it looked a bit like this:

client
dev tun
proto udp
remote xxx.xx.xxx.xxx 1194
resolv-retry infinite
nobind
persist-key
persist-tun
remote-cert-tls server
auth SHA512
ignore-unknown-option block-outside-dns
verb 3
<ca>
-----BEGIN CERTIFICATE-----
awiohi3hrt7832h8whefiuhy372wjofeijhiu324htwefpjh32ihefjvhguiy238
28y3tfhunh3278yhiugj3y2iuejgvi32t7uojqwevg7t2ui3egv7i3tu2wegvuit
...
...
3289qewfu8it3j28geu428uhewgimnio23uewgnvdiuw482hj3iegdv98b7suywh
3qw8fu3it2niop9wq8fuvegjny32klqoewiu8bhgjnq3==
-----END CERTIFICATE-----
</ca>
<cert>
-----BEGIN CERTIFICATE-----
awiohi3hrt7832h8whefiuhy372wjofeijhiu324htwefpjh32ihefjvhguiy238
28y3tfhunh3278yhiugj3y2iuejgvi32t7uojqwevg7t2ui3egv7i3tu2wegvuit
...
...
3289qewfu8it3j28geu428uhewgimnio23uewgnvdiuw482hj3iegdv98b7suywh
3qw8fu3it2niop9wq8fuvegjny32klqoewiu8bhgjnq3==
-----END CERTIFICATE-----
</cert>
<key>
-----BEGIN PRIVATE KEY-----
awiohi3hrt7832h8whefiuhy372wjofeijhiu324htwefpjh32ihefjvhguiy238
28y3tfhunh3278yhiugj3y2iuejgvi32t7uojqwevg7t2ui3egv7i3tu2wegvuit
...
...
3289qewfu8it3j28geu428uhewgimnio23uewgnvdiuw482hj3iegdv98b7suywh
3qw8fu3it2niop9wq8fuvegjny32klqoewiu8bhgjnq3==
-----END PRIVATE KEY-----
</key>
<tls-crypt>
-----BEGIN OpenVPN Static key V1-----
8rq328r8ij3fwy73uiwtoeg8u3hiweg3
38efuiot328yguwhgjovi8u32qwaghhj
...
...
ot3iu7wsaopkijguwthji3qwokijnegh
tiu3eyvjieg93nwjsgkobjhunt3etuhf
-----END OpenVPN Static key V1-----
</tls-crypt>

This file was small enough that I was able to copy it in only two screens through ConnectBot on my phone. To install it, I:

Since I've installed it, it's actually been pretty useful. I've used it:

So, if you want to get round blocks, hide your traffic, or other VPN shenanigans, you could create a VPS (Virtual Private Server) and install OpenVPN to it pretty easily. Perhaps you could even get around region locks if you picked a server location in a region you wanted.
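
For reference, connecting from a Linux client with a file like that is also only a couple of commands (a sketch, assuming the stock OpenVPN client package; not necessarily how I connect on every device):

sudo apt install openvpn
# connect using the generated client profile
sudo openvpn --config alifeee.ovpn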


combining geojson files with jq

tags: geojson, jq, scripting • 520 'words', 156 secs @ 200wpm

I'm writing a blog about hitchhiking, which involves a load of .geojson files, which look a bit like the example below.

The .geojson files are generated from .gpx traces that I exported from OSRM's (Open Source Routing Machine) demo (which, at time of writing, seems to be offline, but I believe it's on https://map.project-osrm.org/), one of the routing engines on OpenStreetMap.

I put in a start and end point, exported the .gpx trace, and then converted it to .geojson with, e.g., ogr2ogr "2.1 Tamworth -> Tibshelf Northbound.geojson" "2.1 Tamworth -> Tibshelf Northbound.gpx" tracks, where ogr2ogr is a command-line tool from sudo apt install gdal-bin which converts geographic data between many formats (I like it a lot, it feels nicer than searching the web for "errr, kml to gpx converter?"). I also then semi-manually added some properties (see how).

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "label": "2.1 Tamworth -> Tibshelf Northbound",
        "trip": "2"
      },
      "geometry": {
        "type": "MultiLineString",
        "coordinates": [
          [
            [-1.64045, 52.60606],
            [-1.64067, 52.6058],
            [-1.64069, 52.60579],
            ...
          ]
        ]
      }
    }
  ]
}

I then had a load of files that looked a bit like

$ tree -f geojson/
geojson
├── geojson/1.1 Tamworth -> Woodall Northbound.geojson
├── geojson/1.2 Woodall Northbound -> Hull.geojson
├── geojson/2.1 Tamworth -> Tibshelf Northbound.geojson
├── geojson/2.2 Tibshelf Northbound -> Leeds.geojson
├── geojson/3.1 Frankley Northbound -> Hilton Northbound.geojson
├── geojson/3.2 Hilton Northbound -> Keele Northbound.geojson
└── geojson/3.3 Keele Northbound -> Liverpool.geojson

Originally, I was combining them into one .geojson file using https://github.com/mapbox/geojson-merge, which is a small command-line tool to merge .geojson files, but I decided to use jq because I wanted to do something a bit more complex, which was to create a structure like

FeatureCollection
  Features:
    FeatureCollection
      Features (1.1 Tamworth -> Woodall Northbound, 1.2 Woodall Northbound -> Hull)
    FeatureCollection
      Features (2.1 Tamworth -> Tibshelf Northbound, 2.2 Tibshelf Northbound -> Leeds)
    FeatureCollection
      Features (3.1 Frankley Northbound -> Hilton Northbound, 3.2 Hilton Northbound -> Keele Northbound, 3.3 Keele Northbound -> Liverpool)

I spent a while making a quite-complicated jq query, using variables (an "advanced feature"!) and a reduce statement, but when I completed it, I found out that the above structure is not valid .geojson, so I went back to just having:

FeatureCollection
  Features (1.1 Tamworth -> Woodall Northbound, 1.2 Woodall Northbound -> Hull, 2.1 Tamworth -> Tibshelf Northbound, 2.2 Tibshelf Northbound -> Leeds, 3.1 Frankley Northbound -> Hilton Northbound, 3.2 Hilton Northbound -> Keele Northbound, 3.3 Keele Northbound -> Liverpool)

...which is... a lot simpler to make.

A query which combines the files above is (the sort is there so the files end up in numerical order, top to bottom, in the resulting .geojson):

while read file; do cat "${file}"; done <<< $(find geojson/ -type f | sort -t / -k 2 -n) | jq --slurp '{
    "type": "FeatureCollection",
    "name": "hitchhikes",
    "features": ([.[] | .features[0]])
}' > hitching.geojson
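
For file names like these, a plain shell glob already sorts usefully (1.1, 1.2, 2.1, ...), so a shorter near-equivalent is something like this sketch (it would need rethinking once trip numbers reach double digits):

jq -s '{
  "type": "FeatureCollection",
  "name": "hitchhikes",
  "features": map(.features[0])
}' geojson/*.geojson > hitching.geojson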

While geojson-merge was cool, it feels nice to have a more "raw" command to do what I want.


a Nautilus script to create blank files in a folder

tags: nautilus, scripting • 330 'words', 99 secs @ 200wpm

I moved to Linux [time ago]. One thing I miss from the Windows file explorer is how easy it was to create text files.

With Nautilus (Pop!_OS' default file browser), you can create templates which appear when you right click on empty space in a folder (I don't remember where the templates file is and I can't find an obvious way to find out, so... search it yourself), but this doesn't work if you're using nested folders.

i.e., I use this view a lot in Nautilus, which is a tree view that lets you expand folders instead of opening them (similar to most code editors):

.
├── ./5.3.2
│   └── ./5.3.2/new_file
├── ./6.1.4
├── ./get_5.3.2.py
└── ./get_6.1.4.py

But in this view, you can't "right click on empty space inside a folder" to create a new template file, you can only "right click the folder" (or if it's empty, "right click a strange fake-file called (Empty)").

So, I created a script in /home/alifeee/.local/share/nautilus/scripts called new file (folder script) with this content:

#!/bin/bash
# create new file within folder (only works if used on folder)
# notify-send requires libnotify-bin -> `sudo apt install libnotify-bin`

if [ -z "${1}" ]; then
  notify-send "did not get folder name. use script on folder!"
  exit 1
fi

file="${1}/new_file"

i=0
while [ -f "${file}" ]; do
  i=$(($i+1))
  file="${1}/new_file${i}"
done

touch "${file}"

if [ ! -f "${file}" ]; then
  notify-send "tried to create a new file but it doesn't seem to exist"
else
  notify-send "I think I created file all well! it's ${file}"
fi

Now I can right click on a folder, click "scripts > new file" and have a new file that I can subsequently rename. Sweet.

I sure hope that in future I don't want anything slightly more complicated like creating multiple new files at once...


comparing PCs with terminal commands

tags: pc-building, scripting, visualisation • 738 'words', 221 secs @ 200wpm

I was given an old computer. I'd quite like to make a computer to use in my studio, and take my tower PC home to play video games (mainly/only local coop games like Wilmot's Warehouse, Towerfall Ascension, or Unrailed, and occasionally Gloomhaven).

It's not the best, and I'd like to know what parts I would want to replace to make it suit my needs (which are vaguely "can use a modern web browser" without being slow).

By searching the web, I found these commands to collect hardware information for a computer:

uname -a # vague computer information
lscpu # cpu information
df -h # hard drive information
sudo dmidecode -t bios # bios information
free -h # memory (RAM) info
lspci -v | grep VGA -A11 # GPU info (1)
sudo lshw -numeric -C display # GPU info (2)

I also found these commands to benchmark some things:

sudo apt install sysbench glmark2
# benchmark CPU
sysbench --test=cpu run
# benchmark memory
sysbench --test=memory run
# benchmark graphics
glmark2

I put the output of all of these commands into text files for each computer, into a directory that looks like:

├── ./current
│   ├── ./current/benchmarks
│   │   ├── ./current/benchmarks/cpu
│   │   ├── ./current/benchmarks/gpu
│   │   └── ./current/benchmarks/memory
│   ├── ./current/bios
│   ├── ./current/cpu
│   ├── ./current/disks
│   ├── ./current/gpu
│   ├── ./current/memory
│   └── ./current/uname
└── ./new
    ├── ./new/benchmarks
    │   ├── ./new/benchmarks/cpu
    │   ├── ./new/benchmarks/gpu
    │   └── ./new/benchmarks/memory
    ├── ./new/bios
    ├── ./new/cpu
    ├── ./new/disks
    ├── ./new/gpu
    ├── ./new/memory
    └── ./new/uname
4 directories, 19 files
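
Something like this would produce that layout for one machine (a sketch using the file names from the tree above, not necessarily the exact commands I ran):

mkdir -p current/benchmarks
uname -a > current/uname
lscpu > current/cpu
df -h > current/disks
sudo dmidecode -t bios > current/bios
free -h > current/memory
{ lspci -v | grep VGA -A11; sudo lshw -numeric -C display; } > current/gpu
sysbench --test=cpu run > current/benchmarks/cpu
sysbench --test=memory run > current/benchmarks/memory
glmark2 > current/benchmarks/gpu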

Then, I ran this command to generate a diff file to look at:

echo "<html><head><style>html {background: black;color: white;}del {text-decoration: none;color: red;}ins {color: green;text-decoration: none;}</style></head><body>" > compare.html
while read file; do
  f=$(echo "${file}" | sed 's/current\///')
  git diff --no-index --word-diff "current/${f}" "new/${f}" \
    | sed 's/\[\-/<del>/g' | sed 's/-\]/<\/del>/g' \
    | sed -E 's/\{\+/<ins>/g' | sed -E 's/\+\}/<\/ins>/g' \
    | sed '1s/^/<pre>/' | sed '$a</pre>'
done <<< $(find current/ -type f) >> compare.html
echo "</body></html>" >> compare.html 

Then I could open that HTML file and look very easily at the differences between the computers. Here is a snippet as an example (shown with git's [-current-]{+new+} word-diff markers, since the red/green colours don't survive as plain text):

CPU(s):                   [-12-]{+6+}
  On-line CPU(s) list:    [-0-11-]{+0-5+}
Vendor ID:                [-AuthenticAMD-]{+GenuineIntel+}
  Model name:             [-AMD Ryzen 5 1600 Six-Core Processor-]{+Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz+}
    CPU family:           [-23-]{+6+}
    Model:                [-1-]{+158+}
    Thread(s) per core:   [-2-]{+1+}
    Core(s) per socket:   6
    Socket(s):            1
Latency (ms):
         min:                                    [-0.55-]{+0.71+}
         avg:                                    [-0.57-]{+0.73+}
         max:                                    [-1.62-]{+1.77+}
         95th percentile:                        [-0.63-]{+0.74+}
         sum:                                 [-9997.51-]{+9998.07+}
    glmark2 2021.02
=======================================================
    OpenGL Information
    GL_VENDOR:     [-AMD-]{+Mesa+}
    GL_RENDERER:   [-AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 15.0.7, DRM 3.57, 6.9.3-76060903-generic)-]{+NV106+}
    GL_VERSION:    [-4.6-]{+4.3+} (Compatibility Profile) Mesa 24.0.3-1pop1~1711635559~22.04~7a9f319
...
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: [-9303-]{+213+} FrameTime: [-0.107-]{+4.695+} ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: [-8108-]{+144+} FrameTime: [-0.123-]{+6.944+} ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: [-7987-]{+240+} FrameTime: [-0.125-]{+4.167+} ms
=======================================================
                                  glmark2 Score: [-7736-]{+203+}

It seems like the big limiting factor is the GPU. Everything else seems reasonable to leave in there.

As ever, I find git diff --no-index an invaluable tool.


attempts to make a local archive of personal websites

tags: personal-websites, scripting • 1035 'words', 311 secs @ 200wpm

I wanted to make a local archive of personal websites. This is because in the past I have searched my bookmarks for things like fonts to see how many of them mention, talk about, or link to things about fonts. When I did this, I only looked at the homepages, so I've been wondering about a way to search a list of entire sites since.

I came up with the idea of downloading the HTML files for my bookmarked sites, and using grep and...

...other tools...

from https://merveilles.town/@akkartik

Here are 2 scripts that search under the current directory. I've been using search for several years, and just cooked up lua_search.

search is a shell script. Doesn't support regular expressions; it can't pass in quoted args to nested commands.

lua_search should support Lua patterns. It probably still has some bugs.

E.g. lua_search '%f[%w]a%f[%W]' is like grep '\<a\>'.

I think I'll make another version in Perl or something, to support more familiar regular expressions.

Oh, forgot attachments:

https://paste.sr.ht/~akkartik/9136bfabd0143733a040a8fe6f909c1e5d3b9db6

Also, lua_search doesn't support case-insensitivity yet. search tries to be smart: if you pass in a pattern with any uppercase letters it's treated as case-sensitive, but if it's all lowercase it's treated as case-insensitive. lua_search doesn't have these smarts yet, and all patterns are currently case-sensitive.

#!/usr/bin/zsh
# Search a directory for files containing all of the given keywords.

DIR=`mktemp -d`

ROOT=${ROOT:-.}
# generate a list of files on stdout
echo find `eval echo $ROOT` -type f -print0  \> $DIR/1    >&2
find `eval echo $ROOT` -type f -print0  > $DIR/1

INFILE=1
for term in $*
do
  # filter file list for one term
  OUTFILE=$(($INFILE+1))
  if echo $term |grep -q '[A-Z]'
  then
    echo cat $DIR/$INFILE  \|xargs -0 grep -lZ "$term"  \> $DIR/$OUTFILE    >&2
    cat $DIR/$INFILE  |xargs -0 grep -lZ "$term"  > $DIR/$OUTFILE
  else
    echo cat $DIR/$INFILE  \|xargs -0 grep -ilZ "$term"  \> $DIR/$OUTFILE    >&2
    cat $DIR/$INFILE  |xargs -0 grep -ilZ "$term"  > $DIR/$OUTFILE
  fi
  INFILE=$OUTFILE
done

# get rid of nulls in the outermost call, and sort for consistency
cat $DIR/$INFILE  |xargs -n 1 -0 echo  |sort

#!/usr/bin/lua

local input = io.popen('find . -type f')

-- will scan each file to the end at most once
function match(filename, patterns)
  local file = io.open(filename)
  for _, pattern in ipairs(patterns) do
    if not search(file, pattern) then
      return false
    end
  end
  file:close()
  return true
end

function search(file, pattern)
  if file:seek('set') == nil then error('seek') end
  for line in file:lines() do
    if line:match(pattern) then
      return true
    end
  end
  return false
end

for filename in input:lines() do
  filename = filename:sub(3)  -- drop the './'
  if match(filename, arg) then
    print(filename)
  end
end

...to search the sites.
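
As a rough idea of the kind of search I mean (a sketch, assuming each site ends up in its own folder of downloaded HTML):

# list which downloaded sites mention "font" anywhere in their HTML
grep -rli --include='*.html' 'font' . | cut -d/ -f2 | sort -u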

initial attempt

I found you can use wget to do exactly this (download an entire site), using a cacophony of arguments. I put them into a script that looks a bit like:

#!/bin/bash
domain=$(
  echo "${1}" \
    | sed -E 's/https?:\/\///g' \
    | sed 's/\/.*//'
)
wget \
     --recursive \
     --level 0 \
     -e robots=off \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --domains "${domain}" \
     --no-parent \
     "${1}"

modifications/tweaks

...and set it off. I found several things, which made me modify the script in several ways (mainly I saw these by watching one specific URL take a lot of time to scrape):

At this point (after not much effort, to be honest), I gave up. My final script was:

#!/bin/bash
domain=$(
  echo "${1}" \
    | sed -E 's/https?:\/\///g' \
    | sed 's/\/.*//'
)

+ skip="clofont.free.fr, erich-friedman.github.io, kenru.net, docmeek.com, brisray.com, ihor4x4.com"
+ if [[ "${skip}" =~ ${domain} ]]; then
+   echo "skipping ${domain}"
+   exit 1
+ fi

+ if [ -d "${domain}" ]; then
+   echo "skipping... folder ${domain} already exists"
+   exit 1
+ fi

+ echo "wget ${1} with domain ${domain}"

wget \
+    -N \
     --recursive \
     --level 0 \
     -e robots=off \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --domains "${domain}" \
     --no-parent \
+    --reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso,*.ppt,*.zip,*.tar,*.gz,*.mpg,*.mp4' \
+    --ignore-tags=img,link,script \
+    --header="Accept: text/html" \
     "${1}"

If I want to continue in future (searching a personal list of sites), I may find another way to do it, perhaps something similar to Google's search syntax potato site:http://site1.org site:http://site2.org, or perhaps I can create a custom search engine filter with DuckDuckGo/Kagi/etc that lets me put a custom list of URLs in. Who's to say. Otherwise, I'll also just continue sticking search queries in the various alternative/indie search engines like those on https://proto.garden/blog/search_engines.html.
