# Word List Definitions
**Status:** Final | Implemented | November 2024

**Author:** Jonathan Blandford

Each word-list is kept in a GResource file that can be loaded and
unloaded as per the user's settings.

We extract the definitions for each word in the word-list from
Wiktionary and put them in the .gresource file alongside the word-list
index data. This is a complicated and time-intensive process, taking
tens of minutes per build. Its first step is to download the raw
Wiktionary data, which is ~20 GB.

As a result, **we keep the pre-computed binary GResource file and data
in git** rather than building them from first principles.

:::{note}
One result of this is that we tend to build into
`$SOURCE_DIR` rather than `$BUILD_DIR`.
:::

## Files

`def-extractor.py` will generate a number of files.

* `word-lists/{name}-filtered-wiktextract-data.jsonl`
* **`word-lists/{name}-gvariant-defs.data`**
* **`word-lists/{name}-gvariant-index.data`**
* **`word-lists/{name}-enums.json`**

:::{important}
Files in bold are stored in git and updated when the
word-list or format changes.
:::

### Resource file: `{name}-filtered-wiktextract-data.jsonl`

This is an intermediate file and is not kept in git. It contains all
the definitions of words from the word list, pulled from the main
Wiktionary data. The remaining operations read this smaller file
instead of the full raw data, as an optimization.

:::{note}
The `.jsonl` extension indicates that it's a list of JSON
blocks, each stored on its own line. The entire file isn't valid
JSON, but each line is. This is the same format that the raw data
comes in.
:::
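
As a rough illustration of the line-by-line format, the sketch below filters JSONL input down to a set of wanted words. It is not the code in `def-extractor.py`; the helper name `filter_jsonl` is hypothetical, and the `"word"` and `"pos"` keys are assumed to match the fields in the raw data.

```python
import io
import json


def filter_jsonl(lines, wanted_words):
    """Yield parsed entries whose "word" field is in wanted_words."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)  # each line is a complete JSON document
        if entry.get("word") in wanted_words:
            yield entry


# Tiny in-memory stand-in for the raw data file (made-up entries).
sample = io.StringIO(
    '{"word": "mouse", "pos": "noun"}\n'
    '{"word": "zyzzyva", "pos": "noun"}\n'
    '{"word": "mouse", "pos": "verb"}\n'
)
kept = list(filter_jsonl(sample, {"mouse"}))
print([entry["pos"] for entry in kept])  # ['noun', 'verb']
```

Because every line parses on its own, the input can be streamed without ever loading the whole file into memory.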

### Resource file: `{name}-gvariant-defs.data`

This contains all the definitions concatenated together. Each definition
is stored as a `GVariant` with a type signature of
`"(sa(ysa(maqas)))"`. Breaking it down:

```
(s    a(         y   s      a(         maq                as)))
 WORD ENTRY-LIST POS HEADER SENSE-LIST OPTIONAL-TAGS-LIST GLOSSES-LIST
```
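
To make the nesting concrete, here is a sketch of one definition written as plain Python values that mirror the signature. The sample word, header text, POS index, and tag numbers are all invented for illustration, and `iter_glosses` is a hypothetical helper. Real code would go through the GVariant API rather than Python tuples.

```python
# One definition shaped like "(sa(ysa(maqas)))", using plain Python values.
definition = (
    "mouse",                          # s: WORD
    [                                 # a(...): ENTRY-LIST
        (
            0,                        # y: POS, one byte (made-up value)
            "mouse (plural mice)",    # s: HEADER (made-up text)
            [                         # a(...): SENSE-LIST
                # maq is a "maybe" array of uint16: None when absent.
                (None, ["A small rodent."]),
                ([3, 7], ["A computer pointing device."]),
            ],
        ),
    ],
)


def iter_glosses(defn):
    """Yield (pos, tags, gloss) for every gloss in a definition."""
    _word, entries = defn
    for pos, _header, senses in entries:
        for tags, glosses in senses:
            for gloss in glosses:
                yield pos, tags or [], gloss


for pos, tags, gloss in iter_glosses(definition):
    print(pos, tags, gloss)
```

The `m` prefix is what makes the tags list optional: a sense with no tags stores nothing, rather than an empty array.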

### Resource file: `{name}-gvariant-index.data`

This contains a list, sorted by hash, of the hash of the `FILTER`, the
offset in the defs file of the definition `GVariant` data, and its
length. Each record is padded to 12 bytes, so the file can be binary
searched to find a hash.

:::{note}
Like anagrams, we don't worry about hash collisions when
storing the index. We store all filters with the same hash together
in the same block. We then walk the whole block and look at the
stored word for each definition to see if it's one that matches the
filter we want.
:::
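
The lookup can be sketched as follows. This is one possible reading of the format, not the actual implementation: it assumes each 12-byte record is three little-endian unsigned 32-bit integers (hash, offset, length) and that colliding hashes appear as adjacent records. The name `find_block` and the sample values are hypothetical.

```python
import struct

# Assumed record layout: hash, offset, length (3 x uint32 = 12 bytes).
RECORD = struct.Struct("<III")


def find_block(index, wanted_hash):
    """Return (offset, length) for every record carrying wanted_hash."""
    count = len(index) // RECORD.size
    lo, hi = 0, count
    # Binary search for the first record whose hash is >= wanted_hash.
    while lo < hi:
        mid = (lo + hi) // 2
        if RECORD.unpack_from(index, mid * RECORD.size)[0] < wanted_hash:
            lo = mid + 1
        else:
            hi = mid
    # Walk forward over the run of records that share the hash.
    hits = []
    while lo < count:
        found_hash, offset, length = RECORD.unpack_from(index, lo * RECORD.size)
        if found_hash != wanted_hash:
            break
        hits.append((offset, length))
        lo += 1
    return hits


# Made-up index, already sorted by hash; hash 25 collides twice.
records = [(10, 0, 40), (25, 40, 16), (25, 56, 30), (99, 86, 12)]
index = b"".join(RECORD.pack(*r) for r in records)
print(find_block(index, 25))  # [(40, 16), (56, 30)]
```

The caller would then read each returned span from the defs file and compare the stored word against the filter it wants, as the note above describes.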

### Resource file: `{name}-enums.json`

This is a json file containing a list of both tags and POS names. These are stored as

## Defs creation steps

Here are the steps needed to create a new word-list:

### Definition data

First, download the raw wiktionary data from the [wiktextract
site](https://kaikki.org/dictionary/rawdata.html). There's information
there on how to generate those files, but they warn it takes a
day-and-a-half for each run so we start with the pregenerated ones.

```shell
% cd word-lists/
% curl -O https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz
% gunzip raw-wiktextract-data.jsonl.gz
```

### Update existing word-lists

To update the git files to a new dictionary (or newer versions of the
existing word-lists), simply run:

```shell
meson compile -C _build/ build-wordlist-defs
```

That will run `scripts/build-wordlist-defs.sh`.


### Adding a new word-list

This isn't really supported at this time. In principle, though, you'd
have to add a `load_wordlist()` function to `def-extractor.py` similar
to the existing ones, then extend the arg parser and
`generate_filtered_list()`. The rest of the code should work fine.

Then, from `${MESON_SOURCE_ROOT}/tools/wiktionary-extractor/` run:

```shell
% ./def-extractor.py $WORDLIST filtered-list
% ./def-extractor.py $WORDLIST enums
% ./def-extractor.py $WORDLIST gvariant-list
```

with `$WORDLIST` set to the name you gave your word-list.

## Glossary

* **Filter:** the crossword-suitable set of clusters representing a
  word. For example, `MOUSE`. Could also be `CAT?`, though we won't
  match that to a definition. This concept is shared with the
  word-list.
* **Word:** a word with a fixed spelling. Multiple words can have the
  same filter. Examples include *"aard-vark"*/*"aardvark"*, which
  share a filter of `AARDVARK`. Another example is _"A.A."_ (the
  initialism) and _"Aa"_ (the river in France). They both share a
  filter of `AA`. One word may have multiple entries.
* **Entry:** A set of semantic meanings of the word, rooted in a
  common part-of-speech. An entry can contain multiple *senses.*
* **Sense:** The essence of a word. One sense may need multiple
  glosses to describe it.
* **Gloss:** A human-parseable sentence describing a sense. It is
  often circular and may contain examples. This is a gloss.
* **POS:** Abbreviation for Part of Speech.
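
The word/filter relationship from the glossary can be sketched as below. This is a simplification under stated assumptions: it treats a filter as the uppercased letters of a word, which is enough to reproduce the examples above, while the real rules for clusters may be more involved. Both function names are hypothetical.

```python
from collections import defaultdict


def word_to_filter(word):
    """Uppercase the word and keep only its letters (simplified rule)."""
    return "".join(ch for ch in word.upper() if ch.isalpha())


def group_by_filter(words):
    """Map each filter to the list of words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[word_to_filter(word)].append(word)
    return dict(groups)


groups = group_by_filter(["aard-vark", "aardvark", "A.A.", "Aa", "mouse"])
print(groups["AARDVARK"])  # ['aard-vark', 'aardvark']
print(groups["AA"])        # ['A.A.', 'Aa']
```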
