# Word List
**Status**: Final | Implemented | March 2024

**Updated**: November 2024

The word list provides an efficient way to store and search for words
for the editor. It is not intended to provide fuzzy-, textual-, or
regexp-style matching; instead, it focuses on pure letter
matching.

## Requirements

We need to support the following use cases:

* **Three modes** of operation:
    * **MATCH:** Match by letters. For example, `"?OR??"` will
      match WORDS and CORES, among others.
    * **ANAGRAM:** Find anagrams out of a set of letters. For
      example, `CAT` will return `ACT`, `CAT`, `TAC`, and `TCA`.
    * **INTERSECTION:** Find the words that fit when two filters cross
      each other. This is used for the EditWordList.
* **Efficiency and speed.** We want to be as fast as possible when
  searching for words.
* **Multiple concurrent instances** of the word list. We have many
  `Word List` instances in the app, and use them from separate threads.
* **Internationalized** word lists. It should be UTF-8 clean.
* **Hot swappable** data sources. We want to be able to switch to
  different lists of words.
* **Letter Frequencies** of the words. This is needed for the solver.

We also need this to be as fast as possible. In the *Crossword
Editor*, we use matches to build puzzles efficiently: finding
possible solutions for sections of the puzzle, or providing hints for
clues. Currently, all lookups are synchronous, which puts an upper
bound on the time of each operation. We are currently able to do a
lookup in less than 10 microseconds.

We never need to return a list of words of mixed word *LENGTH*. Lists
are _always_ bucketed by word length. For example, we never have a
list with both `CAT` and `MOUSE` in it. Also, when we talk about length, we
mean the number of code points for a given word, and not the number of
bytes. So `AÑO` has a length of three, despite being four bytes long
(encoded in UTF-8).

## Overall approach
We store the word data in an `mmap()`-able file. This file is
generated at compile time, and loaded as a `GResource`. The file
consists of five distinct blocks of data, each described in greater
detail down below:

1. **Word List** – A list of all the words in the data set. These
   words are bucketed by word length in different subsections. Within
   each bucket they are stored first by priority, and then
   alphabetically.
1. **Filter Fragment List** – A list of words, indexed by letter,
   sorted by word length. This is used to MATCH filters. For example:
   The three letter word subsection would have all the words that
   match `"A??"` followed by all the words matching `"B??"`, all the
   way up through all the words matching `"??Z"`.
1. **Hash List** – A sorted table of hashes of each word's sorted
   letters. This is used by the anagram mode.
1. **Letter Frequency List** – A precomputed list of letter
   frequencies in the word list. Used for autosolve.
1. **Index Section** – The table of contents for the file. It will
   have byte offsets for each section and subsection of the file. It's
   located at the end of the file, and is stored as a JSON block.

Everything in the list will be indexed by a byte offset into the data
file. Some of the byte-offsets are absolute; some are relative. We
mmap the file and seek through it to find what we need.

### MATCH mode approach

When in MATCH mode, we decompose the given filter into different
filter fragments. A filter fragment is a filter pattern with a single
character listed. We precalculate the list of words associated with
each fragment in the **Filter Fragment List**.

So, for example, consider the filter `??MO??T`. For this filter, we
break it up into three filter fragments: `??M????`, `???O???`, and
`??????T`. We build a list out of the intersection of each filter
fragment's words. This intersection is quick, as each list is
sorted, so we only have to take one pass through each fragment's
list.

As special cases, we can bypass the intersection when our filter has 0
or 1 characters. When the filter has only `?`s, we can just use the
_Word List_ bucket for that word length. When we are looking for a
filter pattern with one character selected (such as `??M????`), the
_Filter Fragment List_ can be used directly.

### ANAGRAM mode approach

In order to find all the anagrams of the filter, we first sort each
word's letters and hash the result. For each hash, we store a list of
the words that share that hash. That gives us a quick and efficient
way of finding all anagrams for a word. There are hash collisions in
our word list, but not very many. That means that when we look up the
words for a hash, we need to double-check that the results are
anagrams.

:::{note}
One tradeoff of this approach versus a traditional
trie-based approach is that we can't easily handle anagrams with
unknown letters. That means that we can't easily generate a list of
words whose anagrams match `CDO?`.
:::

### INTERSECTION mode approach

The intersection mode is significantly more complex than either of the
other two modes. It takes in two filters, and takes a position within
the two filters that indicates where they cross. Consider the
following grid fragment:

|<!-- -->|<!-- -->|<!-- -->|<!-- -->|
|---|-----|---|---|
| # |  E  |   |   |
|   |  ?  |   |   |
| D |**?**| ? | ? |
|   |  ?  |   |   |
|   |  Y  |   |   |


We would pass in `E???Y` with a position of 2 for the first filter,
and `D???` with a position of 1 for the second. This operation will
calculate the possible characters that can go in the crossing cell (in
bold), as well as the two lists of possible words for those two
filters.

In this example, 'J' is a possible crossing character (for `ENJOY` and
`DJIN`), but 'V' is not.

## API
WordList is a `GObject` that implements a `GListModel`-like API. It
has a stateful "filter" property that determines what words it makes
available. The filter defaults to `""`, which yields the empty word
list.

The value of the filter depends on the mode of the word list. For
example, question marks are not parsed as unknown letters in ANAGRAM
mode.

```c
typedef enum
{
  WORD_LIST_NONE,
  WORD_LIST_MATCH,
  WORD_LIST_ANAGRAM,
} WordListMode;

WordList    *word_list_new                  (void);
void         word_list_set_filter           (WordList      *word_list,
                                             const gchar   *filter,
                                             WordListMode   mode);
guint        word_list_get_n_items          (WordList      *word_list);
const gchar *word_list_get_word             (WordList      *word_list,
                                             guint          position);
gint         word_list_get_priority         (WordList      *word_list,
                                             guint          position);

gfloat       word_list_get_char_frequency   (WordList      *word_list,
                                             guint          word_len,
                                             guint          letter_position,
                                             gunichar       c);
void         word_list_find_intersection    (WordList      *word_list,
                                             const gchar   *filter1,
                                             const guint    pos1,
                                             const gchar   *filter2,
                                             const guint    pos2,
                                             IpuzCharset  **intersecting_chars,
                                             WordArray    **filter1_words,
                                             WordArray    **filter2_words);
```

The priority of each word is a value between 0 and 255, and defaults
to 50.

Like the `GListModel` interface, this lets the user get the number of
words (per filter) and get the word/priority for each
position. However, unlike `GListModel`, it doesn't emit signals and
doesn't return GObjects, for performance reasons. Wrapping it is a
relatively simple task, and `WordListModel` is a wrapper around
`WordList` that provides the full `GListModel` interface.

Note that `WordList` gives completely stable results. It will always
return the same answers for a given filter. This means that you can
set the filter to be some value and iterate through the items, change
the filter, and then set it back and continue your iteration.

## Data Sections

As mentioned, the overall resource file is divided into four sections
of data:

* **Word List**
* **Filter Fragments**
   * **Letter List**
   * **Filter Index**
* **Anagram Hash Table**
   * **Anagram Words**
   * **Anagram Hash Index**
* **Letter Frequency Table**

In addition, there's an index section that's included with the
`GResource`.

Each is described in detail below.

:::{important}
Each section is defined to be stored in little-endian byte
order.
:::

### Word List Sections
This block stores all the words along with their priority. The block
is divided into multiple _Word List Sections_ – one for each word
length. So, for example, there's a section for all the words that are
three characters long, followed by a section for all the words that
are four characters long, etc.

Each word entry in a section consists of a 1-byte priority stored as
an unsigned char, followed by a UTF-8 string terminated by the null
character. Each entry is padded out with '\0' characters to fit the
longest word (byte-wise) within the section; this padded width is
stored as STRIDE. See the charset section below for a little more
clarification.

### Filter Fragment section
The _Filter Fragment_ section contains two blocks of memory. The first
block contains the _Letter Lists_, which are essentially a big block
of gushorts: the `WordIndex` indexes for every filter fragment,
concatenated together. The block carries no internal structure, so
nothing about it can be calculated from the block alone. The lists are
in a predetermined order, but there's nothing special about that
order.

The _Letter List_ is also large; we store a `gushort` per letter per
word, which means it is twice as big as the _Word List_ (for
English). As an example, for the fragment `??T` we store the offset of
every word in the three-letter `Word List` block that matches that
pattern. That will be immediately followed by a list of all the words
that match `??U`, and so forth.

The second part of the _Filter Fragment_ section is the _Filter Index_
for the _Letter List_ block. This is a geometrically increasing set of
offsets that determines which section of the _Letter List_ contains a
given list of words. We store a guint for the byte offset of each
list, and a gushort for the length of each list.

The _Filter Fragment_ section is confusing to describe, so here's an
example. Consider the query `"??T"`:

```
Start at Length = MIN-LENGTH

   +-+          +-+
   |A|x25 more  |A|x25 more
   +-+          +-+
Length = 2 (Offset = 0)

                             / "??T" IS IN HERE
   +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+
Length = 3 (Offset = 52 * 6)

   +-+          +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+          +-+

Length = 4 (Offset = 130 * 6)

...

Continues up to Length = MAX-LENGTH
```

In the example above, the query `"??T"` is found at offset 738
(entry 123 * 6 bytes). The 6 bytes there contain the guint offset of
the fragment's list within the _Letter List_, and a `gushort` listing
the length of that list.

#### Assumptions
We're storing the length of each filter fragment section in the word
list as a gushort, so we assume that no fragment list is longer than
65,535 words. So far that's true with the lists we're using.

### Anagram Hash Table
The _Anagram Hash Table_ is used to store hashes for looking up
anagrams. It is structured similarly to the _Filter Fragment_ section:
first, there is the _Anagram Words_ block which has the link between
hashes and words. Then, there's the _Anagram Hash Index_ which is used
to actually look up the hashes and find the index in the _Anagram
Words_ section. These are described below:

#### Anagram Words
This is a block of gushorts, with each being the index of a
`WordIndex`. Each index is included exactly once.

#### Anagram Hash Index
This section maps each anagram hash to its entries in the _Anagram
Words_ block. It is structured as an array, with each block being 9
bytes long. The block contains three components:
* **OFFSET:** The offset from the base of the _Anagram Words_ block
* **HASH:** The hash of the sorted characters of each word
* **LEN:** The number of entries in the hash's list. For most words,
  which have no matching anagram, the length is 1.

This list is sorted by hash value, and the expectation is that one can
do a binary search through the array to find the hash.

```
+--------------+----------+------+---
| guint        | guint    |guchar| (REPEATED)
+--------------+----------+------+---
  OFFSET         HASH      LEN
```

:::{note}
There's room for optimizing this structure significantly
if file size is a problem. In addition, the 9 byte block leaves us
unaligned in the data structure. Some quick tests show that
alignment doesn't really matter on current Intel processors.
:::

### Letter Frequency Section

The letter frequency section stores the frequency of each letter
at each position for each word length. It uses the same geometrically
increasing layout as the filter fragment section, so each
letter/position location is predictable. The frequencies of all
letters at a given position sum to 1.0.

This frequency is used by the word solver to pick a best next letter
to try, and is fairly niche.

### Index Section

:::{note}
This section is currently included as a separate file
within the `GResource`.
:::

The _Index Section_ is a separate file in the GResource. Data we store
in the index:

* Valid character list found in the word list. This can be used for
  building a filter, and is stored as an array of unichars
* Meta information about the data that's stored
* Locations for each word list. For example, the location for all
  2-letter words, all 3-letter words, etc.
* Location of the filter fragment and anagram hash sections

#### Example Index
```js
{
  "display-name": "Peter Broda's Wordlist",
  "charset": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "filterchar": "?",
  "min-length": 2,
  "max-length": 21,
  "threshold": 50,
  "visibility": "editor",
  "letter-list-offset": OFFSET,
  "letter-index-offset": OFFSET,
  "anagram-word-list-offset": OFFSET,
  "anagram-hash-index-offset": OFFSET,
  "anagram-hash-index-length": LENGTH,
  "frequency-table-offset": OFFSET,
  "words": [[2, STRIDE, OFFSET, LENGTH],
            [3, STRIDE, OFFSET, LENGTH],
            [4, STRIDE, OFFSET, LENGTH],
            ...
            [21, STRIDE, OFFSET, LENGTH]]
}

```

* CHARSET is a unicode string, UTF-8 encoded.
* STRIDE is the maximum width (in bytes) of each word in the table (see note below).
* OFFSET is an unsigned int with the number of bytes into the file the table starts.
* LENGTH is the number of entries in the table.
* VISIBILITY is a string. It can be `editor`, `player`, or `test`.

## Internal representation

It's really easy to get confused among all the internal tables. As a
result, we use the following structs as our primary references when
looking things up.

```c
typedef struct
{
  gint length;
  gint index;
} WordIndex;

typedef struct
{
  gint length;      /* Length of the word */
  gint position;    /* Position of the character within the word */
  gint char_offset; /* location within the charset of the word */
} FilterFragment;
```

The WordIndex gives us a reference to a word (and its priority). We
can look up a word with just these two fields. It is public and
stable: we can look up a word by WordIndex, as well as find the
WordIndex of a given word (if it exists).

The FilterFragment gives us an individual part of a filter, with only
one character selected. So "???G????" is a filter fragment, and
"???G??N?" is made out of two fragments combined. We can look up a
list of WordIndexes for a given Filter Fragment.

## Charsets and Internationalization
We have made every effort to make this UTF-8 clean / Unicode-aware. We
normalize every word using [Normalization Form
C](https://unicode.org/reports/tr15/). For English, this has the
practical result that every character is uppercase.

The charset is important for calculating the filter table. We index
characters based on their offset within the charset. For the default
English charset, "A" == 0, "B" == 1, etc. For other languages,
accents and other marks are considered independent characters. So the
charset for Spanish would have both N and Ñ in it, as well as accented
versions of all the letters.

The difference between word length and stride is not always clear. An
example is the French word `PÂTÉ`. It has a LENGTH of 4 and a STRIDE
of 6. The `Â` with the circumflex is considered an independent
character, and is not stored along with an unaccented `A`.

## Wordlist sources

There are two word lists we use: the Peter Broda wordlist and the
Wordnik wordlist.

### [Peter Broda's word list](https://peterbroda.me/crosswords/wordlist/?fbclid=IwAR04CeR_nhEW5M7CoK6Pyc3lxtzAlD9i9nk6pYadGXWtWN9pTBNWvHCE2hk)

This word list is long and has plenty of dubious words, but it's a
good starting point. It was actively maintained and updated
regularly, though it had an outage in early 2024. We have a
relatively recent copy of it. Its main advantage is that it provides
enumerations and some scoring. The scoring is dubious, though; I
wouldn't score the words the way that this list has done. It also has
a high count of profane words with a high score.

This file is mostly clean. It's plain ASCII with \r\n line breaks and
mixed case. There are a few entries with non-letters in them, and we
just discard those (for example, _MIA!;50_ has an exclamation point in
it). There's no explicit license in the file but permission is granted
on the website:

> You are free to use this list in any way you'd like. This includes
> commercial uses, though I'd appreciate it if you didn't just turn
> around and try to sell it (but I mean, I'll still offer it for free
> to anyone so that wouldn't be a smart business venture anyway).

### [Wordnik word list](https://github.com/wordnik/wordlist)

The wordnik wordlist is also included. It has no phrases and fewer
words than the Peter Broda list, but that might be better for some
puzzles. It's effectively the union of the US/UK scrabble
dictionaries, plus some extra words. It's stored as a list of
lowercase words in quotes with a newline between them.

It is available under an MIT license.

## Areas for improvement

* When doing the intersection of the filter fragments, we go through
  them sequentially in the order of the filter. So, if we have a filter
  of "S???AZ", we start with the "S?????" filter and then do an
  intersection with "????A?". Both lists are quite long. We then
  intersect with "?????Z", which has far fewer words (~20).

  We should sort the filters from shortest to longest. We are likely
  to fail faster in the case when there are no matches, and possibly
  have a very short list to intersect against (which could end early).

  We could also experiment with doing a binary search to find sections
  to skip.

* The ANAGRAM and FILTER modes both set the `WordList` to a mode where
  you set the filter, then query the object. On the other hand,
  INTERSECTION returns a pair of `WordArrays` that operate independent
  of the `WordList` internal state. This should be reconciled.

* Change all the offsets to be relative in the code. That would let us
  install each section in its own file in the future.

* Fix the endianness when parsing. The data is now encoded as
  little-endian and stored in git, so we will have to byte-swap on
  big-endian machines. Currently, we disable s390 support, as that
  code hasn't been written yet.

