Nutrimatic Usage Guide

What It Is

Nutrimatic is a pattern-matching word-search tool designed for puzzle solving and construction, based on a dictionary of words and phrases that commonly occur in Wikipedia, using a regular expression based pattern syntax.

Word-pattern-matching tools, like TEA (defunct, alas), OneLook, One Across, and Unscramblerer are popular and invaluable tools for puzzle solving and construction. They answer questions like "what word starts with 'kr', ends with 'w', and is six letters long?" (Krakow, capital of Poland.) Anagram tools such as the Internet Anagram Server and Andy's Anagram Solver are also useful when the order of the letters isn't fixed.

Nutrimatic is sort of like these things (especially TEA, hence the name), except:

Pattern Language Syntax

Nutrimatic's query language is based on regular expressions, in particular POSIX extended regular expressions (but without POSIX character classes).

The text (words and phrases) being matched has been normalized so that it is all lowercase and contains no punctuation -- only letters, numbers, and spaces. Most punctuation is turned into spaces, but apostrophes are simply removed, so "Fleur-de-lis" shows up as "fleur de lis", and "I'm Jack's total lack of surprise." is "im jacks total lack of surprise". Spaces are matched automatically by default; see the section on "quoted phrases" below for details.

Because only letters and lowercase letters show up in the text, uppercase letters and punctuation symbols are available to mean other things in queries. Nutrimatic adds several such extensions:

Character Classes

For convenience, some uppercase letters and punctuation marks refer to commonly used classes of characters:

A - any alphabetic character, equivalent to [a-z]
C - any consonant (including y), equivalent to [bcdfghjklmnpqrstvwxyz]
V - any vowel (excluding y), equivalent to [aeiou]
_ (underscore) - any letter or number, equivalent to [a-z0-9]
- (hyphen) - an optional space, equivalent to ( ?)
# (number sign) - any digit, equivalent to [0-9]

Note that the hyphen and the underscore (as opposed to the standard regular expression dot ".") are only really useful inside "quoted phrases", though technically they are allowed anywhere.

"Quoted Phrases"

By default, Nutrimatic allows spaces to be inserted anywhere in the expression when matching. That means that an expression like CVCVCVCVCV (five alternating consonant-vowel pairs) matches words like "literature" as well as phrase fragments like "have become" or "was used as a".

To restrict where spaces can be placed, use "quoted phrases" in the pattern. Within the quotation marks, spaces will not be inserted unless the pattern specifically allows them. So the expression "CVCVCVCVCV" only matches single 10-letter words; "CVCVC-VCVCV" matches either single words or evenly split pairs of 5-letter words; "CVCVC VCVCV" matches only pairs of 5-letter words.

Quotation marks can be used on all or part of the query, as desired.

& (Intersection)

By analogy with the standard | (alternation) operator, the & operator requires both sides to match for the pattern to match. This is useful for applying several constraints in parallel. For example, "_*a_*&_*b_*&_*c_*&_{5,}" matches single words that contain all of "a", "b" and "c" (but in no particular order) and are also at least 5 letters long overall.

Warning: The & operator can be expensive to parse. Large numbers (10+) of combined expressions can take exponentially long to process. This is not an issue in most cases.

<Anagrams>

Text inside <angle brackets> matches anything that contains the same parts but in any order. So <act> matches "act", "cat", "atc" and so on.

The parts of an anagram are normally letters, but can themselves be any regular expression (in parentheses). For example, <(ag)(m)(ra)__> matches any 7-letter word or phrase containing "ag", "m", "ra", and any other two letters. Anagrams can also be part of a larger expression, if you have partial information about the order of something, such as <aan>g<amr> which matches "anagram" but not "margana".

Remember, if you want to restrict your anagram to single words, use "quoted phrases".

Warning: Anagrams, especially large or complex anagrams (more than 10-15 letters, or using several chunks which are different but can match the same text, or are deeply nested), can be very slow to parse and to search.

Bug: The anagram algorithm isn't perfect -- when using anagrams of wildcards or pieces that aren't single letters, sometimes results will be produced that aren't actually an anagram of the input. If you're using fancy anagram expressions, double-check what you get back.

Examples from MSPH 12(3)

I'm using the Microsoft Puzzlehunt 12(3) as a source of examples mostly just because I played in it recently, the solutions are online, and there were a reasonable number of wordy puzzles.

I should emphasize that Nutrimatic does not solve these puzzles automatically, it only helps with one specific part of the "crank turning". It doesn't help you figure out what to do in the first place, and anyway most of these puzzles require many other steps where a text grep engine like Nutrimatic is unhelpful.

Camouflage

This puzzle included many lines of text like "MOCHIT HATORY" with a blank in the middle. You were instructed to fill in a letter that would make a word at least 5 letters using the end of the first part, the filled in letter, and the start of the second part. For that first row, this pattern locates all such words:

"(((((m?o)?c)?h)?i)t?)_(h(a(t(o(ry?)?)?)?)?)?&_{5,}"

Note that the puzzle directions further require "common non-plural, non-capitalized English" words, and that each letter (A-Z) be used exactly once; these constraints must be enforced by the human solver.

Triple Sec

Among the various other things in this puzzle, you frequently need to rearrange a list of letter triples into a clue with some left over:

MAY SIT TIT BLE COM IKS IAL IMB MON

Six of the triples will be used to make a crossword-type clue; the other three will be left over for future use. The boldface letters indicate the start of a word in the clue (a word break happens just before each boldface letter). This pattern finds the solution:

"<( may)?( sit)?(tit)?(ble)?(com)?(iks)?(ial)?(im b)?( mon)?>"&( _{18})

The anagram operator is used to reorder the triples. Each triple is made individually optional in the anagram, but a constraint is added to the end requiring the answer to be 18 letters long (i.e. consume 6 triples). The expression is quoted and spaces are used to indicate where word breaks are allowed (boldface letters).

The answer "mayim bialiks sitcom" is the first result. (Mayim Bialik's sitcom is "Blossom").

Dice

After assembling a cube with spots on it, part of the final step in this puzzle is to treat each side of the cube as a letter bank (such as AEHIMNPRSW) and to determine the phrases that can be formed using that letter bank. We assume the puzzle designer didn't include any extraneous letters, so this is like an anagram with repeated letters collapsed into one. This pattern finds what can be formed from one of these letter banks:

[aehimnprsw]*&_*a_*&_*e_*&_*h_*&_*i_*&_*m_*&_*n_*&_*p_*&_*r_*&_*s_*&_*w_*

This rather cumbersome expression requires a phrase made from letters in the bank and also (this is the cumbersome part) requires each letter to be used at least once somewhere in the phrase. The answer to this one, "new hampshire", comes out on top. (Turns out each of the six sides decodes to a state.)

Hollywood Walk of Fame

One of the steps in the second phase of this puzzle involved identifying a series of rebus-like drawings, and then concatenating the names of the depicted objects with one letter removed from each. For example, one of the stars contained drawings of CHARM, ELTON, CHEST, and ONE. This pattern finds what can be made from this with one letter dropped from each:

(c?h?a?r?m?&____)(e?l?t?o?n?&____)(c?h?e?s?t?&____)(o?n?e?&__)

Within each word, every letter is optional, but a constraint is also added to the subpattern requiring that N-1 of the letters be used. The answer "charlton heston" comes out on top.

Murder By Depth

After finding this puzzle in a pool of water, you need to take a number of nonsense strings like "SEPAP" and "HEGM" and "add water" by inserting the letters W, A, T, E, R -- in that order, but not necessarily consecutively -- into the given letters. This pattern finds the answer for HEGM:

<waterhegm>&_*w_*a_*t_*e_*r_*

The word is required to be an anagram of <waterhegm> with the added constraint that the letters in "water" appear in that order. The answer "wheat germ" shows up on top.

Mortal Jeopardy

This meta for the third round was quite involved, but in the end produced a series of (mostly) three-letter chunks. The series was in the right order, but the chunks were scrambled internally:

<het><ral><seg><tan><rut><bla><oody><afl><ndi><cin><awe><ter>

The answer, "the largest natural body of land in ice water", shows up first. (Interpreting it requires some cleverness by the human solver!)

How It Works (and Why It's So Slow Sometimes)

Internally, Nutrimatic uses a trie which includes every word or phrase that occurs in Wikipedia at least five times -- 23,744,883 all told, packed into a 258MB index file. Every node in the trie includes a frequency count. (For example, the prefix "th" occurs 75,979,024 times; "qu" occurs 1,446,508 times.)

When searching for a pattern, Nutrimatic uses a best-first search through the trie using a priority queue of nodes sorted by frequency. The queue initially contains only the root node. As long as the queue is not empty, the algorithm takes the most frequent node from the queue and examines it for compatibility with the search pattern. If the path to that node matches the pattern, it's printed as output. If the path to that node is a possible prefix for the pattern (not necessarily matching the pattern itself), then the node's children are all added to the queue. Then it returns to the next best node in the queue and continues the process.

Whenever a space is encountered in the search, in addition to continuing to the children of that node, the search algorithm also continues at the root of the trie, allowing two words or phrases to be spliced together if they match the pattern. A heavy frequency penalty is added for this "reset", so that phrases which occur naturally are considered before such "frankenphrases".

Because this process can involve evaluating very many nodes, and potentially having a very large priority queue of nodes to process, the search expression is converted to a deterministic finite state machine for rapid evaluation while searching. The openfst library is used to create, combine and optimize the finite state machine as the query is parsed.

The creation of the finite state machine can be slow with complex patterns. Ordinary regular expressions all compile extremely quickly, but patterns which include many "&" operators or long anagrams (which use the "&" operator internally for every part of the anagram) can result in large state machines: Processing an anagram of length N requires O(2^N) states, and if the parts of the anagram are themselves state machines the complexity is multiplied. If more than 30 CPU-seconds are consumed during this phase, the search is killed. Where possible, using more constrained, simpler patterns can help.

More commonly, the search process itself can take a long time as a large number of nodes are visited. After 1,000,000 nodes are visited, the search is terminated (with the "Computation limit exceeded" message) whether or not any results have been shown.

Because the search proceeds forward from the start of the answer string, patterns which are constrained at the beginning work much better than patterns which are constrained at the end. For example, the_* gives results very quickly; _*est takes a long time to return. The first pattern can walk the subtrie starting with "the"; the second pattern effectively walks every word and phrase in the index in frequency order, checking for each one whether it ends in "est".

Adding more constraints always helps the search -- for example, quoting the search string is good if you don't expect the answer to contain multiple words (or you know where word breaks go).

Finally, this particular instance is running on a multi-user computer, and the 250MB+ index file has a way of getting paged out when people want to, you know, read their mail or something. So that can add a few seconds when first using the service.


updated March 24, 2009 - egnor@ofb.net