Cyningstan DOS Games

Games for Early PCs

On Generating Names

Word cloud with generated names

Sunday, 14th January 2024

None of my games yet published make use of automated name generation. But it's something that I'm interested in for future projects. Procedurally generated worlds and universes rely on some way of generating names for the people and places within them. As my teenage self found out, it's no use just stringing random letters together and hoping for the best.

Once I found this out, an early attempt of mine produced names that were *nearly* good enough for a sci-fi setting. I'd still pick out letters at random, but I'd ensure that no more than two consonants or two vowels would appear in a row. Most of the names that came out were vaguely pronounceable. There were some improbable combinations that crept in, but in a sci-fi setting you can get away with those by blaming alien cultures.
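A minimal sketch of that rule in Python, for illustration only. The function name, the letter pools and the choice to treat 'y' as a consonant are all assumptions; the original program isn't shown here.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"  # assumption: 'y' counts as a consonant


def random_name(length, rng=random):
    """Pick letters at random, never allowing three vowels
    or three consonants in a row."""
    name = ""
    for _ in range(length):
        last_two = name[-2:]
        if len(last_two) == 2 and all(c in VOWELS for c in last_two):
            pool = CONSONANTS  # two vowels already: force a consonant
        elif len(last_two) == 2 and all(c in CONSONANTS for c in last_two):
            pool = VOWELS  # two consonants already: force a vowel
        else:
            pool = VOWELS + CONSONANTS
        name += rng.choice(pool)
    return name.capitalize()


if __name__ == "__main__":
    print([random_name(6) for _ in range(5)])
```

Because the unrestricted pool holds 21 consonants and only 5 vowels, the output leans heavily on consonant pairs, which is probably where many of the improbable combinations come from.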

Another algorithm I sometimes used relied on stored groups of consonants and vowels, like 'ch', 'str', 't', and 'a', 'ea', 'ui'. My names would be generated by alternating between consonant and vowel groups, adding a random vowel or consonant group to a word until a specified length had been reached. This had the advantage that you could "flavour" the output by including letter groups common in a particular language, making faux-French, faux-Klingon, faux-Dwarvish and so on.
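The alternating approach might look something like the sketch below. Only 'ch', 'str', 't', 'a', 'ea' and 'ui' come from the text above; the other groups, the function name and the coin flip for which kind of group comes first are assumptions made for the example.

```python
import random

# The first three entries in each list are from the article;
# the rest are made up to give the example some variety.
CONSONANT_GROUPS = ["ch", "str", "t", "k", "gr", "th"]
VOWEL_GROUPS = ["a", "ea", "ui", "o", "e"]


def grouped_name(min_length, rng=random):
    """Alternate consonant and vowel groups until the name
    is at least min_length letters long."""
    use_consonant = rng.random() < 0.5  # assumption: either kind may start
    name = ""
    while len(name) < min_length:
        groups = CONSONANT_GROUPS if use_consonant else VOWEL_GROUPS
        name += rng.choice(groups)
        use_consonant = not use_consonant
    return name.capitalize()


if __name__ == "__main__":
    print([grouped_name(6) for _ in range(5)])
```

Swapping in different group lists is all it takes to change the flavour of the output.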

But the best name generator I've written came from an in-house magazine for an old play-by-mail game about three decades ago. The magazine included an article responding to player questions about how the names in the game were generated, and their explanation set me on a new path in name generation.

They would split a word up into five parts. I'll use the word "custard" as an example. That would be split up into the groups "C", "U", "ST", "A", and "RD". But these aren't just vowel/consonant pairs as in my previous algorithm. You might notice the combination "RD", which would be out of place at the beginning of a word. The five groups used were as follows.

1. The "C" is the "initial consonant." This group is optional, and is omitted in words like "upwards", "earth" and "antler". It could also include "ST" (as in "stairs") but not "RD". If you wanted to add a French flavour to your generated words, you might add combinations like "D'", with the apostrophe, to this group.

2. The "U" is part of the second group, the "initial vowel." Unlike the initial consonant, this one is mandatory. All words have at least one vowel. If the word has an initial consonant group, then the initial vowel group is the vowels that immediately follow it. Since these are letter groups, this one would include combinations like "EA" from "earth" and "feather". This might be the last letter in the word, as per the "E" in "the" or the "A" in "a".

3. The "ST" is from the third group. The third and fourth groups are always together, and may be included zero or more times in a word. The third group is the "middle consonant" group. There are combinations that might appear here that you'd never see at the beginning or end of a word. The most extreme example I can think of is "NGSTR" from "angstrom".

4. "A" is from the fourth group, which, as I've said above, always follows the third group. These are the later vowels, and in some cases they may be the final letters in the word, for example the "E" in "apple" and the "EE" in "bungee". The third and fourth groups may be completely absent, as in the word "ten". Or they may appear multiple times, as in the word "fo(r a)(g i)ng."

5. The "RD" is from the fifth group, the final consonant. We've seen already that some words lack this group. It's optional but, like the initial consonant, can only appear once. "RD" is one of plenty of letter groups that might appear in the fifth and third categories, but not in the first.

The idea is that you'd take these letter groups from many different words. The algorithm to use these letter groups can be briefly stated in English as follows:

a. On a 50% chance, add one of the letter groups from the initial consonants category (#1).

b. Add one of the letter groups from the initial vowels category (#2).

c. A random number of times, possibly including none, add one of the letter groups from the middle consonants category (#3), and one of the letter groups from the later vowels category (#4).

d. On a 50% chance, add one of the letter groups from the final consonants category (#5).

From the input word "custard" alone, this could produce "u", "cu", "usta", "custa", "urd", "curd", "ustard", "custard", "custastard" and so on. But if you add letter groups from more input words, you can begin to mix the letter groups from different words together, and quickly there'll be more variation.
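Steps a to d can be sketched in a few lines of Python. This is an illustration rather than the code from the magazine or from my library: the dictionary keys and function name are made up for the example, and since step c doesn't say how the "random number of times" is chosen, the sketch assumes a repeated coin flip controlled by a hypothetical `repeat_chance` parameter.

```python
import random

# Letter groups taken from the single input word "custard".
GROUPS = {
    "initial_consonants": ["c"],
    "initial_vowels": ["u"],
    "middle_consonants": ["st"],
    "later_vowels": ["a"],
    "final_consonants": ["rd"],
}


def generate_name(groups, rng=random, repeat_chance=0.5):
    """Build one name from the five categories of letter groups."""
    name = ""
    # a. On a 50% chance, add an initial consonant group.
    if rng.random() < 0.5:
        name += rng.choice(groups["initial_consonants"])
    # b. Always add an initial vowel group.
    name += rng.choice(groups["initial_vowels"])
    # c. Zero or more times, add a middle consonant group and a
    #    later vowel group (assumption: keep going on a coin flip).
    while rng.random() < repeat_chance:
        name += rng.choice(groups["middle_consonants"])
        name += rng.choice(groups["later_vowels"])
    # d. On a 50% chance, add a final consonant group.
    if rng.random() < 0.5:
        name += rng.choice(groups["final_consonants"])
    return name


if __name__ == "__main__":
    print(sorted({generate_name(GROUPS) for _ in range(50)}))
```

With only "custard" as input, every result has the shape of an optional "c", then "u", then any number of "sta", then an optional "rd". Lowering `repeat_chance` gives shorter names.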

Because I am lazy, and laziness is a virtue in a programmer, I didn't set about reading a dictionary and building up these five categories of letter groups by hand. Instead, I wrote a program that would scan a piece of text, and build up those letter groups for me. An advantage of this approach is that if I set it reading text in another language, the output would have letter groups common in that language, and might sound a bit like that language.
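A scanner along those lines could be sketched as follows. The details are assumptions made for the example: it only looks at the letters a to z, treats 'y' as a consonant, skips words with no vowel at all, and throws away how often each group occurred.

```python
import re

VOWELS = "aeiou"


def scan_text(text):
    """Sort the letter groups found in text into the five categories."""
    names = ("initial_consonants", "initial_vowels",
             "middle_consonants", "later_vowels", "final_consonants")
    groups = {name: set() for name in names}
    for word in re.findall(r"[a-z]+", text.lower()):
        # Split the word into alternating runs of vowels and consonants.
        runs = re.findall(r"[aeiou]+|[^aeiou]+", word)
        if len(runs) == 1 and runs[0][0] not in VOWELS:
            continue  # no vowel at all, e.g. "hmm": skip it
        if runs[0][0] not in VOWELS:
            groups["initial_consonants"].add(runs.pop(0))
        groups["initial_vowels"].add(runs.pop(0))
        # What remains alternates consonant, vowel, consonant...
        # An odd run left over at the end is the final consonant.
        if len(runs) % 2 == 1:
            groups["final_consonants"].add(runs.pop())
        for consonants, vowels in zip(runs[0::2], runs[1::2]):
            groups["middle_consonants"].add(consonants)
            groups["later_vowels"].add(vowels)
    # Sorted lists are easier to inspect and to pick from at random.
    return {name: sorted(found) for name, found in groups.items()}


if __name__ == "__main__":
    for name, found in scan_text("custard apple ten angstrom").items():
        print(name, found)
```

Feeding it a longer text in another language only changes the argument to `scan_text`.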

This is basically the name generation algorithm I use now. Sometimes I put in little tweaks to the text scanning algorithm, like including a 'U' in a consonant group if it follows a 'Q', so that 'Q' will always be followed by 'U'. I sometimes treat 'Y' as a vowel or include some logic to decide whether it's used as a vowel or a consonant in the word being scanned. Sometimes, I treat 'r' as part of a vowel group if it follows a vowel. But these tweaks sometimes produce unpleasant side effects, such as 'qu' followed by 'u', or 'er' followed by 'r'.
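As an illustration of the 'QU' tweak, here is one way the splitting step could keep the 'U' with its 'Q'. It's a sketch using a regular expression, which is an assumption for the example and not necessarily how my scanner does it; the function name is made up too.

```python
import re


def split_runs(word):
    """Split a word into consonant and vowel runs, keeping the
    'u' after a 'q' inside the consonant run."""
    # The first alternative is tried first, so "qu" is swallowed
    # as part of a consonant run before "u" can count as a vowel.
    return re.findall(r"(?:qu|[^aeiou])+|[aeiou]+", word.lower())


if __name__ == "__main__":
    for word in ("queen", "squid", "aqua", "custard"):
        print(word, split_runs(word))
```

The side effect mentioned above follows directly: once "qu" is stored as a consonant group, nothing stops the generator from picking "u" as the vowel group after it.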

Another thing that might improve the algorithm is something I learned about recently, Markov chains. If I included probabilities on which letter groups came next in the scanned text, then combinations like 'qu' and 'er' would resolve themselves, and the text might generally sound more like the source language. There are downsides to that, though. The data storage requirements would balloon, and if the reproduction is too accurate, then the system might just end up regurgitating the words that it scanned.
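To show the idea, here is a rough sketch of a first-order Markov chain over letter groups. It's a simplification and not something I've built into my library: it ignores the five categories, splits words into plain vowel and consonant runs, and records only which group followed which. The `^` and `$` markers and the `max_groups` cap are assumptions made for the example.

```python
import random
import re
from collections import defaultdict

START, END = "^", "$"  # markers for the start and end of a word


def build_chain(text):
    """Record, for every letter group, the groups that followed it."""
    chain = defaultdict(list)
    for word in re.findall(r"[a-z]+", text.lower()):
        runs = re.findall(r"[aeiou]+|[^aeiou]+", word)
        for current, following in zip([START] + runs, runs + [END]):
            # Repeats are kept, so common followers are picked more often.
            chain[current].append(following)
    return chain


def markov_name(chain, rng=random, max_groups=12):
    """Walk the chain from START until END or max_groups is reached."""
    name, current = "", START
    for _ in range(max_groups):
        current = rng.choice(chain[current])
        if current == END:
            break
        name += current
    return name


if __name__ == "__main__":
    chain = build_chain("queen quiet quilt custard mustard bastard")
    print([markov_name(chain) for _ in range(8)])
```

In this sketch 'q' can only be followed by groups that followed it in the scanned text, all of which start with 'u' for English input. With a small source text the output also repeats the input words quite often, which is the regurgitation problem in miniature.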

I've built my algorithm into a library for use with future game projects. Once I've tested the library by successfully using it in one of those projects, I'll publish it with source code on the web site for others to use.

