
Crafting a Custom Dictionary

Written by Jeff Bachiochi

Using Liberty BASIC

Word processing and spell-checking have come a long way since the days of typewriter ink and printed dictionaries. As a part of an ongoing project, Jeff uses the Liberty BASIC programming language to write an application that creates multiple length-based dictionary files from a standard dictionary.

  • How to write an application that creates multiple length-based dictionary files from a standard dictionary

  • How do .dic dictionary files work?

  • How do .aff language files work?

  • How to enable dictionary expansion

  • What does the ProcessEntry routine do?

  • Liberty BASIC

I have fond memories of my dad, Harry. He was a postman, and walked a mail route in town for most of his life. During the 1950s, when he needed to prepare a document, he would type it using a manual typewriter (Figure 1). There was no correction tape at that time, and being the perfectionist that he was, he would toss out the sheet of paper with the error, feed in a fresh sheet and reenter the entire document. I have a genealogy document of my grandmother Alice’s ancestry that he typed. While I’ve since scanned this in, so I could save it as a digital file, it isn’t the words that make this special for me. It is the hand-typed sheets with the quirky characters that make it unique. Individual letter hammers would eventually become skewed a bit, and this misalignment would occur throughout the document. Imperfections such as these became fingerprints that could be used to confirm that a particular typewriter had produced a document.

FIGURE 1 – The Royal typewriter my Dad used for all his communications was made right here in Connecticut. A sheet of paper was rolled into place on the traversing carriage. Each key press caused a lever with its associated embossed letter to strike an inked ribbon, making an imprint upon the paper. The carriage automatically moved left with each key press. When the carriage reached the right margin, the lever on the left had to be moved manually (a “carriage return”) to bring it back to the right and index the paper up one line.

Born in West Hartford, Connecticut in 1758, Noah Webster grew to believe in the developing cultural independence of the United States, including its distinctive American language with its own idioms, pronunciation and style. In 1806 Webster published A Compendious Dictionary of the English Language, the first truly American dictionary. In 1831, brothers George and Charles Merriam opened a printing and bookselling operation in Springfield, Massachusetts. And in 1843, after Noah’s death, their company, G. & C. Merriam Co., purchased publishing rights to Webster’s dictionary. G. & C. Merriam Co. was renamed Merriam-Webster, Inc. in 1982, and is known today as America’s most trusted provider of language information. It’s been publishing in print continuously for more than 150 years—and now online as well.

As an American adult, you may know roughly 40,000 words of the 400,000 in use today. This number is hard to pin down due to slang and jargon—those words used only by people in their specific profession. My 1975 Webster’s Concise Family Dictionary has 57,000 “core” words listed. A core word is the basic or essential part of a word without prefix or suffix. This is an important bit of information, as you will soon see.

ENTER THE PC

The 1970s brought on the Personal Computer or PC—the TRS-80 Model 1, the Commodore PET, the Apple II—and in the early 1980s, the IBM 5150. Applications were pretty much limited to whatever you could write. It certainly was a learning experience. The first word processing program, Electric Pencil, was released in 1976. The typewriter had met its match, except in our house. My dad just couldn’t accept the concept. I couldn’t blame him at the time, because my TRS-80 used cassette tapes for storing programs and data—not the most robust storage device.

Luckily, floppy disk drives, hard disks, CDs, DVDs and thumb drives were developed in rapid progression. Today’s word processors not only make writing and editing documents easier, but they also include some sophisticated add-ons, including grammar- and spell-checking. This brings us to the topic of this article: words and spell checking.

There is no special hardware necessary for this project, only a PC (which is technically hardware, but we’re not building any hardware). I’ll be using the Liberty BASIC programming language to write an application that creates multiple length-based dictionary files from a standard dictionary. These files will be used by another application I am writing, but that’s another story. The code for this article is available for download from Circuit Cellar’s article code and files webpage.


.dic FILES

I thought this project would be a no-brainer, just a text file listing all the possible words that the dictionary covers! A .dic file is a dictionary file that contains a long list of words for a specific language, and typically is used by word processors—such as Microsoft Word, OpenOffice or LibreOffice—to check the spelling in a document or provide correctly written alternatives for misspelled words. This brings us back to the term: “core” word.

Dictionary entries, like the ones found in the Merriam-Webster dictionary, contain the core word (usually presented in boldface) followed by information pertaining to pronunciation, function, inflection, usage, synonyms, combination forms and more. It is the combination forms that we are interested in here. Combination forms include prefixes and suffixes that may be added to the core word. These allow the number of word entries to be reduced because the combination does not necessarily call for its own entry. An example is the word “relegate” and its combination forms “relegation,” “relegated” and “relegating.”

In an effort to reduce the size of a dictionary (.dic) file, only core words are listed. Combination forms are identified with a special coding added to the end of each core word. The format of this coding is a “/” suffix, where the forward slash acts as an escape character and is followed by any number of coded alpha characters. Let’s look at one dictionary entry, abandon/LGDRS. By itself, this .dic entry identifies the core word, “abandon.” However, there is no indication of what the escape-coded suffix means. The .dic file must be accompanied by an additional file. The .aff (affix) file contains the rules for all the possible escape-coded characters. Before looking at the .aff file, I want to finish with the .dic entry format. The termination character for each core word entry is an LF (linefeed, 0x0A) character. The first entry of the file contains the number of core entries in the file. This allows the file user to know how many entries there are in the file without having to run through the file and count them.
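To make the entry format concrete, here is a minimal sketch of reading it. It is written in Python rather than the Liberty BASIC used for the actual application, and the function name parse_dic and the sample text (apart from abandon/LGDRS) are made up for illustration.

```python
def parse_dic(text):
    """Split .dic text into (declared_count, [(core_word, flags), ...])."""
    lines = text.split("\n")           # each entry is terminated by LF (0x0A)
    declared_count = int(lines[0])     # first entry: number of core entries
    entries = []
    for line in lines[1:]:
        if not line:
            continue                   # skip the empty piece after the final LF
        core, _, flags = line.partition("/")   # flags is "" when there is no "/"
        entries.append((core, flags))
    return declared_count, entries


# Made-up sample: a count line followed by three entries.
sample = "3\nabandon/LGDRS\nrelegate/DG\ncat\n"
count, entries = parse_dic(sample)
print(count, entries)
```

The count on the first line is simply trusted here; a more careful reader could compare it against len(entries).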

.aff FILES

With each language requiring a different character set to represent its written word, standards have been established that define 8-bit character sets for groups of languages. The ISO/IEC 8859 family covers those based on Latin script, among others, and the Internet Assigned Numbers Authority (IANA) maintains the registry of character set names. No single 8-bit set can hold every character, so the family is split into numbered parts that each cover a different group of languages. When you choose a particular dictionary, it usually identifies a specific language, for instance, the en_US.dic one that I’ll be using. The associated .aff file, en_US.aff, begins with the character set definition as SET ISO8859-1. This is “Latin alphabet no. 1,” consisting of 191 characters from the Latin script. Note that the French definition SET ISO8859-15 replaces just 8 characters, which include “€,” the euro character.

I will be covering the rules for the combinatorial additions to core words in this discussion. The .aff file also contains additional commands that make the dictionary more useful, such as suggestions for misspelled words. I suggest you look into this further if you are interested in its other uses. For this project I am only interested in the prefix and suffix rules that can be used to form additional words from the core.

Each rule contains a command that begins with SFX (suffix) or PFX (prefix). Both of these commands take the same format and begin with a four-field entry that can be considered the ID command. The first field will be either SFX or PFX, identifying the affix type. The second field holds the escape-code character with which this rule is associated. The third field is Y or N and determines whether or not this affix can be combined with an affix of the other type (a prefix together with a suffix) on the same core word. The last field is the number of rules (1-n) that follow this command.

Next, there is at least one rule command, as defined by field 4 of the ID command. Each rule command has five fields. Fields 1 and 2 are the same as in the ID command, so we know it belongs to that ID. Now here’s where it gets interesting. Field 3 is a character or string of characters that must be removed from the core word before the affix is added. A “0” here indicates no characters need to be removed. Field 4 is a character or string of characters (the affix) to be added to the core word.

Field 5 is the rule that must be met to perform this addition. The rule is a character or string of characters that must match the first characters of the core word for a prefix, or the last characters of the core word for a suffix. Note here that characters within square brackets “[ ]” are all possible matches for a single letter position. Note also that a “^” (the caret character) at the start of a bracketed group makes it negative: none of the listed characters may be present at that position. Let’s take a look at an example using abandon/LGDRS. The first escape character after the “/” is “L.” In the en_US.aff file we find the commands for escape “L”:

SFX L Y 1
SFX L 0 ment .


We find that the suffix rule for “L” can have compound affixes and has one rule associated with it. The rule says no characters need to be removed, the affix is “ment,” and the “.” signifies that there are no rules that need to be met. Therefore, this rule creates a new word: “abandonment.”
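The ID command and its rule commands can be read into a small data structure. The following is only a sketch in Python (not the Liberty BASIC application itself). The names AffixRule and parse_affix_block are hypothetical, and it assumes fields are separated by whitespace, as they are in the lines shown above.

```python
from dataclasses import dataclass


@dataclass
class AffixRule:
    kind: str        # "SFX" or "PFX"
    flag: str        # escape-code character, e.g. "L"
    strip: str       # characters to remove ("" when the file says "0")
    add: str         # the affix to attach
    condition: str   # "." means nothing needs to be met


def parse_affix_block(lines):
    """Parse one ID command followed by the rule commands it announces."""
    kind, flag, combine, count = lines[0].split()
    rules = []
    for line in lines[1:1 + int(count)]:
        r_kind, r_flag, strip, add, condition = line.split()[:5]
        rules.append(AffixRule(r_kind, r_flag,
                               "" if strip == "0" else strip, add, condition))
    return {"kind": kind, "flag": flag, "combine": combine == "Y", "rules": rules}


block = parse_affix_block(["SFX L Y 1", "SFX L 0 ment ."])
rule = block["rules"][0]
new_word = "abandon" + rule.add   # nothing to strip and nothing to check for "L"
print(new_word)
```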

The next escape character is “G” and we find this in the .aff file:

SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]

We find that the suffix rule for “G” can have compound affixes and has two rules associated with it. The first rule says an “e” must be removed, the affix is “ing,” and “e” signifies that there must be an “e” at the end of the core word for the rule to be met. Since “abandon” has no “e” at the end, no new word is created. The second rule says no letters are to be removed, the affix is “ing,” and “[^e]” signifies that there must not be an “e” at the end of the core word for the rule to be met. Since “abandon” has no “e” at the end, this rule creates a new word: “abandoning.”

The next escape character is “D” and we find this in the .aff file:

SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y

We find that the suffix rule for “D” can have compound affixes and has four rules associated with it. The first rule says no letters need be removed, the affix is “d,” and it is added if the last letter in the core word is “e.” It is not, so there’s no new word for this rule. The second rule says remove the “y” and add the affix “ied” if the next-to-last letter is not “a,” “e,” “i,” “o,” or “u,” and the last letter is “y.” This does not match the last letters of “abandon,” so no new word. The third rule says nothing is removed and the affix “ed” is added if the last letter is neither “e” nor “y” (the bracketed group “[^ey]” stands for a single character). The last letter of “abandon” is “n,” so a new word is formed: “abandoned.” The last rule says no letters need be removed, the affix is “ed,” and it is added if the next-to-last letter in the core word is “a,” “e,” “i,” “o,” or “u,” and the last letter is “y.” It is not, so there’s no new word for this rule.
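Because the bracket notation in field 5 looks so much like a regular expression, one way to test the “G” and “D” rules is to hand the condition to a regex engine and anchor it at the end of the word. This is a Python sketch under that assumption, not the approach taken in the Liberty BASIC application, and the names RULES and apply_suffix_rules are hypothetical. Each bracketed group is treated as one letter position.

```python
import re

# (strip, add, condition) copied from the SFX G and SFX D rules above;
# a "0" in the strip field is stored as an empty string.
RULES = {
    "G": [("e", "ing", "e"), ("", "ing", "[^e]")],
    "D": [("", "d", "e"), ("y", "ied", "[^aeiou]y"),
          ("", "ed", "[^ey]"), ("", "ed", "[aeiou]y")],
}


def apply_suffix_rules(word, flag):
    """Return every word the suffix rules for this flag can form."""
    results = []
    for strip, add, condition in RULES[flag]:
        if not re.search(condition + "$", word):   # must match the word's ending
            continue
        stem = word[:len(word) - len(strip)] if strip else word
        results.append(stem + add)
    return results


for core in ("abandon", "relegate", "carry", "play"):
    print(core, apply_suffix_rules(core, "G"), apply_suffix_rules(core, "D"))
```

For any one word, exactly one of the four “D” conditions can match, which is why each core word above yields a single past-tense form.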

The final two escape characters, “R” and “S,” are similar rules that create the additional words “abandoner” and “abandons.” So, the dictionary entry abandon/LGDRS can be expanded with five additional entries, “abandonment,” “abandoning,” “abandoned,” “abandoner” and “abandons,” according to the rules of the associated .aff file.

LIBERTY BASIC

I’ve promoted using Liberty BASIC in the past for other projects. While Liberty BASIC enables users to present an interface that can rival any Windows application, it also enables quick and dirty processing without the need to design any fancy interface. A simple text output frame is available for simple applications. The present application makes use of this feature to give the user as much or as little feedback as required for the job.

This application will open up the .dic file and process it, but will skip saving any data. That’s because at this point, we don’t know what size string arrays need to be. Although we will read-in words from a single library, I want to output separate libraries based on the number of characters in the word. I want to end up with separate dictionaries containing words of the same character length.

Because the format requires the first entry to hold the total number of words in the file, and a sequential file only allows appending to the end of a file, there is a catch-22 here. I don’t know what that number will be until I have processed the .dic file. So, I have two options: (1) I can read the .dic file once to count the entries, write the count out to a new file and then read the input file a second time to write the words to each separate output file. Or, (2) I can read the .dic file once to process the input file into each separate output file, and then read each separate file, count the words and output a second file for each word length, with the first entry as the count followed by the words.

Given that most of the time will be spent in expanding the dictionary, we’ll go with the second option. This also takes fewer resources because we won’t need any large arrays to store data; each word will be processed one at a time. Since I don’t necessarily know what the longest word will be, I’ll be opening 50 output files to handle words 1-50 characters in length. Figure 2 shows a flow chart of the main loop of the application. You’ll find two loops in the flow. The first is responsible for taking each word (file entry) in the .dic file and pulling off any “/” suffix, then saving the core word and any other words based on the “/” suffix. The second loop merely takes each Temp library file and resaves it, with the first entry being the number of entries in the file.

FIGURE 2 – This application to expand a dictionary file (.dic) uses two loops. The routine ProcessEntry determines the “legality” of each word in the (.dic) dictionary file, expands it according to the rules appended to each and saves the core word and all expanded results to separate temporary files based on word length—keeping track of the word count for each file. The second loop creates new dictionary files for each temporary file. Each new dictionary file begins with a word count for that file, followed by the word list. Empty files are eliminated.

Let’s look at the second loop first, since it’s the simplest. Because I had no idea what the longest word would be, I chose 50 as the longest possible word in the original dictionary. So, I have 50 Temp files to process. The first loop has already done the work, so I now know how many entries are in each file. These running totals were updated in the array Index(n), where n runs from 1 to 50. Therefore, for each Temp file (1-50), I need to check Index(n) to see if that Temp file n has any words saved into it. If it has no words, the Temp file is deleted and we can go on to the next Temp file n+1. If the Temp file has words, we open a new Dictionary file n, save the number in Index(n) and then all of the words from Temp file into the new Dictionary file n. We now end up with a new dictionary file, which holds all the words of the same length.
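The second loop can be sketched roughly as follows. This is Python rather than Liberty BASIC, and the file names (temp3.txt, words3.dic and so on), the function name build_dictionaries and the demo words are all assumptions made for the example.

```python
import os
import tempfile

MAX_LEN = 50  # longest word length handled


def build_dictionaries(folder, index):
    """Turn each non-empty Temp file into a dictionary file that starts with its count."""
    kept = []
    for n in range(1, MAX_LEN + 1):
        temp_path = os.path.join(folder, f"temp{n}.txt")    # hypothetical name
        if index[n] == 0:
            os.remove(temp_path)          # no words of this length: delete the Temp file
            continue
        dict_path = os.path.join(folder, f"words{n}.dic")   # hypothetical name
        with open(temp_path, encoding="latin-1", newline="") as src, \
             open(dict_path, "w", encoding="latin-1", newline="") as dst:
            dst.write(f"{index[n]}\n")    # first entry: the word count
            dst.write(src.read())         # then every word from the Temp file
        kept.append(n)
    return kept


# Demo with made-up words: only lengths 3 and 7 are populated.
index = [0] * (MAX_LEN + 1)
with tempfile.TemporaryDirectory() as folder:
    for n in range(1, MAX_LEN + 1):
        words = {3: ["CAT", "DOG"], 7: ["ABANDON"]}.get(n, [])
        with open(os.path.join(folder, f"temp{n}.txt"), "w",
                  encoding="latin-1", newline="") as f:
            f.write("".join(w + "\n" for w in words))
        index[n] = len(words)
    kept = build_dictionaries(folder, index)
    with open(os.path.join(folder, "words3.dic"),
              encoding="latin-1", newline="") as f:
        three_letter = f.read()

print(kept)
print(repr(three_letter))
```

Keeping the counts in index while the Temp files are written is what lets this pass write the count line first without re-reading each file to count it.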

Let’s now look at the process of expanding the core word based on the “/” suffix from the original dictionary file.

PROCESS ENTRY

The ProcessEntry routine is the heart of this application. It is based on two factors: rules that I need, and rules the .aff file calls for. My rules are simple. All words must only contain capital letters A-Z or the apostrophe character. This means I will throw out anything that contains a number or some other character. Lowercase letters are replaced with uppercase letters.
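These “legality” rules fit in a few lines. The sketch below is Python, not the Liberty BASIC routine. The name normalize_word is hypothetical, and it assumes the check is applied to the word itself (with any “/” codes already split off) and that the apostrophe is the plain ASCII one.

```python
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ'")


def normalize_word(word):
    """Return the word in uppercase, or None if it contains anything but A-Z or '."""
    upper = word.upper()                       # lowercase letters become uppercase
    if upper and all(ch in ALLOWED for ch in upper):
        return upper
    return None                                # digits, hyphens, accents, etc. are rejected


for candidate in ("abandon", "o'clock", "1st", "co-op"):
    print(candidate, "->", normalize_word(candidate))
```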


The .aff file rules are a bit more difficult. From the discussion earlier, we find that the .dic file contains a list of core words. To reduce the size of the .dic file, only core words are listed. If a core word, for example, “do” can be part of another word that has a prefix such as “un-” or a suffix such as “-ing,” then the core word has a “/” (slash) appended, followed by alpha characters, and each character indicates a rule that is defined in the .aff file.

The first thing we must do is to determine if the core word has a “/” appended to it. If there is no “/” in the word entry, then this is a core word, and we can call SaveEntry. SaveEntry writes the word to the Temp file n, where n = the length of the word (including any apostrophe). Index(n) is incremented, which keeps the running tally of the number of words in that file.

When a “/” has been found appended to a core word, all characters after the “/” are saved into the string variable AFFIX$. The variable TempWord$ will now hold the core word without the slash and following characters. We’ve broken the word entry into its core word, TempWord$, and the rules that pertain to it, AFFIX$. Now we need to save the core word (SaveEntry) and loop through all the rules in the variable AFFIX$.
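The split-and-save step might look like this in outline. It is a Python sketch, not the Liberty BASIC source. Lists held in memory stand in for the Temp files, and the names save_entry, process_entry, temp_files and index are illustrative only.

```python
from collections import defaultdict

temp_files = defaultdict(list)   # stand-in for Temp file n (kept in memory here)
index = defaultdict(int)         # index[n]: running count of words of length n


def save_entry(word):
    """Append the word to the bucket for its length and bump that bucket's count."""
    n = len(word)                # the length includes any apostrophe
    temp_files[n].append(word)
    index[n] += 1


def process_entry(entry):
    """Split an entry at the "/" and save the core word; return (core, affix codes)."""
    temp_word, _, affix = entry.partition("/")   # affix is "" when there is no "/"
    save_entry(temp_word)                        # the core word is always saved
    return temp_word, affix


core, affix = process_entry("ABANDON/LGDRS")
process_entry("CAT")
print(core, affix, dict(index))
```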

Next, we need to pull apart the variable AFFIX$. We will call the rule routine ExpandEntry once for each character in the variable AFFIX$. We need to loop through all the characters in AFFIX$ and assign the variable Key$ to each character before calling ExpandEntry. I use the select case Key$ command to find the rule associated with that character. So, this routine is a long list of cases that allow the program to go right to the appropriate rule, based on the character from AFFIX$. Figure 3 and Figure 4 show just a few of these rules, so you can get a feel for what’s happening here. The rule for escape character “A” is the prefix “RE-.” It has no conditions, so:

DictionaryWord$ = "RE" + TempWord$

Like the escape character “L,” suffixes also can have no rules. Here the suffix “-MENT” is added as:

DictionaryWord$ = TempWord$ + "MENT"
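The loop-and-dispatch idea can be sketched in a few lines. This is Python rather than Liberty BASIC, with a chain of if tests standing in for select case. Only the two unconditional rules just described are included, and the function names and the entry STATE with codes “AL” are made up for illustration.

```python
def expand_entry(key, temp_word):
    """Return the new word for one escape character, or None if it is not handled."""
    if key == "A":                   # prefix RE-, nothing to check
        return "RE" + temp_word
    if key == "L":                   # suffix -MENT, nothing to check
        return temp_word + "MENT"
    return None                      # a full routine would list many more cases


def expand_all(temp_word, affix):
    """Call expand_entry once per character in the affix codes."""
    words = []
    for key in affix:
        new_word = expand_entry(key, temp_word)
        if new_word is not None:
            words.append(new_word)
    return words


print(expand_all("STATE", "AL"))
```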

FIGURE 3 – The ProcessEntry routine begins with my rules. All lowercase letters are changed to uppercase, and, if any entry contains anything other than letters or the “/” (slash) or “'” (apostrophe), it is rejected. Once an entry is legal, it is searched for the “/” (slash) character. If it has none, then there are no rules and it is saved. Otherwise, the core word and the affixes are separated. The BuildEntry routine is called to expand the core word.

FIGURE 4 – Each character in AFFIX$ represents a rule for the core word. If any rule about the core word is true, then a prefix or suffix is added onto the core word and the new word is saved. This routine shown here is greatly abbreviated to illustrate a few simple and complex rules. The rules are found in the associated .aff file that is paired with the .dic file.

When dealing with an escape character like “B,” the rules get more complicated. If the last two letters of the core word = “EE,” then we just add the suffix “-MENT” to the core word. If the last letter is an “A” and the previous letter is not a vowel, then the suffix “-MENT” is added to the core word, else it cannot be added. If the last letter is an “E” and the previous letter is not a vowel, then the suffix “-MENT” is added to the core word after the last letter “E” of the core word is removed, else it cannot be added.

After a new word is formed, a call to SaveEntry will save it to the appropriate Temp file. One note here: when we discussed the .aff file earlier, there was a Y/N entry that allowed for compound words. A compound word contains more than one affix (suffix/prefix) addition to a core word. This is true (Y) for both escape character “U” (prefix “un-”) and escape character “G” (suffix “-ing”), so the core word “do” will have three additional entries, “undo,” “doing” and “undoing.”
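The effect of the Y/N field can be illustrated with a small Python sketch (again, not the Liberty BASIC code). The function name is hypothetical. It assumes every affix passed in has already met its condition, and it ignores any letters a suffix rule might need to strip.

```python
def expand_with_compounds(core, prefixes, suffixes):
    """prefixes and suffixes are lists of (affix, combinable) pairs."""
    words = [prefix + core for prefix, _ in prefixes]      # prefix only
    words += [core + suffix for suffix, _ in suffixes]     # suffix only
    for prefix, prefix_ok in prefixes:
        for suffix, suffix_ok in suffixes:
            if prefix_ok and suffix_ok:                    # both rules marked Y
                words.append(prefix + core + suffix)
    return words


both_yes = expand_with_compounds("do", [("un", True)], [("ing", True)])
one_no = expand_with_compounds("do", [("un", False)], [("ing", True)])
print(both_yes)
print(one_no)
```

With both rules marked Y, the core word “do” gains all three entries. If either rule were marked N, only “undo” and “doing” would be produced.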

CONCLUSION

When the application is run, process goals are printed in a simple text box to show the progress of the application (Figure 5). If you’ve stayed with me on this, then the question is: “Why create separate libraries with words of all the same length?”

FIGURE 5 – When the application is run, process goals are printed in a simple text box to show the progress of the application. We see that there are 62,118 entries in the original en_US.dic file. Of these, 61,943 core words are considered legitimate, and they are expanded into 107,234 words. The expanded dictionary is then separated into 25 files, each containing words of the same length.

This has to do with an ongoing pet project of mine. I like puzzles. With my morning coffee I like to do the crossword, Sudoku and Cryptogram in the local newspaper. Yep, I still subscribe to a daily newsprint. One of my past articles was a Sudoku solver, “Automating Sudoku” (Circuit Cellar 189, April 2006). Presently, I’m working on a Cryptogram solver. It would be handy to look up four-letter words with, say, an “e” as the last letter. With a large amount of dictionary work, it would be a waste of time to look for four-letter words in one large dictionary. Having a dictionary of four-letter words would be more efficient.

This led to looking for dictionaries. I knew that OpenOffice, which I am using to write this article, has a spell checker. This led to the discovery of the .dic file, and subsequently to the .aff file associated with it. Before I knew it, I was defining rules to expand core words. While this may seem unproductive to some, I enjoy the adventure of the unexpected. I had no idea that I would find a mystery in what I assumed would be a simple list of words!

English is my primary language. Without it I wouldn’t be able to communicate with other Americans. Although I’ve studied other languages, each of them has deteriorated from lack of use. I have great admiration for those who speak multiple languages. Their brains are obviously wired differently from mine. But isn’t that difference what makes us all so special? We need the talents of all Americans for our country to flourish. And not just our country, but the whole of planet Earth. Like it or not, we are all connected to it and have an impact on it. Too much to do, so little time. 

RESOURCES

Noah Webster and George and Charles Merriam – www.merriam-webster.com

.dic and .aff files – www.openoffice.org/lingucomponent/dictionary.html

“Automating Sudoku” (Circuit Cellar 189, April 2006)

Liberty BASIC – www.libertybasic.com

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • FEBRUARY 2021 #367


Jeff Bachiochi (pronounced BAH-key-AH-key) has been writing for Circuit Cellar since 1988. His background includes product design and manufacturing. You can reach him at: jeff.bachiochi@imaginethatnow.com or at: www.imaginethatnow.com.


Copyright © KCK Media Corp.
All Rights Reserved
