Coding Names by How They Look

As someone who reads a lot of handwritten documents, I am aware of some of the common areas of confusion that can arise in transcribing names. Being aware of them, however, does not make me immune to misreading the letters! It's still incredibly hard. In a previous article, Coding by Sound, I summarized how names can be coded by how they sound. This is a valuable tool, since names can often be spelled multiple ways and still sound the same. I thought, “Can I come up with a coding system that takes into account how the name looks?” If so, this could give researchers another way to find elusive ancestors. The rest of this article details the work I did on this system.

Examples of the problem

A while ago, someone was looking for a Polish person whose first name was Hamslans. It was unlike any given name I had ever heard. It struck me immediately, though, that the first name they were looking for was Stanislaus. Depending on the penmanship of the writer, St can look like an H. The lowercase m (and n, for that matter) sometimes looks like a series of short vertical strokes. So what looks like lll, suggestive of the m in Hamslans, is actually the ni of Stanislaus. The ll of what was interpreted as the n is actually the letter u.

Oftentimes, records are not clearly written, and names get misspelled when transcribed. Some letters or letter combinations look similar to others. Since online indexes are prepared from transcriptions of the handwritten word, the quality of the index depends on how well the handwritten script was read. If you can't find something in an index that you think should be there, it may be that it was misread.

Before you blame the indexer, let's look at a couple more examples I have encountered to illustrate that transcription is no simple matter. I was searching for an immigrant relative named Szadorski. For the longest time, I could not find him because he was indexed as Sradorski: the Polish script r and z can look a lot alike. (To find him, I actually did a wildcard search in the second letter position.) Another common source of confusion is between the letter t and the Polish ł, which can often look alike. I was looking for naturalization records for a Kolaski (Kołaski) relative, but had no success searching for Kolaski. So instead, I searched for Kotaski and found him. So we see that neither a normal search nor a Soundex-based search gets us what we're looking for.

Development of my Look-alike coding system

The description that follows is based on over 25 years of experience in transcribing names. I have grouped similar-looking letters and assigned each group a numeric value, and these values make up the code. While this discussion focuses on cursive versions of letters, many of the descriptions can also apply to the printed versions. A letter may belong to a different group depending on whether it is capitalized or not, since the two versions may look different. Adjacent repeated letters are eliminated, since doubled letters seldom alter the sound of a name and would only serve to complicate the utility of the code.

It's easy to roughly categorize letters based on their general characteristics. For example, the letters a, c, e, o, s, A, C, and O have roundish characteristics; I call them roundy. In some writing, for instance, it's easy to confuse a with o. Assigning a code of 0 (zero) seemed only natural for those letters. The next group of letters includes u, m, n, w, i, rz, M, N, W, and U. I call these spiky because they have vertical strokes or spikes. It's easy to confuse m with w, u with n, or n with rz, to name a few. While we would typically say that m and n have humps, one often finds them written with spikes. Since the digit 1 is a vertical stroke, it was chosen as the code to represent this group.

Code group 2 includes the letters L, S, and Z (yes, these are all capital letters). Code group 3 includes P, R, K, and B (again, all capital letters). Group 4 letters have descenders, parts that dip below the line of writing, and include f, g, j, p, and y (z could fit into this group, but Polish z's are not written with a descender). Code group 5 includes D, E, and G. Code group 6 letters have ascenders, parts that extend above the half-way mark of the line, and include b, d, h, and k (l and t could fit into this group but are coded separately). Code group 7 includes T and F. Code group 8 includes l, ł, and t. The ł and t are the most commonly confused letters in handwriting.

Code group 9 includes I, J, Y, r, and z. While group 9 letters do not all look alike, the I and J often do, and the r and z often look alike. A name beginning with a code of 9 has an I, J, or Y at the beginning. Anywhere else in the code, a 9 must refer to r or z (regardless of whether the z is written with a descender or not). Because H can look like an St, it is coded as 28. The rest of the capital letters may be in the same group as their lowercase counterparts if they look similar. Now that all the letters have been assigned to a group, one can encode any name, letter by letter. What follows are the groupings as I have them now:

Look-alike coding summary

Code  Letters                          Notes
0     a, c, e, o, s, A, C, O           the roundy letters, like the digit zero
1     u, w, i, m, n, rz, U, W, M, N    the spiky letters, like the digit one
2     L, S, Z
3     P, R, K, B
4     f, g, j, y, p                    the descenders
5     D, E, G                          the catch-all group
6     b, d, h, k                       the ascenders
7     T, F
8     l, ł, t
9     I, J, Y, r, z                    ran out of digits, but the meaning is clear from whether the 9 occurs at the beginning of the code or somewhere in the body
28    H
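
For readers who like to see such a table as a lookup, here is a minimal Python sketch of the letter-by-letter substitution. The names GROUPS, CODE_OF, and raw_code are my own illustrations, not anything used by the website. I have assumed that ł belongs with l and t (as described in the text), that the lowercase pair rz is coded as a single spiky unit, and that any letter missing from the table should simply be reported as an error. Doubled letters are left as written.

```python
# Look-alike groups, one entry per code (hypothetical names throughout).
GROUPS = {
    "0": ["a", "c", "e", "o", "s", "A", "C", "O"],            # roundy
    "1": ["u", "w", "i", "m", "n", "rz", "U", "W", "M", "N"],  # spiky
    "2": ["L", "S", "Z"],
    "3": ["P", "R", "K", "B"],
    "4": ["f", "g", "j", "y", "p"],                            # descenders
    "5": ["D", "E", "G"],
    "6": ["b", "d", "h", "k"],                                 # ascenders
    "7": ["T", "F"],
    "8": ["l", "ł", "t"],   # assumption: ł is grouped with l and t
    "9": ["I", "J", "Y", "r", "z"],
    "28": ["H"],            # H can look like St
}

# Invert the table: symbol -> code.
CODE_OF = {symbol: code for code, symbols in GROUPS.items() for symbol in symbols}


def raw_code(name):
    """Substitute one code per letter, with no further rules applied."""
    digits = []
    i = 0
    while i < len(name):
        # Assumption: the lowercase pair "rz" is read as one spiky unit.
        if name.startswith("rz", i):
            digits.append(CODE_OF["rz"])
            i += 2
            continue
        letter = name[i]
        if letter not in CODE_OF:
            # Assumption: letters absent from the table are reported, not guessed.
            raise ValueError(f"no look-alike group for {letter!r}")
        digits.append(CODE_OF[letter])
        i += 1
    return "".join(digits)


print(raw_code("Zimny"))    # 21114
print(raw_code("Kołaski"))  # 3080061
print(raw_code("Kotaski"))  # 3080061
```

Kołaski and Kotaski come out identical, which is exactly the confusion described earlier.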

Apply the following rules in order:
1. One of the codes above is substituted for each letter in a surname.
2. Consecutive 1's (ones) are reduced to a single 1. Example: The name, Zimny, initially would be coded as 21114 but then becomes 214. This rule was created because Zimny might have 6 spikes in a row when handwritten. Someone might transcribe what they see as Zininy which also has 6 spikes and would initially be coded as 211114 which is not the same as 21114. With this rule, both names reduce to the same code, 214, as they should because they look alike.
3. The code is truncated to 6 digits if it is longer than that. Without this rule, the code would be nearly the same length as the name itself, and the longer the name, the higher the probability that the code would be unique to that name. In computer searches, such high “precision” can lead to few or no matches being found. By compromising and limiting the code length to no more than 6 digits, it becomes more likely that similar names will be found.
4. Names that code to fewer than 6 digits are padded with the digit 2 until there are 6 digits. Wait, isn't 2 used to represent S, L, and Z? Yes, but only as a first digit. If 2 occurs elsewhere in the code, it must represent a “blank” character.
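
Rules 2 through 4 can be sketched in a few lines of Python. This is only an illustration under my own naming (finish_code, CODE_LENGTH, PAD_DIGIT are hypothetical); it takes the digit string produced by rule 1 as its input and says nothing about how the website actually does it.

```python
import re

CODE_LENGTH = 6  # rule 3: keep at most 6 digits
PAD_DIGIT = "2"  # rule 4: a 2 after the first digit means "blank"


def finish_code(raw):
    """Apply rules 2-4 to the letter-by-letter digit string from rule 1."""
    collapsed = re.sub(r"1+", "1", raw)          # rule 2: runs of 1's become one 1
    truncated = collapsed[:CODE_LENGTH]          # rule 3: cut to 6 digits
    return truncated.ljust(CODE_LENGTH, PAD_DIGIT)  # rule 4: pad short codes


print(finish_code("21114"))   # Zimny  -> 214222
print(finish_code("211114"))  # Zininy -> 214222
```

The order matters: collapsing the spikes first means a long run of 1's cannot use up the 6 available digits before the rest of the name is coded.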


Do I really need to bother with coding the names? Of course not! Subscribers can use the Look-alike search option for their searches. The website automatically codes the surname for you and compares the code against those in the database. The above lengthy explanation is just an "under the hood" peek at what's happening. Be aware that the search is on the surname only and takes into account the name exactly as you entered it. That is to say, if you are looking for Rogalski, that's what you must enter. If you try the trick of just entering Rogal, as you might in a "normal" search, only codes matching Rogal will be found, because its code is different from that of Rogalski (Rogal is padded to 304082, while Rogalski is truncated to 304080).

Does this coding scheme help with the initial Hamslans problem? Hamslans codes to 280108010 (before using just 6 digits). Stanislaus also codes to 280108010 (before using just 6 digits) once the consecutive 1's from its n and i are reduced to a single 1. This shows that we have a match at the 6-digit precision and even at the full precision achieved by coding every letter.
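
To make the exact-match behavior concrete, here is a small Python sketch of a code lookup. The in-memory dictionary and the names INDEXED_CODES and look_alike_search are assumptions of mine for illustration, and the codes are worked out by hand from the table and rules above; how the website stores and compares its codes is not described here.

```python
# Hypothetical index: 6-digit look-alike code -> names carrying that code.
# Codes worked out by hand from the table and the four rules.
INDEXED_CODES = {
    "280108": ["Stanislaus"],  # 2801108010 -> 280108010 -> 280108
    "304080": ["Rogalski"],    # 30408061 -> 304080
}


def look_alike_search(query_code):
    """Return names whose code equals the query code exactly."""
    return INDEXED_CODES.get(query_code, [])


print(look_alike_search("280108"))  # Hamslans  -> ['Stanislaus']
print(look_alike_search("304082"))  # Rogal     -> []
print(look_alike_search("304080"))  # Rogalski  -> ['Rogalski']
```

Because the comparison is on the whole 6-digit code, the padding digit at the end of a short name keeps it from matching a longer name that merely starts the same way.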