Getting back to the Code Book Contest cipher, we have the first line:

MCCMMCTRUOUUUREPUCCTCTPCCC

Doing a character count, the ciphertext is 7048 characters long. This can be factored as 8 x 881. Assuming that all of the columns are the same length, then there’s a good possibility that the key is 8 letters. Also, we can work on the additional assumption that this is a CEMOPRTU cipher (in accordance with Almgren and team’s solution paper.

1 2 3 4 5 6 7 8

—————

M C C M M C T R

U O U U U R E P

U C C T C T P C

C C C U U P C M

M P R T C C R U

P E C C M U U P

C M P E P P U P

U R U P P M E U

P U C E U U C U

Taking a bit more of the ciphertext, I started by numbering the columns 1-8, and breaking the ciphertext into rows of 8 characters each.

(Edit: This was my first mistake. I didn’t know which way to build up the table – in rows or in columns. I went with rows.)

On its face, without the key, I’m stuck. All possible permutations of the columns are equally likely because every position in the substitution table can appear in the ciphertext. In other words, I can’t tell if the key unscrambles as 17243568 or 78425316, or anything else in between. That gives 8!, or 40,320 different ways the columns can be arranged.

Except… The solution to stage 6 reads:

“THIS STAGE IS NUMBER SIX AND IT IS A PLAYFAIRCIPHER. THE FOLLOWING STAGE IS NUMBER SEVEN WITHIN THE CONTEXT OF THIS COMPETITION AND IT IS ENCRYPTED ACCORDING TO AN ADFGVX TYPE CIPHER. WHEN YOU ATTEMPT TO CRYPTANALYSE STAGE SEVEN YOU WILL SEE THAT IT IS MORE COMPLICATED THAN A STRAIGHTFORWARD ADFGVX CIPHER. UNFORTUNATELY

I CANNOT GIVE YOU ANY MORE CLUES.

IN TERMS OF THIS STAGE THE SHIBBOLETH THAT YOU SHOULD TAKE NOTE OF IS MOLYBDENUM

YOU WILL PROBABLY HAVE NOTICED BY NOW THAT EACH STAGE IS IN A DIFFERENT CIPHER AND IS USUALLY IN AN APPROPRIATE LANGUAGE. SO THE VIGENERE STAGE WAS IN FRENCH AND THIS PLAYFAIR STAGE IS IN ENGLISH. IF THE PATTERN PERSISTS THEN THE NEXT STAGE WILL BE IN GERMAN. BUT DOES THE PATTERN PERSIST?”

(Note that “Shibboleth” refers to the code for stage 6 – “molybdenum”. Each of the 10 stages has its own code, and the winner of the contest was the first group to mail in all 10 codes to Simon.)

And, yes, the pattern persists – Stage 7 is in German. There are 22,355+ German words of 8 letters, so that part isn’t going to help me with the key. Except…

I now know what the expected letter frequencies should be. If I pair up 2 of the cipher columns, I can do a character count. Even though one column pair represents every 4th letter in the plaintext, the message itself is long enough that any given quarter of the text should still contain a good mix of characters that will come close to the German frequency table. Picking column 1, and permuting it with the other 7 columns, then counting the character frequencies is easy enough to do in a VBscript. Putting the output in Excel to graph the data visually should make the correct combination stand out clearly. Repeat the same steps for the other remaining 3 two-column pairs. There will still be the question of what order the 4 two-column pairs should be in, but that can wait for the moment.

So, I do this. I start with column 1 (MUUCMPCUP) and column 2 (COCCPEMRU) and get MC UO UC CC MP PE CM UR and PU. In this small subsample, there’s no duplications, but in the full text, UU occurs 63 times, UC 33 times, PU times 26, etc. I repeat the process for 1-3 through 1-8, and the pair with what seems to be the best turn out is… 1-2. I go to column 3 and run the permutations, and the best match seems to be 3-2, followed by 3-4. With column 5, it’s 5-1 then 54. And for column 7, it’s 7-6 followed by 7-2. Yeah, this is useless.

Plugging the numbers into Excel and looking at representative graphs gives almost identical-looking curves that aren’t that far off from the German frequency table. The problem is that the transposition chart is case insensitive, while the plaintext is mixed case, and until I break the code, I can’t match up characters for “a” and “A”, “t” and “T”, etc. Additionally, there are 64 characters total in the substitution table, smearing the percentages out and making the curves flatter.

Ok, doing things smarter isn’t helping me. Theoretically, I should be able to cut my work in half based on one flaw in the system. Because the CEMOPRTU table is a square, it is symmetrical along the diagonal CC to UU. Since I don’t know what the original table values were, and I don’t have the original key, it really doesn’t matter if I say EM = “E” and ME = “9”, or if I flip the table and claim EM = “9” and ME = “E”. All that’s important is that I get the plaintext out correctly. So the choice of column pair ordering “12”, “34”, “56” and “78” versus “21”, “43”, “65” and “87” is irrelevant and long as I remain consistent with all 4 pairs.

Fine. I make a stab in the dark and pair up 1 and 2, 3 and 4, 5 and 6, and 7 and 8, in that order. I then go to the next step, which is to arbitrarily assign random characters to the CEMOPRTU table, and convert MCCMMCTR to the characters from the table. There are two reasons for doing this. First, it cuts the number of characters in half (for example, “MCCMMCTR” becomes “CoyQ”.) Second, it makes things easier when counting 3-letter words.

Assuming that the plaintext was in English, one of the most common three-letter words in the language is “the”. On top of this, because the code table is assumed to contain “space,” or some other word delimiter, if there is any three-letter combination that occurs frequently, and is occasionally bracketed by the same character front and back, say “pH9Xp”, then at a minimum I can claim that that specific permutation of ciphertext columns deserves further study.

At this point, my program can’t handle stepping through permutations of the numbers 1-8, so I hardcode the string “12345678” in the program, build up the corresponding cipher string, look up the matching characters in the table, and do the frequency counts for 3-character combinations. I change the string to “34125678”, run the program again. Change to “56123478”, etc. for all 24 permutations (4*3*2*1 = 24). Two of the strings give me one 3-character “word” that shows up three times each. But, it’s not bracketed correctly. If it were, I’d make the assumption that the brackets represent “space”, and I’d make the substitution in the look-up table to do a reality check on the actual word spacing. But, that’s not working out right, either.

Great. Ok, if working smarter in any form at all isn’t helping me, I’ll work that much harder.

The first step is to figure out a way to automate generating permutations. It’s one thing to hand write 1234, 1243, 1324, 1342, 1423, 1432, 2134, 2143, 2314, 2341, 2413, 2431, 3124, 3142, 3214, 3241, 3412, 3421, 4123, 4132, 4213, 4231, 4312 and 4321, if you only have four digits. But at 8 digits, that’s 40,000+ values again. And that would take me a while.

Pseudo code:

s = “12345678”

ptr = 8

do while not done

store the character in s at ptr position to c as an integer

replace that character in s with ” ”

increment c

if c > 8, then decrement ptr

else

check if the new value of c is in s

if yes, keep incrementing c

if no, replace the character in s at ptr with the new value of c,

increment ptr and store 0 to c.

end if

if ptr < 1, reached last permutation – quit program

if ptr > 8, have built up complete new permutation – loop done

Say I have 1342 and I want the next number.

c = 2, and I make s = 134

c + 1 = 3, but 3 is already in s.

c + 1 = 4, but 4 is also already in s.

c + 1 = 5. 5 is greater than 4, so decrement ptr.

c = 4, and make s = 13

c + 1 = 5. 5 is greater than 4, so decrement ptr.

c = 3, and s =1

c + 1 = 4. 4 is not in s, so write 4 to s at ptr.

s = 14, c = 0, ptr = ptr + 1

c + 1 = 1. 1 is in s.

c + 1 = 2. 2 is not in s, so write 2 to s at ptr.

s = 142, c = 0, ptr = ptr + 1

c + 1 = 1. 1 is in s.

c + 1 = 2. 2 is in s.

c + 1 = 3. 3 is not is s, so write 3 to s at ptr.

s = 1423, c = 0, ptr = ptr + 1

ptr is = 5, which signals the end of generating the next permutation.

After a lot of debugging, I get the code to work as a function in VBscript. I put it in with the rest of my program, and start it running. The permutation string increments relatively quickly, so the brute force approach might be feasible. I walk away from the computer for a couple hours, then come back. I’m supposed to be storing the results to file for permutations with 3 hits on any 3-character combination, and nothing came out. I start digging into the code, and uncover so many bugs that I have to question just what exactly I was focusing on when I wrote the script, but I also realize that I made a fatal mistake. (Edit: Just not the CORRECT fatal mistake.)

Because I’d been working with the column arrays so much, I’d used 881 for the length of my check string, and not 7048/2 = 3524 like I should have. My script was only checking 25% of the full ciphertext. That, and I’d mistakenly been counting the same 3-character strings more than once. But, the point is that by going from 881 to 3548 characters in the counting loop, the execution times went from 20 permutations per second down to a crawl at 1 permutation per 5 seconds. Trading off completeness for speed.

After verifying the code was working right, I started it running again, but I (intentionally this time) constrained the check length to the first 880 characters for speed. 3 hours pass, and I only get 6 files output, half of which are mirror images in the code table, and none have verifiable space bracket characters.

In theory, the German equivalent of “the” should show up in the first quarter of the plaintext, in all lowercase, surrounded by spaces, but that’s not what I’m seeing, out of 40,000+ total column permutations. I’m now facing two possible situations – either my sample space is too limited, and/or I’m doing something else wrong, too. (Edit: Duh.)

I change the string check length to 2000 characters, which drops me to 1 permutation per second. Estimated run time – 11 hours. Then I go do the dishes, and prepare for bed.

To be continued.