Notes   /   26 June 2022

Some notes on analyzing the content of Emoji Dick.

Emoji Dick is a 2010 project by Fred Benenson, who created Amazon Mechanical Turk jobs asking workers to convert sentences from Moby-Dick into equivalent expressions in emoji. I wanted to understand more about the emoji these Turk workers used, so I found a PDF and started looking through it. I saw a few things that seemed unusual about the emoji, but for the most part the translations are incomprehensible, and it is rare to see any meaningful correspondence with the sentences the symbols allegedly translate.

But even basic analysis of the emoji was hindered by the fact that -- at least in the PDF I found -- the emoji are rendered as embedded images, not as Unicode characters. This is interesting, for one thing, because it means that Emoji Dick is permanently fixed in the emoji aesthetics specific to 2010, but it also makes it impossible, for example, to Ctrl-F to see how many times a particular symbol occurs.

I made some headway against this problem, so I am writing this note to document what I did in case anyone else is ever interested or if I need to refer to it again.

Extracting Images

First, I extracted all of the images from the PDF. For this, I used pdfimages, which on a Mac is included in the poppler package. Installation via Homebrew was easy enough:

brew install poppler

But I had to grant it permission to overwrite some existing symlinks. No problem.

With that working, I extracted 187,448 very small images in .ppm format. Eventually, I realized that pdfimages had created two files per image -- the second one looked like it might have been an alpha mask -- so after deleting all those extras, I ended up with 93,724 images.

Hashing

Then, I wanted to see which emoji images are represented, which are repeated, how many times they were repeated, etc. I just had a big list of sequential file names, so in order to find out which images are the same as each other, I ran them through Python's hashlib and saved the result in a .tsv file. I'm sure I could have done this with Pandas, but since I always have to look up how to deal with dataframes, this was quicker and good enough.

Here's the script:

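import glob
import hashlib
import os

db = []

files = glob.glob("./img/*.ppm")

for f in files:
    # read in binary mode: hashlib needs bytes, and .ppm data is not text
    with open(f, "rb") as i:
        h = hashlib.sha256(i.read()).hexdigest()

    fn = os.path.basename(f)

    db.append((fn, h))

# sort rows by filename (plain string comparison)
db.sort(key=lambda x: x[0])

# one row per image: filename <TAB> sha256 hex digest
with open("db.tsv", "w") as w:
    for fn, h in db:
        w.write(fn + "\t" + h + "\n")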
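
One caveat about the sort step: it compares filenames as strings, which matches the extraction order only if the number in every filename is zero-padded to the same width. With tens of thousands of files that may not hold. Here is a minimal sketch of a numeric sort key, assuming names shaped like img-1234.ppm; the names and the helper numeric_key are made up for illustration.

```python
import re

def numeric_key(filename):
    """Sort key: the last run of digits in the name, as an integer."""
    numbers = re.findall(r"\d+", filename)
    return int(numbers[-1]) if numbers else -1

# Hypothetical filenames, not taken from the real extraction
names = ["img-1000.ppm", "img-101.ppm", "img-999.ppm"]

print(sorted(names))                   # string sort puts img-1000 first
print(sorted(names, key=numeric_key))  # numeric sort puts img-1000 last
```

If the real filenames do need this, the key could be passed to db.sort in place of the plain filename comparison.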

Counting

The hash table gave me a list of filenames matched to hashes, so the first and easiest thing I wanted to find out was which images show up most frequently. The next script finds out by converting the data table into a dictionary keyed by hash.

db = {}

with open("db.tsv") as f:
    data = f.readlines()

# group filenames by hash, so each key is one distinct image
for d in data:
    (fn, hn) = d.rstrip("\n").split("\t")

    if hn in db:
        db[hn].append(fn)
    else:
        db[hn] = [fn]

with open("uniq.tsv","w") as w:
    for k in db.keys():
        w.write(k + "\t" + str(len(db[k])) + "\n")
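
The output of that script is unsorted, so finding the most frequent images still takes a sort afterwards. As an alternative sketch (not the script I actually ran), collections.Counter can count and rank in one pass. The function name rank_hashes and the toy rows below are made up for illustration; real input would be the lines of db.tsv.

```python
from collections import Counter

def rank_hashes(lines):
    """Count how often each hash appears in 'filename<TAB>hash' lines."""
    counts = Counter()
    for line in lines:
        fn, hn = line.rstrip("\n").split("\t")
        counts[hn] += 1
    # most_common() returns (hash, count) pairs, highest count first
    return counts.most_common()

# Toy rows standing in for db.tsv; names and hashes are invented
sample = ["a.ppm\taaa\n", "b.ppm\tbbb\n", "c.ppm\taaa\n"]
ranked = rank_hashes(sample)
print(ranked)
```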

What I found here was kind of interesting once I matched specific hashes to their emoji.

For example, the spouting whale emoji (🐳: 8ed63b50a4f77c9925f1c3bed704cd596a7ff54b8c5f27347038630e08a088f5) is relatively popular with 498 occurrences, but it is dwarfed by the red exclamation mark (❗: 6d170fb7fe5b6750d9e1b816e246c23d287d18610a221a1903b84b46146e5d57) and the red question mark (❓: 7a038d676b2877c89d3225f0c9d3c1b80378ce05ab3eaf0d1f876690be0d6c8a), which come in at 852 and 841 occurrences, respectively. I suppose my goal here is to get inside the head of a Turk worker, and for these two symbols I can imagine reasonable semantic reasons to use them typographically.

What really surprised me, though, was the most popular non-typographic emoji, 👲. At 743 occurrences, I wondered if it was being used as a stand-in for Ishmael, Queequeg, or some other character. But a quick skim didn't turn up any uses that confirmed that. It appeared more or less randomly. That is, it seemed random until I noticed another pattern.

🏬 🔑 💁 🚑 🎓 ☕ 👲 👸 🎡

For reasons that I do not know, the 9-symbol sequence above appears 439 times throughout the text. I saw it a few times in a row, and to confirm my suspicion that it was the same thing recurring, I wrote another script to look for emoji n-grams using my hash table:

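db = []

# n-gram length
gl = 9

grams = {}

with open("db.tsv") as f:
    data = f.readlines()

# load (filename, hash) rows in file order
for d in data:
    (fn, hn) = d.replace("\n", "").split("\t")

    db.append((fn, hn))

# b is the index of the last image in each window of gl images;
# starting at gl - 1 and ending at the last row covers every window
for b in range(gl - 1, len(db)):
    gram = ''

    # prepend hashes walking backwards, so the gram reads in document order
    for i in range(gl):
        gram = db[b - i][1] + ' ' + gram

    if gram in grams.keys():
        grams[gram] += 1
    else:
        grams[gram] = 1

# print every n-gram that occurs more than once
for k in grams.keys():
    if (grams[k] > 1):
        print(k + "\t" + str(grams[k]) + "\n")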

This sequence really jumps out, with 439 repetitions. The next most popular 9-grams come in at 50-60 repetitions, and these are mostly just overlaps with the principal 9-gram sequence identified above.
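
The overlaps are what you would expect when a sequence is pasted several times in a row: every window that straddles two copies is a rotated version of the same sequence. Here is a toy sketch of that effect, using single letters in place of image hashes and 3-grams in place of 9-grams. The helper count_ngrams and the data are made up for illustration.

```python
from collections import Counter

def count_ngrams(seq, n):
    """Count every window of n consecutive items in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

# Toy data: the pattern A B C pasted three times in a row, then other symbols
seq = list("ABCABCABCxyz")
counts = count_ngrams(seq, 3)

print(counts[("A", "B", "C")])  # the pattern itself
print(counts[("B", "C", "A")])  # a rotation that straddles two copies
print(counts[("C", "A", "B")])  # another rotation
```

The rotations only appear where copies sit back to back, which would explain why they are much rarer than the main sequence.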

It took me a while to identify the fifth emoji in this sequence, by the way, but eventually I found that it is the graduation cap, which for some reason looks like a formal coat on iOS 2.2 -- this also suggests which system the Turk workers and/or Benenson were using for their set.

Here's what it looks like in the original emoji designs: [image: Screen Shot 2022-06-26 at 10.38.04 AM]

I can't find any particular meaning in this sequence of 9 emoji, but my hunch is that it was a worker copying and pasting the same sequence over and over again. Benenson paid 5 cents per sentence, or about $500 in total, so this worker likely earned $21.95 (439 × $0.05) by spamming this sequence.

To put this in perspective, Benenson says that there are about 10,000 sentences, so these 439 repetitions make up about 4.4% of the total work.
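
The arithmetic, spelled out as a quick check. It assumes each occurrence of the sequence was submitted as one paid sentence, which is my guess rather than something the book confirms.

```python
occurrences = 439    # times the 9-emoji sequence appears
rate = 0.05          # dollars paid per sentence
sentences = 10_000   # approximate number of sentences, per Benenson

earned = occurrences * rate      # what the spammed sequence would have paid
total = sentences * rate         # approximate payout for the whole book
share = occurrences / sentences  # fraction of all sentences

print(f"${earned:.2f} of ${total:.2f}, {share:.2%} of sentences")
```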

806 workers are credited at the end of the book, but there is no way of knowing which one of them is responsible for 🏬 🔑 💁 🚑 🎓 ☕ 👲 👸 🎡 .