I am currently teaching a graduate course (eng630: “Digital Humanities: Emerging Tools and Debates in Literary Study”) and, as much as possible, I’m trying to make clear the mechanics behind some of the text-analysis in the works we’re reading. So, this week, as I prepared to discuss Stephen Ramsay’s Reading Machines, I wanted to reproduce some of the analysis done there. The first chapter, for instance, offers a tf-idf reading of Woolf’s The Waves. Here is how Ramsay describes it:

It is possible—and indeed an easy matter—to use a computer to transform Woolf’s novel into lists of tokens in which each list represents the words spoken by the characters ordered from most distinctive to least distinctive term. Tf-idf, one of the classic formulas from the field of information retrieval, endeavours to generate lists of distinctive terms for each document in a corpus. We might therefore conceive of Woolf’s novel as a ‘corpus’ of separate documents (each speaker’s monologue representing a separate document), and use the formula to factor the presence of a word in a particular speaker’s vocabulary against the presence of the word in other speakers’ vocabularies. (11)

This post summarizes how I tried to do just that, and the different results I got. I’m not sure what accounts for the differences from Ramsay’s (and Sara Steger’s) results; I’ll try to show you what I mean below. In a future post I’ll use the same “method” on a different text (spoiler: it’s Ulysses).

Readers familiar with The Waves, and the demands of text processing, will immediately recognize why the analysis of the characters’ monologues would present itself as a tractable problem (“indeed an easy matter”). While, in theory, one could do a similar analysis for any novel (or any work with multiple speakers), the narrative structure of The Waves makes it particularly available to this sort of analysis. Chapters describing the progress of the sun across the sky in the course of a single day alternate with chapters in which characters speak in semi-monologue about their lives. This device itself is the novel’s most obvious departure from the conventions of narrative fiction, but it also makes it “an easy matter” (well, maybe for some people) to extract these monologues. If you had good, marked-up data, you could easily extract this information (as Lincoln Mullen shows in this post, working with the Folger’s TEI Shakespeare); but if all you have is unstructured plaintext, you’re going to have a problem. Woolf’s novel, though, even in plaintext, carries a good deal of this informational structure in its novelistic form (there is, as they say, no such thing as an unmarked text).

Here is a chunk of The Waves, quoted at random:

‘Where is Bernard?’ said Neville. ‘He has my knife. We were in the tool-shed making boats, and Susan came past the door… Now we must drop our toys. Now we must go in together. The copy-books are laid out side by side on the green baize table.’

‘I will not conjugate the verb,’ said Louis, ‘until Bernard has said it…

There is always a short phrase (starting with an opening single quotation mark—i.e. an apostrophe—and a capital letter), some text, a closing single quote (variously punctuated), the word said followed by a character name and some punctuation mark, an opening single quotation mark and some words. This single “monologue” may continue into the next paragraph (which would then, consistent with convention, be opened by a single quotation mark—i.e. an apostrophe). Finally the monologue is closed by an apostrophe before the narrative turns to another character (and another opening apostro-quote), or to one of those sun-dappled interludes.

Whew; describing what is so obvious to any reader of the text is painful (as, I imagine, is reading my description of it), but it is this highly structured convention which makes Woolf’s novel comparatively available to processing. Even absent TEI (or other) markup, Woolf’s convention creates an ad-hoc ordered hierarchy of content objects, at least for the reader interested in the characters’ monologues. Someone with more regex-fu than I have might be able to cut out character dialogue programmatically. (I think the regex would go something like /'([^']+,)' said Louis, '(.*)'/ and would capture, as $1 and $2, the material the character says… in theory.) I tried to sort this out, but quickly gave up. Instead, I manually paged through the text and pasted together all the text said by a single character, so that everything from that opening apostro-quote to the closing apostro-quote would be on one line, and the phrase [character name] said would occur somewhere near the front of that line. (In emacs, checking twitter and listening to podcasts, this represented an hour and a half’s labor; labor, mind you, which was sufficiently mindless that I enjoyed a beer. Though, as you’ll see below, this fact led me to redo the entire thing.)
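
(For what it’s worth, once the line-joining is done, a sketch like the following could automate the per-character split in R. This is hypothetical; as I say, I did it by hand and by grep, and this sketch is untested against the text’s edge cases. The file and directory names match the steps below.)

# Hypothetical sketch: pull out each speaker's lines, assuming every
# monologue already sits on a single line containing 'said <Name>'.
lines    <- readLines('the-waves.txt')
speakers <- c('Bernard', 'Neville', 'Louis', 'Jinny', 'Rhoda', 'Susan')
matches  <- regexec(paste0('said (', paste(speakers, collapse='|'), ')'), lines)
found    <- vapply(regmatches(lines, matches),
                   function(m) if (length(m)) m[2] else NA_character_,
                   character(1))
for (name in speakers) {
  writeLines(lines[which(found == name)],
             file.path('characters', paste0(tolower(name), '.txt')))
}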

And so, with thanks to Woolf for her highly structured departure from novelistic convention, and to emacs for keybindings that made this somewhat less loathsome, you’re ready to extract your data. It is now simply a matter of grepping the file for each character:

grep 'said Louis' the-waves.txt > characters/louis.txt
grep 'said Neville' the-waves.txt > characters/neville.txt
...

And so on. (The text Ramsay & Steger use, and that I also used, comes from Project Gutenberg Australia. Because of copyright, I cannot share the processed data I am working with—and, as you’ll see, this extraction process is a crucial step. If you’d be interested in seeing or using this data, to save yourself the hour and a half’s labor, just drop me an email. I have relatives in Australia who would be happy to host you during the term of your interaction with this copyrighted material.)

At which point, the actual, real analysis begins. Here is the code I used, in R (using the tm package), to get my results. It assumes that each individual’s speech is contained in a single text file in a directory immediately below the working directory, called ‘characters’ (that’s what all that grepping above was about).

# This code relies on the tm (text mining) package
library('tm')

# Create a corpus based on the subdirectory
characters <- Corpus(DirSource('characters/'))

# To aid processing, let's make everything lower-case
characters <- tm_map(characters,tolower)

# And remove punctuation
characters <- tm_map(characters,removePunctuation)

# And we'll remove stopwords - this step is optional, but in the
# version of the code I'm pasting here I removed them, in an effort
# (to no avail, alas!) to match Ramsay & Steger's result.
characters <- tm_map(characters,removeWords, stopwords('english'))

# Now, we create a Document Term Matrix - that is, a set of the
# frequencies for each word in each document. The secret
# sauce is that control=list(weighting=weightTfIdf) line, which
# asks that those not be raw counts, but tf-idf scores.
dtm <- DocumentTermMatrix(characters, control=list(weighting=weightTfIdf))
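
(For the curious: as best I can tell from the tm documentation, that normalized weighting amounts to the computation below. This is my hand-rolled sketch of it, not a guaranteed reproduction of the package’s internals.)

# What weightTfIdf (normalized) appears to compute: each term count
# scaled by its document's total term count, times the log2 inverse
# document frequency.
raw_counts <- as.matrix(DocumentTermMatrix(characters))
tf    <- raw_counts / rowSums(raw_counts)           # normalized term frequency
idf   <- log2(nrow(raw_counts) / colSums(raw_counts > 0))
tfidf <- sweep(tf, 2, idf, '*')                     # should match dtm above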

And here is just a taste of what that looks like. This code asks R (personify much?) to let me see the 45th through 55th terms in the matrix (the terms are arranged alphabetically) for all texts. (You access the matrix by requesting row and column: matrix[row, column]; so the empty row field requests all rows (that is, all texts), and columns (which represent words) 45 through 55, a range chosen entirely at random.)

> inspect(dtm[,45:55])
A document-term matrix (6 documents, 11 terms)

Non-/sparse entries: 16/50
Sparsity           : 76%
Maximal term length: 13 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               account     accounts   accretions accumulating accumulation
  bernard.txt 0.0002033962 0.0002033962 0.0001247118 0.0002033962 0.0002033962
  jinny.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  louis.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  neville.txt 0.0000000000 0.0000000000 0.0004414937 0.0000000000 0.0000000000
  rhoda.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  susan.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
             Terms
Docs          accumulations     accuracy   accurately      achieve         acid
  bernard.txt  0.0002033962 0.0002033962 0.0000000000 0.0001247118 0.0000000000
  jinny.txt    0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  louis.txt    0.0000000000 0.0000000000 0.0000000000 0.0004378349 0.0004378349
  neville.txt  0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0004414937
  rhoda.txt    0.0000000000 0.0000000000 0.0007776662 0.0000000000 0.0000000000
  susan.txt    0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
             Terms
Docs           acknowledge
  bernard.txt 0.0000786844
  jinny.txt   0.0000000000
  louis.txt   0.0002762431
  neville.txt 0.0002785515
  rhoda.txt   0.0000000000
  susan.txt   0.0000000000

So, this shows us, for instance, that Bernard, Louis, and Neville all use the word acknowledge (Jinny, Rhoda, and Susan don’t); and Louis and Neville use it more than Bernard (and at nearly the same rate as each other).

At this point, we’ve got the data. All that’s needed is a little R data-finesse to get it back out in the order we want it. I’m quite new to R, so I may be missing the better/more obvious way to do this, but this way seems to work. I load the data into a matrix, and then extract each document’s row as a named vector (I think I’m getting my R data types right), sorted by the words’ scores. We can then output as many (score, term) pairs from the re-ordered vectors as we want (say, the top 24 terms).

m           <- as.matrix(dtm)

bernard     <- sort(m[1,], decreasing=TRUE)
jinny       <- sort(m[2,], decreasing=TRUE)
louis       <- sort(m[3,], decreasing=TRUE)
neville     <- sort(m[4,], decreasing=TRUE)
rhoda       <- sort(m[5,], decreasing=TRUE)
susan       <- sort(m[6,], decreasing=TRUE)
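
(In hindsight, a less repetitive way to do the same thing, and one that indexes by row name rather than position, so nothing depends on the order in which the files were read:)

# The same sorting for all six characters at once, indexed by row name:
top_terms <- lapply(rownames(m), function(doc) sort(m[doc, ], decreasing=TRUE))
names(top_terms) <- rownames(m)
# top_terms[['louis.txt']][1:24] should then match louis[1:24] below.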


> louis[1:24]
    western      accent     grained        thou        wilt       beast 
0.006426702 0.005691854 0.004284468 0.004284468 0.004284468 0.003570390 
   boasting        nile     average      clerks         oak      stamps 
0.003502680 0.003502680 0.002856312 0.002856312 0.002762431 0.002762431 
 australian        boys    pitchers       steel      beaten     bobbing 
0.002627010 0.002209945 0.002189175 0.002189175 0.002142234 0.002142234 
    custard  eatingshop     england       eyres  fourthirty         ham 
0.002142234 0.002142234 0.002142234 0.002142234 0.002142234 0.002142234 

> bernard[1:24]
  thats     hampton        lady   curiosity      letter        ones 
0.002237358 0.001870677 0.001870677 0.001830566 0.001830566 0.001745965 
    elderly      heaven     married    observed       byron     phrases 
0.001627170 0.001627170 0.001627170 0.001627170 0.001621254 0.001610960 
     dinner      willow      phrase         fin      simple    describe 
0.001496542 0.001496542 0.001495004 0.001423774 0.001423774 0.001371830 
       self       stick       sense      nature    thinking      canopy 
0.001371830 0.001371830 0.001288768 0.001247118 0.001247118 0.001220377 

> neville[1:24]
      story        ones      doomed immitigable      papers       cheep 
0.003342618 0.003090456 0.002880181 0.002880181 0.002880181 0.002207469 
 perfection       camel      detect   hosepipes      hubbub       loads 
0.002207469 0.002160136 0.002160136 0.002160136 0.002160136 0.002160136 
     mallet      marvel   squirting        boys       byron     founder 
0.002160136 0.002160136 0.002160136 0.001949861 0.001765975 0.001765975 
      scene shakespeare       stair      abject   admirable        ajax 
0.001765975 0.001765975 0.001671309 0.001440091 0.001440091 0.001440091 

> jinny[1:24]
     tunnel    prepared   billowing        game      native       peers 
0.003833041 0.003194201 0.003125710 0.003125710 0.003125710 0.003125710 
    quicker  melancholy      bodies        band       bodys     cabinet 
0.003125710 0.002555361 0.002121992 0.002083807 0.002083807 0.002083807 
     coach        crag      dazzle      deftly    equipped    eyebrows 
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 
     felled     glasses        jump     lockets    matthews    murmured 
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 

> rhoda[1:24]
    oblong        dips       tiger      fuller      themoh     swallow 
0.005443664 0.003888331 0.003337767 0.003110665 0.003110665 0.002860943 
    fallen   suspended      cliffs     garland  manybacked     minnows 
0.002707581 0.002384119 0.002332999 0.002332999 0.002332999 0.002332999 
       pond   structure      terror       bunch        foam   moonlight 
0.002332999 0.002332999 0.002105897 0.001907295 0.001907295 0.001907295 
      party      puddle       dream       pools     violets     amorous 
0.001907295 0.001907295 0.001805054 0.001805054 0.001805054 0.001555332 

> susan[1:24]
    kitchen      setter     washing  windowpane        bury        cart 
0.006213103 0.004053254 0.004053254 0.004053254 0.003136025 0.003106551 
       gate      horses       apron     seasons    squirrel        beds 
0.003106551 0.003106551 0.003039940 0.003039940 0.003039940 0.002485241 
     butter       clean         wet      winter        baby        boil 
0.002485241 0.002485241 0.002485241 0.002063764 0.002026627 0.002026627 
   cabbages    carbolic       clara      cradle        eggs      ernest 
0.002026627 0.002026627 0.002026627 0.002026627 0.002026627 0.002026627 

My data doesn’t quite match Ramsay & Steger’s (qtd. in Ramsay 13); look at the Louis data to see what I mean (I’ve reordered the terms alphabetically so that you can see the similarities and differences more easily):

Louis

Ramsay & Steger     Me
accent              accent
attempt
australian          australian
average             average
beast               beast
beaten              beaten
                    boasting
bobbing             bobbing
                    boys
clerks              clerks
custard             custard
discord
disorder
eating-shop         eatingshop
england             england
eyres               eyres
four-thirty         fourthirty
grained             grained
ham                 ham
mr
nile                nile
                    oak
pitchers            pitchers
                    stamps
steel               steel
thou                thou
western             western
wilt                wilt

The terms fourthirty and eatingshop are victims here of the way R removed punctuation. R can also explain another of the differences: Ramsay’s list has the word mr, which my list lacks; mr is on the list of stopwords I removed from the text. But the others? I don’t have any explanation for those. Ramsay’s list has these words, which my list lacks (in addition to mr): attempt, discord, and disorder. And my list has oak, stamps, boys, and boasting, which theirs lacks.
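
(Incidentally, a quick way to check which of the disputed words the stoplist would have eaten:)

# Which of the disputed terms are on tm's English stoplist?
c('mr', 'attempt', 'discord', 'disorder') %in% stopwords('english')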

Well, so, okay; but pretty good, right? Well, maybe not. It only gets worse. Here is a summary of the discrepancies for the other characters:

Bernard (4 Shared)

Here my list and Ramsay & Steger’s are very different.

The lists share only four terms: letter, curiosity, simple, and canopy.

Ramsay & Steger’s list then has: arrive, bandaged, bowled, brushed, buzzing, complex, concrete, deeply, detachment, final, getting, hoot, hums, important, low, moffat, rabbit, thinks, tick, tooth.

important would be removed by my stoplist… the rest, though, should otherwise be in my list.

But mine has: thats, hampton, lady, ones, elderly, heaven, married, observed, byron, phrases, dinner, willow, phrase, fin, describe, self, stick, sense, nature, thinking.
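
(Rather than comparing by eye, the overlap can be computed directly; here ramsay_bernard is their published top-24 list for Bernard, typed in by hand from page 13:)

# Ramsay & Steger's top-24 Bernard terms, hand-typed from the book:
ramsay_bernard <- c('letter', 'curiosity', 'simple', 'canopy', 'arrive',
                    'bandaged', 'bowled', 'brushed', 'buzzing', 'complex',
                    'concrete', 'deeply', 'detachment', 'final', 'getting',
                    'hoot', 'hums', 'important', 'low', 'moffat', 'rabbit',
                    'thinks', 'tick', 'tooth')
intersect(names(bernard)[1:24], ramsay_bernard)  # the four shared terms
setdiff(names(bernard)[1:24], ramsay_bernard)    # terms only on my list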

Let’s look at some of the words and try to sort this out; hoot seems a pretty distinctive word. Going back through the text, I find seven instances of hoot or hoots. They break down simply by character: every one of them is Bernard’s.

Well, hoot seems unique to Bernard. Okay, let me jump back into R.

> inspect(dtm[,c('hoot')])
A document-term matrix (6 documents, 1 terms)

Non-/sparse entries: 1/5
Sparsity           : 83%
Maximal term length: 4 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs                  hoot
  bernard.txt 0.0008135849
  jinny.txt   0.0000000000
  louis.txt   0.0000000000
  neville.txt 0.0000000000
  rhoda.txt   0.0000000000
  susan.txt   0.0000000000

Just so that we aren’t confused, let’s grab the raw counts (rather than the tf-idf scores).

> raw <- DocumentTermMatrix(characters)
> inspect(raw[,c('hoot')])
A document-term matrix (6 documents, 1 terms)

Non-/sparse entries: 1/5
Sparsity           : 83%
Maximal term length: 4 
Weighting          : term frequency (tf)

             Terms
Docs          hoot
  bernard.txt    4
  jinny.txt      0
  louis.txt      0
  neville.txt    0
  rhoda.txt      0
  susan.txt      0

Well, that’s no help then; hoot is unique to Bernard. At this point I begin to suspect something unpleasant. Maybe in my manual data munging, I bollocks’d something. It seems like I got the occurrences of hoots in there, attributed to the right person (though maybe I deleted some other hoots?); but if I deleted something, or double-pasted something, that could change the complexion of the corpus as a whole, and so dilute the score (or inflate the scores of some of these other terms showing up in my list).

So, at this point I went back and reprocessed the file again, to ensure I hadn’t broken anything. I used this bit of elisp (courtesy of this) to remove hard newlines within a paragraph (I included it in a macro for a first pass):

(defun remove-line-breaks ()
  "Remove line endings in a paragraph."
  (interactive)
  (let ((fill-column (point-max)))
    (fill-paragraph nil)))

And I ran it again. My scores shifted ever so slightly, but my top terms for Bernard remained the same.

Back in R, let’s compare my lowest-ranked top term, canopy, with hoot again:

> inspect(dtm[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               canopy         hoot
  bernard.txt 0.001219801 0.0008132009
  jinny.txt   0.000000000 0.0000000000
  louis.txt   0.000000000 0.0000000000
  neville.txt 0.000000000 0.0000000000
  rhoda.txt   0.000000000 0.0000000000
  susan.txt   0.000000000 0.0000000000
> inspect(raw[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency (tf)

             Terms
Docs          canopy hoot
  bernard.txt      6    4
  jinny.txt        0    0
  louis.txt        0    0
  neville.txt      0    0
  rhoda.txt        0    0
  susan.txt        0    0

That is to say, canopy, based on my raw scores, does look more distinctive than hoot. What about moffat (from Mrs Moffat in the text), which ranks high on Ramsay & Steger’s list, but not at all on mine? (So, if mr showed up in their analysis, why not mrs here? Because other characters talk about other Mrses—Mrs Crane, Mrs Constable.)

> inspect(dtm[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               moffat      canopy
  bernard.txt 0.001219801 0.001219801
  jinny.txt   0.000000000 0.000000000
  louis.txt   0.000000000 0.000000000
  neville.txt 0.000000000 0.000000000
  rhoda.txt   0.000000000 0.000000000
  susan.txt   0.000000000 0.000000000
> inspect(raw[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency (tf)

             Terms
Docs          moffat canopy
  bernard.txt      6      6
  jinny.txt        0      0
  louis.txt        0      0
  neville.txt      0      0
  rhoda.txt        0      0
  susan.txt        0      0
 

So moffat’s score is just the same as canopy’s (but there are a lot of terms with that score, and terms with the same score are then ranked alphabetically, so it gets pushed off our top-24 list). (Ramsay & Steger’s scores are likewise ranked alphabetically when they have equal scores; have a look at those lists on page 13, and you’ll see islands of alphabetical ordering.)
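
(To see just how crowded that score is, one can list every one of Bernard’s terms tied with canopy:)

# All of Bernard's terms tied at canopy's tf-idf score; moffat should
# be in this cohort, alphabetically after canopy.
names(bernard[bernard == bernard['canopy']])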

So, let me jump back to my initial, raw file; I check there, and Moffat indeed occurs 6 times.
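
(For the record, that check can be run from R as well, straight from the raw file:)

# Count whole-word occurrences of Moffat in the raw text.
txt <- paste(readLines('the-waves.txt'), collapse=' ')
length(gregexpr('\\bMoffat\\b', txt)[[1]])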

So what on earth is going on here? At this point, I don’t know. Here are, I think, the possibilities: either I miscarved the data somewhere in my manual munging; or Ramsay & Steger’s corpus was carved differently from mine; or there is some difference in how the tf-idf scores themselves were computed. The fact that the greatest discrepancy comes from the character with the most monologue data is perhaps meaningful, but how it’s meaningful is not obvious.

After tinkering for a bit, I suspected that the first of these might be so. But looking at the raw counts for my data makes me doubt that. One thing you might suspect, if carving up the text into characters’ monologues were the problem, would be that some key term might be misattributed; but, for instance, my raw counts of catullus seem consistent with Ramsay & Steger’s results:

> inspect(dtm[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 5/7
Sparsity           : 58%
Maximal term length: 8 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs                 story    catullus
  bernard.txt 0.0011797090 0.000124653
  jinny.txt   0.0000000000 0.000000000
  louis.txt   0.0002763958 0.000000000
  neville.txt 0.0031438302 0.002076189
  rhoda.txt   0.0000000000 0.000000000
  susan.txt   0.0000000000 0.000000000
> inspect(raw[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 5/7
Sparsity           : 58%
Maximal term length: 8 
Weighting          : term frequency (tf)

             Terms
Docs          story catullus
  bernard.txt    15        1
  jinny.txt       0        0
  louis.txt       1        0
  neville.txt    12        5
  rhoda.txt       0        0
  susan.txt       0        0

That is, Ramsay & Steger think Catullus is distinctive for Neville. And, indeed, it appears to be so. The difference between my results and theirs is in the tf-idf score—that is, in how distinctive it is. If their corpus were constructed differently from mine in some way, that alone might affect how distinctive the term appears.
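
To make that concrete, here is a toy calculation (with made-up document lengths) using the normalized weighting sketched earlier: the same five occurrences of catullus, at the same document frequency, score differently as the document around them grows or shrinks.

# Same raw count and document frequency, different carving:
toy_tfidf <- function(count, doc_len, n_docs, doc_freq)
  (count / doc_len) * log2(n_docs / doc_freq)
toy_tfidf(5, 4000, 6, 2)   # ~0.00198
toy_tfidf(5, 3500, 6, 2)   # ~0.00226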

So, there may be a data-carving problem; who miscarved, though, is not obvious from this data, I don’t think. It is also possible that there is some algorithmic difference; I am using the tf-idf algorithm built into R as a sort of black box. My scores are very different from the ones Ramsay & Steger share on pg. 12, so we’re definitely doing something different, and that might account for these differences. (One suspicion, if my hand-rolled sketch of tm’s weighting above is right: the normalized weighting divides each count by the document’s total length, so Bernard’s enormous monologue would depress all of his scores; a different normalization would reorder things.) What is clear is that I need to return to the algorithm to better understand what’s going on here.

For now, though, I don’t know. I’ll here just summarize the data for the rest of the characters. (These are using the reprocessed data, so they may be a little different from above; there were no differences in top terms for Louis or Bernard, and these scores were extracted using exactly the same code as above.)

Neville (12 Shared)

Jinny (20 Shared)

Rhoda (13 Shared)

Susan (16 Shared)

Oh, and here is the breakdown of the amount of text I have for each character:

wc *.txt
      46   32608  182921 bernard.txt
      33    6331   34467 jinny.txt
      46    8905   49588 louis.txt
      39   10011   55543 neville.txt
      40    8147   44839 rhoda.txt
      34    6131   33023 susan.txt
     238   72133  400381 total

Bernard has the most (of course: the final chapter is offered entirely in his voice), followed by Neville, Louis, Rhoda, Jinny, and Susan.

Works Cited

Ramsay, Stephen. Reading Machines. Urbana: U of Illinois P, 2011. Print.