Chris Forster

(de)Bugging Ramsay—A Last Stab at tfidf and "The Waves"

In my last post I talked about some of the challenges of reproducing the analysis, offered in Stephen Ramsay's Reading Machines, of the distinctiveness of characters' vocabulary in Woolf's The Waves. Here I follow up a bit more on why I was unable to get the same results Ramsay reports.

R, tm, and weightTfIdf

To get R to generate tf-idf scores, one can do the following:

# Generate a Document Term Matrix
dtm <- DocumentTermMatrix(corpus, control=list(weighting=weightTfIdf))

This was how I got my scores last time, and those scores didn't match the ones reported by Ramsay. So, one explanation is that I bollocks'd the data; the other is that there may be something going on under the hood.

To explore this latter possibility, I had a look at the weightTfIdf function. (Doing so is easy; if you're in an R console, with the tm package loaded, just type weightTfIdf and you'll get all 26 lines of the code.) It is worth noting that by default, the function normalizes its scores by document length. That is, rather than calculating the tf-idf score using a raw count of the number of times a particular term appears in a single document (in our case, the number of times a particular character uses a word), it divides the raw term frequency by the total number of words in that document (in our case, the total number of words said by a character). This normalizing process explains why the scores Ramsay reports for Louis (12) start at 5.9, while my own scores were always small numbers (0.006..., 0.003..., etc.).
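To make the normalization concrete, here is a minimal base-R sketch of the two calculations. The counts and the tfidf helper are invented for illustration; they come neither from The Waves nor from the tm source, and the helper only approximates what I understand weightTfIdf to do.

```r
# Toy counts (hypothetical): rows are documents, columns are terms.
counts <- matrix(c(6, 0, 2,
                   0, 3, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("a.txt", "b.txt"),
                                 c("grained", "kitchen", "said")))

# Hypothetical helper: tf-idf with or without length normalization.
tfidf <- function(counts, normalize = TRUE) {
  # Normalized tf divides each row by that document's total word count.
  tf  <- if (normalize) counts / rowSums(counts) else counts
  # idf uses the number of documents in which each term appears at all.
  idf <- log2(nrow(counts) / colSums(counts > 0))
  sweep(tf, 2, idf, `*`)   # multiply each term's column by its idf
}

raw_scores  <- tfidf(counts, normalize = FALSE)
norm_scores <- tfidf(counts, normalize = TRUE)
```

The normalized scores are just the raw scores divided by each document's length, which is why they come out so much smaller. A term used by every document ("said" here) scores zero either way.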

Short digression on misunderstanding the code

When I first dug into the weightTfIdf code, I thought I had discovered a difference in implementation. (In order to work with the data, this function transposes the Document Term Matrix into a Term Document Matrix; this means that rather than the columns being terms and the rows texts, the rows are terms and the columns documents. I don't know why the function does this; but it does, and that explains why we're looking at row sums in what follows rather than column sums.) The line in question:

lnrs <- log2(nDocs(m)/rs)

Here we're calculating part of the tf-idf score—the part that corresponds to Ramsay's log(N/df). Ignore the log2 for a moment and look at that divisor, rs. That should represent the number of documents which contain a particular term. How is that calculated? Well, the code says:

rs <- row_sums(m > 0)

That row_sums function comes from the slam package; and when I first saw this, I thought I had my explanation. Ah ha! If the divisor in the logarithm is the sum of a row, then it is dividing the number of documents not by the number of documents which contain the specified term, but by the sum of the row; that is, by the total occurrences of that term! Here is our explanation!

"Well, but what is that comparison, m > 0, doing in the function call," you ask? Good question! Well, I assumed that it was just a way to pass non-zero values to the row_sums function, and went about trying to rewrite the function properly. This heady delirium lead to questions on StackOverflow and wasted time. Because, of course, I totally misunderstood how this part of the code works.

The trick is that a relational operator (like >) applied to a matrix returns a matrix of boolean values, evaluating that expression for each item in the matrix. So, each cell in the term document matrix is now either TRUE or FALSE based on whether the value of that term was greater than 0. So it might look something like this:

             Docs
Terms         bernard.txt jinny.txt louis.txt neville.txt rhoda.txt susan.txt
  absorption         TRUE     FALSE     FALSE       FALSE     FALSE     FALSE
  abstract           TRUE     FALSE     FALSE       FALSE     FALSE     FALSE
  abstraction       FALSE     FALSE      TRUE       FALSE     FALSE     FALSE
  absurd             TRUE     FALSE     FALSE        TRUE     FALSE     FALSE
  absurdity          TRUE     FALSE     FALSE        TRUE     FALSE     FALSE
  absurdly          FALSE     FALSE     FALSE        TRUE     FALSE     FALSE

And, since TRUE is treated as numerically equivalent to 1, and FALSE to 0, if we were to sum these rows, we would get... exactly what we were looking for; i.e., the number of documents in which the term appears. So, that was wasted time.
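Here is the same point as a small base-R sketch, with made-up counts. I use base R's rowSums as a stand-in for slam's row_sums, on the assumption that the two behave alike for this purpose.

```r
# Toy term-document matrix (hypothetical counts): rows are terms.
m <- matrix(c(3, 0, 0,
              1, 0, 2,
              0, 0, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("absorption", "absurd", "abstraction"),
                            Docs  = c("bernard.txt", "jinny.txt", "louis.txt")))

occurs <- m > 0                       # logical matrix: TRUE where a term occurs

total_occurrences <- rowSums(m)       # what I wrongly assumed was being computed
doc_freq          <- rowSums(occurs)  # number of documents containing each term
```

For absurd, total_occurrences is 3 but doc_freq is 2, and it is the latter that belongs in the divisor.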

Another thing to note here is that weightTfIdf uses log2(), the binary logarithm (which returns the exponent to which you would raise 2 in order to get the specified value; i.e. log2(16) = 4, because 2^4 = 16). The logarithm here works to essentially scale the term's frequency based on how specific the term is to any particular document; terms which occur in all (or nearly all) documents will have N equal (or nearly equal) to df, or N/df nearly equal to 1. And log(1) == log2(1) == 0. That is, it will push the weight to (or towards) 0. Whereas a term which occurs in only one document will maximize N/df, at N/1. A logarithmic function is used simply to dampen that effect, preventing scores from increasing linearly in the case of terms highly specific to a single document.

Does it make a difference whether one uses log2() or the natural logarithm or the base 10 "common" logarithm I learned back in school? I don't really know; I doubt it. The particular logarithmic function one chooses will change the score, but it wouldn't change the relationship among the terms (which had the highest score).
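This is easy to check, since log2(x) is just log(x) divided by log(2): switching base multiplies every score by the same constant. A quick sketch, with term names and counts invented for the purpose:

```r
# Hypothetical term frequencies and document frequencies, N = 6 documents.
tf <- c(alpha = 6, beta = 10, gamma = 4, delta = 6)
df <- c(alpha = 1, beta = 2,  gamma = 1, delta = 2)
N  <- 6

score_log2  <- tf * log2(N / df)
score_ln    <- tf * log(N / df)
score_log10 <- tf * log10(N / df)

# The scorings differ only by a constant factor (1 / log(2) here)...
ratio <- score_log2 / score_ln

# ...so they put the terms in the same order.
rank_log2  <- order(score_log2,  decreasing = TRUE)
rank_ln    <- order(score_ln,    decreasing = TRUE)
rank_log10 <- order(score_log10, decreasing = TRUE)
```

So the choice of base can account for scores of a different magnitude, but not for a different ranking.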

Hand Simulations

After all that, I tried to tinker a bit to bring results into line. I stopped normalizing my data and tried different log functions in an attempt to match Ramsay's scores. Well, okay, let's take one last look.

Despairing, I returned to the clearest data Ramsay gives us, the list of terms and scores which appears on page 12, and picked a few to see if I could manage this by hand. (My high school computer science teacher (we switched from Pascal to C++ midway through my high school career) used to make us "hand simulate" algorithms; print out the source code, jot down the variable names, and debug by hand. Smart guy. Though the phrase "hand simulation" may have been unfortunate.) (I don't really mean by hand of course. Just... more slowly.) Here are the terms Ramsay lists for Louis, which have a score of 5.0021615: australian, beast, grained, thou, wilt.

So, I returned to my data.

> rawdtm <- DocumentTermMatrix(characters)  # A basic Document Term Matrix of raw frequency counts
> terms <- c('australian','beast','grained','thou','wilt')
> inspect(rawdtm[,terms])
A document-term matrix (6 documents, 5 terms)

Non-/sparse entries: 6/24
Sparsity           : 80%
Maximal term length: 10 
Weighting          : term frequency (tf)

             Terms
Docs          australian beast grained thou wilt
  bernard.txt          1     0       0    0    0
  jinny.txt            0     0       0    0    0
  louis.txt            6     6       6    6    6
  neville.txt          0     0       0    0    0
  rhoda.txt            0     0       0    0    0
  susan.txt            0     0       0    0    0

This is already puzzling. These five terms all had the same score. Yet australian occurs six times in Louis, but across two texts, while the other terms occur six times in Louis, and only in Louis.

So, back to the data. I open up the original Gutenberg text and search through it for australian. And, indeed, it looks like 7 total occurrences, one of which is from Bernard:

I thought how Louis would mount those steps in his neat suit with his cane in his hand and his angular, rather detached gait. With his Australian accent ("My father, a banker at Brisbane") he would come, I thought, with greater respect to these old ceremonies than I do, who have heard the same lullabies for a thousand years.

Let's check grained. I found six occurrences; all in Louis's speech and, as I was doing this by hand, I noticed that all those occurrences of grained were in the phrase grained oak:

Why does grained occur in Ramsay & Steger's list, but not oak? Well, maybe everyone talks about oak trees, but only Louis talks about grained oak. Back to our data:

>inspect(rawdtm[,c('grained','oak')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 7 
Weighting          : term frequency (tf)

             Terms
Docs          grained oak
  bernard.txt       0   2
  jinny.txt         0   0
  louis.txt         6  10
  neville.txt       0   0
  rhoda.txt         0   0
  susan.txt         0   0

Hmm... well, let's do the math for these three terms (the first two of which got the same score in Ramsay's analysis; the last of which didn't appear at all). For these three terms, australian, grained, and oak for Louis, we have:

term          tf    df    N    1+tf*log2(N/df)    1+tf*log10(N/df)    1+tf*log(N/df)
australian     6     2    6       10.50978            3.862728            7.59167
grained        6     1    6       16.50978            5.668909           11.75056
oak           10     2    6       16.84963            5.771213           11.98612

So oak should have the highest score of the three.
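For anyone who wants to redo this arithmetic, here is a short sketch. The tf and df values are read off the raw document-term matrices above; the 1 + tf * log(N/df) form is the one in the table header, and I make no claim that it is the formula Ramsay & Steger actually used.

```r
# Raw counts read off the document-term matrices above.
hand <- data.frame(
  term = c("australian", "grained", "oak"),
  tf   = c(6, 6, 10),   # occurrences in Louis's speech
  df   = c(2, 1, 2)     # number of characters who use the term
)
N <- 6                  # six characters, so six documents

hand$log2  <- 1 + hand$tf * log2(N / hand$df)
hand$log10 <- 1 + hand$tf * log10(N / hand$df)
hand$ln    <- 1 + hand$tf * log(N / hand$df)

# Show the terms from highest to lowest score (base 2).
hand[order(hand$log2, decreasing = TRUE), ]
```

Whichever logarithm is used, oak comes out on top and australian at the bottom.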

Something is clearly wrong.

I've returned to the raw counts, double checked them against the original file; and computed these things individually. I've tried to imagine all possible ways, but have been unable to produce any of the scores listed on pg 12.

I suspect that there may be a problem in Ramsay and Steger's data. There is significant overlap between Ramsay and Steger's results and my own. It is for Louis that our data is most consistent; we share 19 of the top 24 terms. It is least similar for Bernard (sharing only 4/24 terms) and Rhoda (11/24). Did Ramsay & Steger's analysis, perhaps, discard the final chapter where only Bernard talks (perhaps deliberately)? That would obviously greatly change the Bernard data set; and, by removing so much text, it could easily affect the df term for other characters, accounting for the other discrepancies. (But it wouldn't explain oak.)

It may very well be my data that is the problem, but if so I've been unable to locate the error. I'm abandoning trying to reconcile these approaches to a method that should, in principle, be eminently reproducible. I will though summarize the top 24 terms with highest tf-idf scores, using the best data I've got, and not normalizing.

> louis[1:24]
   western        oak      beast    grained       thou       wilt     accent 
 23.264663  15.849625  15.509775  15.509775  15.509775  15.509775  14.000000 
  boasting       nile    average     clerks     stamps australian       boys 
 12.679700  12.679700  10.339850  10.339850  10.000000   9.509775   8.000000 
  pitchers      steel     beaten      boast    bobbing    custard eatingshop 
  7.924813   7.924813   7.754888   7.754888   7.754888   7.754888   7.754888 
   england      eyres        ham 
  7.754888   7.754888   7.754888 
> bernard[1:24]
curiosity   hampton    letter   phrases     byron   elderly    heaven   married 
 25.84963  23.77444  23.26466  22.22858  22.18948  20.67970  20.67970  20.67970 
 observed    dinner    phrase    willow       fin    simple  describe      self 
 20.67970  20.60451  20.00000  19.01955  18.09474  18.09474  17.43459  17.43459 
    stick     sense     story    nature  pictures  thinking    canopy     enemy 
 17.43459  16.37895  16.00000  15.84963  15.84963  15.84963  15.50978  15.50978 
> neville[1:24]
      story      doomed immitigable      papers       byron    catullus 
  12.000000   10.339850   10.339850   10.339850    7.924813    7.924813 
      cheep  perfection       camel      detect         don   hosepipes 
   7.924813    7.924813    7.754888    7.754888    7.754888    7.754888 
     hubbub       loads      mallet      marvel   squirting       waits 
   7.754888    7.754888    7.754888    7.754888    7.754888    7.754888 
       boys     founder      knives      pocket       scene shakespeare 
   7.000000    6.339850    6.339850    6.339850    6.339850    6.339850 
> jinny[1:24]
    tunnel   prepared  billowing       game     native      peers    quicker 
  9.509775   7.924813   7.754888   7.754888   7.754888   7.754888   7.754888 
melancholy     bodies       band    cabinet      coach       crag     dazzle 
  6.339850   5.264663   5.169925   5.169925   5.169925   5.169925   5.169925 
    deftly   equipped   eyebrows     felled  haymarket       jump    lockets 
  5.169925   5.169925   5.169925   5.169925   5.169925   5.169925   5.169925 
  matthews   murmured    prepare 
  5.169925   5.169925   5.169925 
> rhoda[1:24]
   oblong      dips     tiger    fuller   swallow    fallen     steep suspended 
18.094738 12.924813 11.094738 10.339850  9.509775  8.000000  7.924813  7.924813 
   cliffs   minnows      pond    terror     bunch      foam     party    puddle 
 7.754888  7.754888  7.754888  7.000000  6.339850  6.339850  6.339850  6.339850 
    pools   violets       bow   caverns     chirp     choke    column   columns 
 6.000000  6.000000  5.169925  5.169925  5.169925  5.169925  5.169925  5.169925 
> susan[1:24]
   kitchen     setter    washing       bury       cart       gate      apron 
 15.849625  10.339850  10.339850   8.000000   7.924813   7.924813   7.754888 
   seasons   squirrel windowpane       beds     butter      clean        wet 
  7.754888   7.754888   7.754888   6.339850   6.339850   6.339850   6.339850 
     blown     winter       baby     bitten       boil   cabbages   carbolic 
  6.000000   5.264663   5.169925   5.169925   5.169925   5.169925   5.169925 
     clara     cradle       eggs 
  5.169925   5.169925   5.169925 

With Thanks to Woolf and emacs, Reading 'The Waves' with Stephen Ramsay

I am currently teaching a graduate course (eng630: "Digital Humanities": Emerging Tools and Debates in Literary Study) and, as much as possible, I'm trying to make clear the mechanics behind some of the text-analysis in the works we're reading. So, this week, as I prepared to discuss Stephen Ramsay's Reading Machines, I wanted to reproduce some of the analysis done there. The first chapter, for instance, offers a tf-idf reading of Woolf's The Waves. Here is how Ramsay describes it:

It is possible—and indeed an easy matter—to use a computer to transform Woolf's novel into lists of tokens in which each list represents the words spoken by the characters ordered from most distinctive to least distinctive term. Tf-idf, one of the classic formulas from the field of information retrieval, endeavours to generate lists of distinctive terms for each document in a corpus. We might therefore conceive of Woolf's novel as a 'corpus' of separate documents (each speaker's monologue representing a separate document), and use the formula to factor the presence of a word in a particular speaker's vocabulary against the presence of the word in other speakers' vocabularies. (11)

This post summarizes how I tried to do just that, and the different results I got. I'm not sure what accounts for the differences from Ramsay's (and Sara Steger's) results; I'll try to show you what I mean below. In a future post I'll use the same "method" on a different text (spoiler: it's Ulysses).

Readers familiar with The Waves, and the demands of text processing, will immediately recognize why the analysis of the characters' monologues would present itself as a tractable problem ("indeed an easy matter"). While, in theory, one could do a similar analysis for any novel (or any work with multiple speakers), the narrative structure of The Waves makes it particularly available to this sort of analysis. Chapters describing the process of the sun across the sky in the course of a single day alternate with chapters in which characters speak in semi-monologue about their lives. This device itself is the novel's most obvious departure from the conventions of narrative fiction, but it also makes it "an easy matter" (well, maybe for some people) to extract these dialogues. If you had good, marked-up data, you could easily extract this information (as Lincoln Mullen shows in this post, working with the Folger's TEI Shakespeare); but if all you have is unstructured plaintext, you're going to have a problem. Woolf's novel though, even in plaintext, carries a good deal of this informational structure in its novelistic form (there is, as they say, no such thing as an unmarked text).

Here is a chunk of The Waves, quoted at random:

'Where is Bernard?' said Neville. 'He has my knife. We were in the tool-shed making boats, and Susan came past the door... Now we must drop our toys. Now we must go in together. The copy-books are laid out side by side on the green baize table.'

'I will not conjugate the verb,' said Louis, 'until Bernard has said it...

There is always a short phrase (starting with an opening single quotation mark—i.e. an apostrophe—and a capital letter), some text, a closing single quote (variously punctuated), the word said followed by a character name and some punctuation mark, an opening single quotation mark and some words. This single "monologue" may continue into the next paragraph (which would then, consistent with convention, be opened by a single quotation mark—i.e. an apostrophe). Finally the monologue is closed by an apostrophe before the narrative turns to another character (and another opening apostro-quote), or to one of those sun-dappled interludes.

Whew; describing what is so obvious to any reader of the text is painful (as, I imagine, is reading my description of it), but it is this highly structured convention which makes Woolf's novel comparatively available to processing. Even absent TEI (or other) markup, Woolf's convention creates an ad-hoc ordered hierarchy of content objects, at least for the reader interested in the characters' monologues. Someone with more regex-fu than I have might be able to cut out character dialogue programmatically. (I think the regex would go something like /'([^,]+,)' said Louis, '(.*)'/ and would capture, as $1 and $2, the material the character says... in theory. I tried to sort this out, but quickly gave up.) Instead, I manually paged through the text and pasted together all the text said by a single character, so that everything from that opening apostro-quote to the closing apostro-quote would be on one line, and the phrase said [character name] would occur somewhere near the front of that line. (In emacs, checking twitter and listening to podcasts, this represented an hour and a half's labor; labor, mind you, which was sufficiently mindless that I enjoyed a beer. Though, as you'll see below, this fact led me to redo the entire thing.)

And so, with thanks to Woolf for her highly structured departure from novelistic convention, and to emacs for keybindings that made this somewhat less loathsome, you're ready to extract your data. It is now simply a matter of grepping the file for each character:

grep 'said Louis' the-waves.txt > characters/louis.txt
grep 'said Neville' the-waves.txt > characters/neville.txt
...

And so on. (The text Ramsay & Steger use, and that I also used, comes from Project Gutenberg Australia. Because of copyright, I cannot share the processed data I am working with, and, as you'll see, this extraction process is a crucial step. If you'd be interested in seeing or using this data, to save yourself the hour and a half's labor, however, just drop me an email. I have relatives in Australia who would be happy to host you during the term of your interaction with this copyrighted material.)
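The same filtering could be done without leaving R. This is only a sketch: the two sample lines are shortened from the excerpt quoted above and stand in for readLines("the-waves.txt"), and it assumes (as the grep commands do) that each monologue already sits on a single line.

```r
# Stand-in for readLines("the-waves.txt"), one monologue per line.
waves <- c(
  "'Where is Bernard?' said Neville. 'He has my knife.'",
  "'I will not conjugate the verb,' said Louis, 'until Bernard has said it.'"
)

speakers <- c("Bernard", "Jinny", "Louis", "Neville", "Rhoda", "Susan")

# For each speaker, keep the lines containing "said <Name>", as grep does.
by_speaker <- lapply(setNames(speakers, tolower(speakers)), function(name) {
  waves[grepl(paste("said", name), waves, fixed = TRUE)]
})

# To write the files out (assumes the characters/ directory already exists):
# for (who in names(by_speaker))
#   writeLines(by_speaker[[who]], file.path("characters", paste0(who, ".txt")))
```

Both sample lines mention Bernard, but neither contains "said Bernard", so neither is attributed to him. That is the behaviour the whole one-monologue-per-line exercise was meant to guarantee.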

At which point, the actual, real analysis begins. Here is the code I used, in R (using the tm package), to get my results. It assumes that each individual's speech is contained in a single text file in a directory immediately below the working directory, called 'characters' (that's what all that grepping above was about).

# This code relies on the tm (text mining) package
library('tm')

# Create a corpus based on the subdirectory
characters <- Corpus(DirSource('characters/'))

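# To aid processing, let's make everything lower-case. (Recent versions of
# tm want tolower wrapped in content_transformer(); older versions accepted
# the bare function.)
characters <- tm_map(characters, content_transformer(tolower))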

# And remove punctuation
characters <- tm_map(characters,removePunctuation)

# And we'll remove stopwords. This step is optional, but in the version
# of the code I'm pasting here I removed them, in an effort (to no avail,
# alas!) to match Ramsay & Steger's result.
characters <- tm_map(characters, removeWords, stopwords('english'))

# Now, we create a Document Term Matrix: that is, a set of the
# frequencies for each word in each document. The secret
# sauce is that control=list(weighting=weightTfIdf) argument, which
# asks that those not be raw counts, but tfidf scores.
dtm <- DocumentTermMatrix(characters, control=list(weighting=weightTfIdf))

And here is just a taste of what that looks like. This code asks R (personify much?) to let me see the 45th through 55th terms in the matrix (the terms are arranged alphabetically) for all texts. (You access the matrix by requesting row and column: matrix[row, column]; so the empty row field requests all rows (that is, all texts), and columns (which represent words) 45 through 55, a range chosen entirely at random.)

> inspect(dtm[,45:55])
A document-term matrix (6 documents, 11 terms)

Non-/sparse entries: 16/50
Sparsity           : 76%
Maximal term length: 13 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               account     accounts   accretions accumulating accumulation
  bernard.txt 0.0002033962 0.0002033962 0.0001247118 0.0002033962 0.0002033962
  jinny.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  louis.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  neville.txt 0.0000000000 0.0000000000 0.0004414937 0.0000000000 0.0000000000
  rhoda.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  susan.txt   0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
             Terms
Docs          accumulations     accuracy   accurately      achieve         acid
  bernard.txt  0.0002033962 0.0002033962 0.0000000000 0.0001247118 0.0000000000
  jinny.txt    0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
  louis.txt    0.0000000000 0.0000000000 0.0000000000 0.0004378349 0.0004378349
  neville.txt  0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0004414937
  rhoda.txt    0.0000000000 0.0000000000 0.0007776662 0.0000000000 0.0000000000
  susan.txt    0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
             Terms
Docs           acknowledge
  bernard.txt 0.0000786844
  jinny.txt   0.0000000000
  louis.txt   0.0002762431
  neville.txt 0.0002785515
  rhoda.txt   0.0000000000
  susan.txt   0.0000000000

So, this shows us, for instance, that Bernard, Louis, and Neville all use the word acknowledge (Jinny, Rhoda, and Susan don't); and Louis and Neville use it more than Bernard (and at almost exactly the same rate as each other).

At this point, we've got the data. All that's needed is a little R data-finesse to get it back out in the order we want it. I'm quite new to R, so I may be missing the better/more obvious way to do this, but this way seems to work. I load the data into a matrix, and then extract it into lists (I think I'm getting my R data types right), ordered by the word's score. We can then output as many (score, term) pairs from the re-ordered lists as we want (say, the top 24 terms).

m           <- as.matrix(dtm)

bernard     <- sort(m[1,], decreasing=TRUE)
jinny       <- sort(m[2,], decreasing=TRUE)
louis       <- sort(m[3,], decreasing=TRUE)
neville     <- sort(m[4,], decreasing=TRUE)
rhoda       <- sort(m[5,], decreasing=TRUE)
susan       <- sort(m[6,], decreasing=TRUE)
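The same sorting can be written without hard-coding the row numbers, which would guard against the files being read in a different order. A sketch on an invented score matrix (standing in for as.matrix(dtm)); note that sorting a row gives back a named numeric vector rather than a list:

```r
# Hypothetical score matrix standing in for as.matrix(dtm).
scores <- matrix(c(0.2, 0.0, 0.5,
                   0.1, 0.4, 0.0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Docs  = c("bernard.txt", "louis.txt"),
                                 Terms = c("canopy", "oak", "phrase")))

# Sort each document's scores, looking rows up by name rather than position.
ranked <- lapply(rownames(scores), function(doc) {
  sort(scores[doc, ], decreasing = TRUE)
})
names(ranked) <- sub("\\.txt$", "", rownames(scores))

head(ranked$louis, 2)   # the two highest-scoring terms for this document
```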


>louis[1:24]
    western      accent     grained        thou        wilt       beast 
0.006426702 0.005691854 0.004284468 0.004284468 0.004284468 0.003570390 
   boasting        nile     average      clerks         oak      stamps 
0.003502680 0.003502680 0.002856312 0.002856312 0.002762431 0.002762431 
 australian        boys    pitchers       steel      beaten     bobbing 
0.002627010 0.002209945 0.002189175 0.002189175 0.002142234 0.002142234 
    custard  eatingshop     england       eyres  fourthirty         ham 
0.002142234 0.002142234 0.002142234 0.002142234 0.002142234 0.002142234 

> bernard[1:24]
  thats     hampton        lady   curiosity      letter        ones 
0.002237358 0.001870677 0.001870677 0.001830566 0.001830566 0.001745965 
    elderly      heaven     married    observed       byron     phrases 
0.001627170 0.001627170 0.001627170 0.001627170 0.001621254 0.001610960 
     dinner      willow      phrase         fin      simple    describe 
0.001496542 0.001496542 0.001495004 0.001423774 0.001423774 0.001371830 
       self       stick       sense      nature    thinking      canopy 
0.001371830 0.001371830 0.001288768 0.001247118 0.001247118 0.001220377 

> neville[1:24]
      story        ones      doomed immitigable      papers       cheep 
0.003342618 0.003090456 0.002880181 0.002880181 0.002880181 0.002207469 
 perfection       camel      detect   hosepipes      hubbub       loads 
0.002207469 0.002160136 0.002160136 0.002160136 0.002160136 0.002160136 
     mallet      marvel   squirting        boys       byron     founder 
0.002160136 0.002160136 0.002160136 0.001949861 0.001765975 0.001765975 
      scene shakespeare       stair      abject   admirable        ajax 
0.001765975 0.001765975 0.001671309 0.001440091 0.001440091 0.001440091 

> jinny[1:24]
     tunnel    prepared   billowing        game      native       peers 
0.003833041 0.003194201 0.003125710 0.003125710 0.003125710 0.003125710 
    quicker  melancholy      bodies        band       bodys     cabinet 
0.003125710 0.002555361 0.002121992 0.002083807 0.002083807 0.002083807 
     coach        crag      dazzle      deftly    equipped    eyebrows 
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 
     felled     glasses        jump     lockets    matthews    murmured 
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 

> rhoda[1:24]
    oblong        dips       tiger      fuller      themoh     swallow 
0.005443664 0.003888331 0.003337767 0.003110665 0.003110665 0.002860943 
    fallen   suspended      cliffs     garland  manybacked     minnows 
0.002707581 0.002384119 0.002332999 0.002332999 0.002332999 0.002332999 
       pond   structure      terror       bunch        foam   moonlight 
0.002332999 0.002332999 0.002105897 0.001907295 0.001907295 0.001907295 
      party      puddle       dream       pools     violets     amorous 
0.001907295 0.001907295 0.001805054 0.001805054 0.001805054 0.001555332 

> susan[1:24]
    kitchen      setter     washing  windowpane        bury        cart 
0.006213103 0.004053254 0.004053254 0.004053254 0.003136025 0.003106551 
       gate      horses       apron     seasons    squirrel        beds 
0.003106551 0.003106551 0.003039940 0.003039940 0.003039940 0.002485241 
     butter       clean         wet      winter        baby        boil 
0.002485241 0.002485241 0.002485241 0.002063764 0.002026627 0.002026627 
   cabbages    carbolic       clara      cradle        eggs      ernest 
0.002026627 0.002026627 0.002026627 0.002026627 0.002026627 0.002026627 

My data doesn't quite match Ramsay & Steger's (qtd. in Ramsay 13); look at the Louis data to see what I mean (I've reordered the terms alphabetically so that you can see the similarities and differences more easily):

Louis

Ramsay & Steger    Me
accent             accent
attempt            -
australian         australian
average            average
beast              beast
beaten             beaten
-                  boasting
bobbing            bobbing
-                  boys
clerks             clerks
custard            custard
discord            -
disorder           -
eating-shop        eatingshop
england            england
eyres              eyres
four-thirty        fourthirty
grained            grained
ham                ham
mr                 -
nile               nile
-                  oak
pitchers           pitchers
-                  stamps
steel              steel
thou               thou
western            western
wilt               wilt

The terms fourthirty and eatingshop are victims here of the way R removed punctuation. R can also explain one other of the differences: Ramsay's list has the word mr, which my list lacks. mr is on the list of stopwords I removed from the text. But the others? I don't have any explanation for those. Ramsay's list has these words, which my list lacks (in addition to mr): attempt, discord, and disorder. And my list has oak, stamps, boys, and boasting, which his lacks.
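The punctuation effect is easy to see in isolation. This sketch assumes that removing punctuation amounts to deleting every punctuation character, which is how I understand tm's removePunctuation to behave by default; the third word is just another hyphenated example from the novel.

```r
words <- c("eating-shop", "four-thirty", "tool-shed")

# Deleting punctuation outright glues the two halves together.
glued <- gsub("[[:punct:]]+", "", words)

# Replacing hyphens with a space first keeps the halves as separate tokens.
split_up <- gsub("-", " ", words, fixed = TRUE)
```

Neither choice reproduces the hyphenated eating-shop in Ramsay & Steger's list. removePunctuation does take a preserve_intra_word_dashes argument, which I believe would leave such words intact, though I haven't tested it against this corpus.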

Well, so, okay; but pretty good, right? Well, maybe not. It only gets worse for the other characters. Here is a summary of the discrepancies for the other characters:

Bernard (4 Shared)

Here my list and Ramsay & Steger's are very different.

The lists share only four terms: letter, curiosity, simple, and canopy.

Ramsay & Steger's then has: arrive, bandaged, bowled, brushed, buzzing, complex, concrete, deeply, detachment, final, getting, hoot, hums, important, low, moffat, rabbit, thinks, tick, tooth

important would be removed by my stoplist... the rest though should otherwise be in my list.

But mine has: thats, hampton, lady, ones, elderly, heaven, married, observed, byron, phrases, dinner, willow, phrase, fin, describe, self, stick, sense, nature, thinking.

Let's look at some of the words and try to sort this out; hoot seems a pretty unique word. Going back through the text, I find seven instances of hoot or hoots. They break down this way by character:

Well, hoot seems unique to Bernard. Okay, let me jump back into R.

>inspect(dtm[,c('hoot')])
A document-term matrix (6 documents, 1 terms)

Non-/sparse entries: 1/5
Sparsity           : 83%
Maximal term length: 4 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs                  hoot
  bernard.txt 0.0008135849
  jinny.txt   0.0000000000
  louis.txt   0.0000000000
  neville.txt 0.0000000000
  rhoda.txt   0.0000000000
  susan.txt   0.0000000000

Just so that we aren't confused, let's grab the raw counts (rather than the tfidf scores).

> raw <- DocumentTermMatrix(characters)
> inspect(raw[,c('hoot')])
A document-term matrix (6 documents, 1 terms)

Non-/sparse entries: 1/5
Sparsity           : 83%
Maximal term length: 4 
Weighting          : term frequency (tf)

             Terms
Docs          hoot
  bernard.txt    4
  jinny.txt      0
  louis.txt      0
  neville.txt    0
  rhoda.txt      0
  susan.txt      0

Well, that's no help then; hoot is unique to Bernard. At this point I begin to suspect something unpleasant. Maybe in my manual data munging, I bollocks'd something. It seems like I got the occurrences of hoot in there, attributed to the right person (though maybe I deleted some other hoots?); but if I deleted something, or double pasted something, that could change the complexion of the corpus as a whole, and so dilute the score (or inflate the score of some of these other terms showing up in my list).

So, at this point I went back and reprocessed the file again to ensure I didn't break anything. I used this bit of elisp (courtesy of this), which I included in a macro for a first pass, to remove hard newlines within a paragraph:

(defun remove-line-breaks ()
  "Remove line endings in a paragraph."
  (interactive)
  (let ((fill-column (point-max)))
    (fill-paragraph nil)))

And I ran it again. My scores shifted ever so slightly, but my top terms for Bernard remained the same.

Back in R, let's compare my lowest-ranked term with hoot again:

>inspect(dtm[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               canopy         hoot
  bernard.txt 0.001219801 0.0008132009
  jinny.txt   0.000000000 0.0000000000
  louis.txt   0.000000000 0.0000000000
  neville.txt 0.000000000 0.0000000000
  rhoda.txt   0.000000000 0.0000000000
  susan.txt   0.000000000 0.0000000000
> inspect(raw[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency (tf)

             Terms
Docs          canopy hoot
  bernard.txt      6    4
  jinny.txt        0    0
  louis.txt        0    0
  neville.txt      0    0
  rhoda.txt        0    0
  susan.txt        0    0
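
The two tables above fit together in a simple way. Here is a minimal sketch, in base R, of the normalized tf-idf arithmetic for Bernard: raw count divided by the document's total term count, times log2 of (number of documents / number of documents containing the term). Note that `bernard_total` is an assumption: it is not reported anywhere in this post, and I have back-computed it from the printed scores. The variable names are mine, for illustration only.

```r
# Raw counts for Bernard, copied from the inspect(raw[, ...]) output above.
raw_counts <- c(canopy = 6, hoot = 4)

n_docs   <- 6                          # six character files
doc_freq <- c(canopy = 1, hoot = 1)    # both terms occur only in bernard.txt

# ASSUMPTION: total number of terms counted for bernard.txt. This value is
# not given in the post; it is back-computed from the printed tf-idf scores.
bernard_total <- 12715

tf    <- raw_counts / bernard_total    # normalized term frequency
idf   <- log2(n_docs / doc_freq)       # inverse document frequency, base 2
tfidf <- tf * idf

print(signif(tfidf, 7))

# Same document and same document frequency, so the ratio of the scores
# is just the ratio of the raw counts (6 / 4).
print(tfidf[["canopy"]] / tfidf[["hoot"]])
```

Under these assumptions, two terms that are both unique to Bernard can only be ranked by their raw counts, which is why canopy (6) outranks hoot (4) in my list.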

That is to say, canopy, based on my raw scores, does look more distinctive than hoot. What about moffat (from Mrs Moffat in the text), which ranks high on Ramsay & Steger's list, but not at all on mine? (If mr showed up in their analysis, why not mrs here? Because other characters talk about other Mrses: Mrs Crane, Mrs Constable.)

> inspect(dtm[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs               moffat      canopy
  bernard.txt 0.001219801 0.001219801
  jinny.txt   0.000000000 0.000000000
  louis.txt   0.000000000 0.000000000
  neville.txt 0.000000000 0.000000000
  rhoda.txt   0.000000000 0.000000000
  susan.txt   0.000000000 0.000000000
> inspect(raw[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 2/10
Sparsity           : 83%
Maximal term length: 6 
Weighting          : term frequency (tf)

             Terms
Docs          moffat canopy
  bernard.txt      6      6
  jinny.txt        0      0
  louis.txt        0      0
  neville.txt      0      0
  rhoda.txt        0      0
  susan.txt        0      0

So moffat's score is just the same as canopy's (but there are a lot of terms with that score, and terms with the same score are then ranked alphabetically, so it gets pushed off our top 24 list). Ramsay & Steger's scores are likewise ranked alphabetically when they have equal scores; have a look at those lists on page 13, and you'll see islands of alphabetical ordering.
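
To make the tie-breaking concrete, here is a small sketch of ranking by score with alphabetical order as the tie-breaker. The three scores are copied from the inspect() output above; the use of `order()` with two keys is my illustration of the idea, not necessarily how any particular ranking was produced.

```r
# Scores for three of Bernard's terms, copied from the inspect() output above.
scores <- c(moffat = 0.001219801, hoot = 0.0008132009, canopy = 0.001219801)

# Sort by score (descending); ties are broken alphabetically by term name.
ranked <- scores[order(-scores, names(scores))]

print(names(ranked))
```

With many terms tied at the same score, anything late in the alphabet can fall below a top-24 cutoff even though it is exactly as distinctive as the terms above it.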

So, let me jump back to my initial, raw file; I check there, and Moffat indeed occurs 6 times.

So what on earth is going on here? At this point, I don't know. Here are, I think, the possibilities. The fact that the greatest discrepancy comes from the character with the most monologue data is perhaps meaningful, but how it's meaningful is not obvious. So:

After tinkering for a bit, I suspected that the problem lay in how the text was carved up into characters' monologues. But looking at the raw counts for my data makes me doubt that. One thing you might suspect, if the carving were the problem, would be that some key term might be misattributed; but, for instance, my raw counts of catullus seem consistent with Ramsay & Steger's results:

> inspect(dtm[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 5/7
Sparsity           : 58%
Maximal term length: 8 
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

             Terms
Docs                 story    catullus
  bernard.txt 0.0011797090 0.000124653
  jinny.txt   0.0000000000 0.000000000
  louis.txt   0.0002763958 0.000000000
  neville.txt 0.0031438302 0.002076189
  rhoda.txt   0.0000000000 0.000000000
  susan.txt   0.0000000000 0.000000000
> inspect(raw[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)

Non-/sparse entries: 5/7
Sparsity           : 58%
Maximal term length: 8 
Weighting          : term frequency (tf)

             Terms
Docs          story catullus
  bernard.txt    15        1
  jinny.txt       0        0
  louis.txt       1        0
  neville.txt    12        5
  rhoda.txt       0        0
  susan.txt       0        0

That is, Ramsay & Steger think Catullus is distinctive for Neville. And, indeed, it appears to be so. The difference between my results and theirs is in the tf-idf score; that is, in how distinctive it is. If their corpus were constructed differently from mine in some way, it might affect how distinctive it is.

So, there may be a data carving problem; who miscarved, though, is not obvious from this data, I don't think. It is also possible that there may be some algorithmic difference; I am using the tf-idf algorithm built into R as a sort of black box. My scores are very different from the ones Ramsay & Steger share on pg. 12. So we're definitely doing something different. And that might account for these differences. What's clear is that I need to return to the algorithm to better understand what's going on here.
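
As a rough illustration of how corpus construction could change a score without changing which character a term belongs to, here is a what-if sketch for catullus. The first score uses my data (catullus appears in two files); the second is purely hypothetical, imagining a corpus where Bernard's single catullus is absent. As before, `neville_total` is an assumption, back-computed from the printed scores rather than reported anywhere.

```r
n_docs <- 6

# ASSUMPTION: total number of terms counted for neville.txt, back-computed
# from the printed tf-idf scores; it is not stated in the post.
neville_total <- 3817
catullus_tf   <- 5 / neville_total     # Neville's raw count is 5

# My data: catullus occurs in two files (bernard.txt and neville.txt).
score_two_docs <- catullus_tf * log2(n_docs / 2)

# HYPOTHETICAL corpus: catullus occurs in Neville's file only.
score_one_doc <- catullus_tf * log2(n_docs / 1)

print(signif(c(two_docs = score_two_docs, one_doc = score_one_doc), 7))
print(score_one_doc / score_two_docs)
```

In this sketch a single stray occurrence in another character's file is enough to cut the score substantially, so small differences in carving could plausibly reorder a top-terms list even when the raw counts for the main speaker agree.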

For now, though, I don't know. I'll here just summarize the data for the rest of the characters. (These are using the reprocessed data, so they may be a little different from above; there were no differences in top terms for Louis or Bernard, and these scores were extracted using exactly the same code as above.)

Neville (12 Shared)

Jinny (20 Shared)

Rhoda (13 Shared)

Susan (16 Shared)

Oh, and here is the breakdown of the amount of text I have for each character:

wc *.txt
      46   32608  182921 bernard.txt
      33    6331   34467 jinny.txt
      46    8905   49588 louis.txt
      39   10011   55543 neville.txt
      40    8147   44839 rhoda.txt
      34    6131   33023 susan.txt
     238   72133  400381 total

Bernard has the most (of course, because the final chapter is offered entirely in his voice), followed by Neville, Louis, Rhoda, Jinny, and Susan.
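
For a sense of proportion, the word counts from the wc output can be turned into shares of the whole. This is just arithmetic on the numbers above; the variable names are mine.

```r
# Word counts per character, copied from the second column of the wc output.
words <- c(bernard = 32608, jinny = 6331, louis = 8905,
           neville = 10011, rhoda = 8147, susan = 6131)

# Sanity check against the total that wc reports.
stopifnot(sum(words) == 72133)

# Percentage of all monologue words spoken by each character, largest first.
share <- round(100 * sort(words, decreasing = TRUE) / sum(words), 1)
print(share)
```

Bernard's file is more than three times the size of anyone else's, which is worth keeping in mind given that normalized scores divide by document length.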

Works Cited

Ramsay, Stephen. Reading Machines. Urbana: U of Illinois P, 2011. Print.

Public Domain Editions

This is an extended version of the (less than) two minute "dork short" or "lightning talk" I gave at THATCamp Virginia a while ago (this post has been sitting in the hopper for a while). I offer an observation, an anecdote, and a suggestion.

tl;dr: I'm trying to put together an edition of Claude McKay's Harlem Shadows. Would you like to help?

An Observation:

An enormous wealth of public domain material is available on the web, from sources like Project Gutenberg or The Oxford Text Archive or The Internet Archive or Google Books or smaller projects like The Modernist Journals Project.

Yet, in my experience, these texts seem underused. (Am I wrong?)

An Anecdote:

When I was a teaching assistant for UVA's twentieth-century literature survey a few years ago, the professors taught Claude McKay's Harlem Shadows. Published in 1922, Harlem Shadows is just inside the public domain.

The text they used was a cheap (though still in the neighborhood of $15; here it is at Amazon) paperback facsimile of the 1922 edition. When I opened this slight paperback, it looked eerily familiar.

Compare:

Google Books Title Page

Kessinger Edition Title Page

The top image is from the Google Books edition; the bottom is a scan I just made of the Kessinger edition. Kessinger's "edition" of Harlem Shadows is printed from page images available at Google Books (scanned, in turn, from a copy at Indiana University library). They've cleaned up the title page a bit, but look at the distinct pencil marks. That's Kessinger's business: get new ISBNs for Google Books scans and then sell them. (When folks first noticed Kessinger doing this a while ago it caused some consternation.)

(Worth noting: there is another copy of Harlem Shadows (scanned from a copy held at Princeton) in GBooks, which misidentifies Max Eastman in the author metadata; in addition to the two Google Books copies, archive.org has two copies, one from the Library of Congress and one from the University of Toronto, all the same edition. Thoughts on easily breaking up those four PDFs and digitally collating them?)

It seems unfortunate that right now a professor who wants to teach Harlem Shadows ends up assigning Kessinger's rather ugly print-out of a Google Books PDF.

A Suggestion:

Can we do something to make public domain texts more useful? Is there a place for (some) scholars to take the lead here? Rather than paying Kessinger to print out Google Books page-scans, could we not use the (in this case, multiple sets of) page-scans available from a variety of sources to put together a lightly marked up version of the text? Couldn't we draw on existing bibliographies to make clear what the book object represented by those scans actually is? And then, from our single encoding, could we not export to multiple formats: PDF (by way of LaTeX, for folks who want to print this thing out); HTML; and ePub (etc) for eReaders?

Such an idea is not novel; it is merely an expression of the dream of a markup language like TEI. Not so long ago, a proposal for "A Git Powered Project Gutenberg" led to a discussion on Hacker News which in turn led to a hastily arranged group (which just as quickly disarranged itself), all focused around the idea of making public domain texts better. There is interest in improving the accessibility and usability of public domain texts and it isn't confined to academic literature departments.

Scholars could play a key role here by helping to establish a good text and providing annotations and glosses or other contextual material. In my wilder moments I imagine scholars providing a base text which then becomes the staple, raw ingredient in a variety of remix editions, produced for audiences varying from high school to the college classroom, and beyond. These texts in turn could be cut and remixed to produce a roll-your-own anthology.

An Acknowledgment and a Goal:

There are some excellent reasons why I shouldn't be doing this. First, in the specific case of Harlem Shadows, I am not a specialist in American, African American, or Caribbean literature in general, nor in Claude McKay's work in particular. Nor am I an expert in text markup. Nor am I sufficiently well versed in the dark bibliographical arts to really be handling the complexities of putting together a proper critical edition.

With those reservations stated, I'm trying to carve some time out to work on this nonetheless. One's reach should exceed one's grasp, else what's a public domain for? But boy would I love some help.

I've converted the plaintext, OCR'd version of Harlem Shadows available through archive.org to a lightly marked up TEI version of that text. This markup itself is worthy of scrutiny; but I wanted to have something to start with on the way to producing a proofread, bibliographically sound, TEI-version of the text; to that I'd like to add annotations and textual notes, as well as supplementary material: early reviews, maybe McKay's prose from this period, as relevant. Think Norton Critical Edition (minus the criticism, which is likely too thorny a permissions matter; though I'd love to be proved wrong on this front).

To begin:

(A minor technical note: For a while I was imagining that it would be possible to use stand-off markup to keep text and annotation completely separate. This would be great for many reasons; in theory, one could have different sets of notes for different audiences (the high school versus the college classroom; a reading versus a scholarly edition). From the little reading I've done, though, that seems not easily feasible at the moment. For software developers, however, the problem of how to combine constantly evolving sets of dependent texts is simply a fact of life; version control systems, like git, provide some help in managing this problem.)

As a preliminary schedule: begin finalizing markup of the edition by the end of the summer. Continue collecting and adding supplementary material and annotations in the Fall. Then start working on processing the text out to desired formats (the TEI Stylesheets provide a great place to start); so that this time next summer, an edition of sorts (available in multiple formats) is done.

For now I'd be interested in other folks sharing their thoughts, criticism, or enthusiasm. Or, better yet, take some of this material and fix it or fork it.

Graeber and Auden: Traps, Lies, and Love

During break, I've been enjoying reading David Graeber's Debt: The First Five Thousand Years. Graeber's description of the opposing logics of the market and the state so recalled to my mind the penultimate stanza of Auden's "September 1, 1939" that I couldn't resist quickly noting it here.

Here is Graeber:

This is the great trap of the twentieth century: on the one side is the logic of the market, where we like to imagine we all start out as individuals who don't owe each other anything. On the other is the logic of the state, where we all begin with a debt we can never truly pay. We are constantly told that they are opposites, and that between them they contain the only real human possibilities. But it's a false dichotomy. States created markets. Markets require states. Neither could continue without the other, at least, in anything like the forms we would recognize today.

These two traps are what Auden will call the romantic lie and the lie of authority; and what Graeber describes as the interdependence of the market and the state is what Auden will call (and famously regret calling) love:

All I have is a voice
To undo the folded lie,
The romantic lie in the brain
Of the sensual man-in-the-street
And the lie of Authority
Whose buildings grope the sky:
There is no such thing as the State
And no one exists alone;
Hunger allows no choice
To the citizen or the police;
We must love one another or die.

The Changing Index of Censorship

(Jan. 17 Update: In a truly remarkable show of solidarity, as you dear reader have likely noticed by now, major sites (including Wikipedia) across the internet are maintaining their protest of SOPA on January 18th. I've added this bit of javascript to show solidarity with them. Despite the shelving of the House bill, these sites all recognize the continued threat of SOPA, PIPA, or any similar legislation.)

(Jan. 16 Update: With the news that, at least for now, SOPA is shelved, I've removed the javascript mentioned below.)

Depending on when you visit this page, you may notice that a whole bunch of it is blacked out; that's a result of this bit of javascript, protesting SOPA. If you haven't heard about SOPA, here is a fine place to start.

As someone with an interest in the history of obscenity and censorship I have been impressed at how quickly groups like the EFF have described SOPA as a matter of censorship and free speech. Unlike the 1996 Communications Decency Act (I recall the blue ribbon GIFs it inspired well), the object of SOPA is not online obscenity. As these groups recognize, however, matters of "free speech" are increasingly questions of intellectual property and the technologies of copyright enforcement.

This development is entirely consistent with the narrative offered by one of my favorite early-twentieth century anti-censorship tracts, To the Pure: A Study of Obscenity and the Censor, co-authored by William Seagle and Morris Ernst (the lawyer who, a few years later, defended Ulysses in front of Judge John Woolsey). To the Pure outlines a political history of censorship which proceeds through three stages; in a section titled "The Changing Index of Censorship" Ernst and Seagle explain:

With the invention of printing in the middle of the fifteenth century, the first condition for the censorship of literature began to be fulfilled: literature was on its way to popular distribution. The three forms of censorship which we know today began to develop: (1) the religious (2) the political (3) the sexual, which is the modern culmination. The course of evolution may be stated to be from heresy to treason to obscenity. The purpose of authority remains always the same, but the index of censorship changes. Each age produces those formulae of suppression which coincide with its dominant interest. (140)

Each age produces those formulae of suppression which coincide with its dominant interest. The three chief categories of prohibited speech---blasphemy, sedition, obscenity---reflect a historical trajectory. Despite its obvious oversimplifications, Ernst and Seagle's account of this progression surprises me with its insight whenever I return to it.

If one were to continue the history of censorship Ernst began nearly a century ago, it would pass through questions of intellectual property. Rather than obscenity, it is piracy which is the object of contemporary censorship just as obscenity and blasphemy were the objects of a previous age's censorship regime.

We certainly shouldn't ignore other, more traditional, types of censorship which continue to exist (particularly outside of the United States and western Europe); however, if blasphemy was the object of censorship in (what Ernst calls) the "Age of Faith," sedition in the "Age of Divine Right," and obscenity in the "Age of Democracy," piracy is the crime that the censorship regimes of our own information age seek to control. The Index Librorum Prohibitorum and Customs lists of previous eras bear a striking resemblance to the index of prohibited foreign domains that SOPA would create.

Works Cited
