Chris Forster

My title, recalling E. M. Forster’s (no relation, sadly) Two Cheers for Democracy, might be too generous. I can’t imagine mustering more than two cheers for anything. Two is probably the utter limit of my cheering.

Is “Banned Books Week” anachronistic? That’s the claim of this article at Slate. John Overholt, in a single tweet, manages to voice what I think are all the most pressing complaints about such a perspective:

And yet, as someone who spends quite a bit of time trying to think about what early-twentieth century censorship means, there is an important grain of truth in the Slate piece that is worth preserving, even if declarations of “Mission Accomplished” feel premature.

The idea of book banning conjures images of state censorship and book burning, images one can find in abundance, for instance, in Kevin Birmingham’s wonderful account of Ulysses, The Most Dangerous Book; images of court rooms where lawyers make grand appeals to literary value and freedom of expression. (I pass over in silence distinctions between books and literature.) The scene that Michelle Anne Schingler describes at Book Riot, however, is a very different one. Schingler wastes no time in affirming that, “No,” book banning is not simply over. Her examples all concern libraries; and indeed, as Graham contends in her piece at Slate, the most recent cases that the Banned Books website cites all concern attempts to limit or remove books from library collections or school settings.

When Graham declares “we won,” that “book banning” is over, this is indeed (as Overholt suggests) an oversimplification. It imagines a single struggle, which reaches some sort of crisis, and ends. Graham is offering a narrative very close to that recounted by Charles Rembar in his memoir The End of Obscenity. Rembar was the defense attorney during many of the key 1960s obscenity trials in the United States, and his memoir wonderfully charts the erosion of state censorship in the period. Suppression of books on grounds of obscenity, Rembar suggests (I think, rightly), ends after the trials of Lady Chatterley’s Lover, Tropic of Cancer, and Fanny Hill in the United States. Starting in Roth v. United States, and culminating ultimately in the so-called “Miller Test,” American jurisprudence evolves a set of standards that have the effect of ending the censorship of books on the grounds of obscenity. (In English jurisprudence, the 1959 reform of the Obscene Publications Act, which enabled publication of Lawrence’s Lady Chatterley by Penguin, plays the same role as the court cases discussed by Rembar.) After those trials, it has proved essentially impossible for a book to be banned on grounds of obscenity; contract, libel, and copyright all continue to shape cultural production in important ways (the last especially so), but obscenity and its particular brand of state-controlled book burning is indeed over. The Miller standard may justifiably be celebrated as a sort of liberal triumph.

Miller Logo

And boy, do we love to tell this story. Birmingham offers a version in his account of Ulysses; we get a sort of version in movies about Allen Ginsberg’s “Howl”; or in TV movies about the Chatterley trial. Folks love this tale of heroic lawyers fighting on behalf of great works of literature, against philistine puritans: figures like Anthony Comstock or William Joynson-Hicks (more commonly called simply “Jix”). We tell very similar stories about Elvis and his hips, or Lenny Bruce and his comedy: tales where transgression and freedom contend with (usually comically absurd) conservatism. (We even tell a version of this story about dancing in small towns.) It’s usually a narrative of triumph, told by liberal proponents who indeed end by declaring “We won.” And, as history, it is usually an oversimplification. (For one thing, and this is a hobby horse of mine, it tends to remove books from a broader media history which shapes what it means to “ban” a “book.” I have a different story of my own, about the place of literature in the changing media ecology of the long twentieth century… but that’s another tale for another time.)


“Banned Books Week” conflates two narratives, perhaps deliberately. It inserts present instances of political struggle which involve books, particularly (like those described by Schingler) around libraries, into a longer history of book banning. It is, in some ways, a savvy rhetorical move to align parents who want to limit access to particular titles with Anthony Comstock and similar figures (after all, who wants to be this guy?). But this conflation also has the, I think unfortunate, effect of casting contemporary debates about education and the meaning of “the public” as matters of “banning books.” I think it makes more sense to understand attempts to limit access to Toni Morrison’s Beloved, not as a debate about book censorship continuous with the suppression of Ulysses or Lady Chatterley’s Lover, but as part of the same political struggle over how to teach the causes of the American Civil War, or even whether to mention evolution. These are debates about books; but more fundamentally they are debates about education and, more importantly, debates about the public. They ask not, “Should this book be legally available?”, but “Should my children learn this?” or “Should my tax dollars pay for this?” Defending against the active defunding of public goods by appealing to the “freedom to read” seems to me to be a tactic of ambivalent value.

When “Banned Books Week” began in 1982, the heroic age of the struggle against state censorship of books in the United States was already over. In 1982, rather than the State of New York seeking to prevent folks from reading Ulysses, we find the Moral Majority complaining about works like Our Bodies, Ourselves. This concern with women’s sexuality and health is uncannily recalled when, earlier this month, a Knoxville parent complained about the explicit references to women’s bodies in The Immortal Life of Henrietta Lacks. Is this debate about women’s sexual health and knowledge, either in the early 1980s or now, best understood as a debate about books? Or does it have more in common with a history that, at this moment, materializes as an effort to defund Planned Parenthood?

Schingler writes, “Reading only about people our parents and pastors are comfortable with isn’t an education, it’s an echo-chamber.” I agree. Reading is a wonderful and potentially transformative experience. It should be celebrated and defended zealously. But if we find people seeking to limit access to books, we may wonder whether their target is books per se, or something else: public education or women’s health, both of which require a well-funded state. Schingler writes, “Libraries are a marketplace of ideas, and if they’re going to operate in a truly democratic fashion, all ideas should be represented.” Maybe. But the arguments of would-be book-banners are, right now, often couched exactly in market terms: not that this or that book should not be published or legally allowed to circulate, but that my tax dollars shouldn’t have to pay for it. We love the version of this conflict which is a struggle between freedom and censorship; but the conflict today is precisely one which takes place through appeals to market values. It is a conflict not between freedom and suppression, but over what to fund and according to what criteria. The real argument today seems less about “freedom” than about our willingness to fund and maintain a robust sense of “public goods.”

As a matter of rhetoric and political tactics, it perhaps makes sense to throw the weight of a long historical struggle against state censorship behind our own moment of squabbles in local school boards or funding lines in state budgets. We should be careful, though, that such rhetoric doesn’t lead us to mistake a fight about public education or women’s health or the rights of queer people for the right to read. Indeed, if we could add a little nuance and history to our sense of the long struggle to publish controversial books, we might even realize that the history of books and their banning is already replete with lessons for these distinct, but not unrelated, struggles (see, for instance, the case of The Well of Loneliness and its banning in England).

So, two cheers for Banned Books Week and for all efforts to protect the freedom to read. The fullest possible access to the textual record is indeed a public good worthy of our time, attention, and dedication. It is not, though, the only good; it may not, at this moment, even be the most pressing one.

[This post, featuring too many graphs, was created with knitr. You can see the source that generated those graphs, and the rest of the post, here. Update: Many thanks to Lincoln Mullen, who is not only the maintainer and one of the authors of the gender package, but who also noted that I was (inaccurately) using publication dates, rather than author birthdates, when inferring the gender of names; he suggested a smart way of approximating author birth dates (used below), and also drew my attention to the napp dataset, accessible by the gender package. In light of his comments I made a few changes, re-ran the data, and have updated this post. The plots are new; I’ve added a little new text as well which, like this, is in dark blue. 9/11/15]

“there is no such thing as distinguishing men from women.”
The Critical Review, 1777

I’ve been tinkering with the HathiTrust dataset that Ted Underwood and HathiTrust released last month. (Some links: [The Dataset]; [Ted Underwood’s Discussion of It]; [My Previous Exploration of It, Mostly Using R].) The thorniest questions I’ve encountered concern how to handle/understand volumes and titles which occur in the dataset more than once (and some related issues: multivolume works, etc.). I’ll try to write a quick post about those issues in the future. For now, let’s look at other ways we might explore this dataset.

Examining the Gender of Authorship in HathiTrust Summary Metadata

I find the metadata fascinating in a way that the actual data (the word frequencies for each volume) is not. Let’s consider, for example, the relationship between authorship in the dataset and gender. Authorial gender is not one of the included metadata fields, but we might try to examine it by using the gender package for R. This package uses a variety of historical sources (Social Security data, US Census data, as well as some other sources) to intelligently infer gender based on first names (since the package relies on largely US data, it might not be as successful in predicting gender of names in other Anglophone countries, including England, an issue I note here, parenthetically, and then ignore). The package can also use napp data, covering Canada, the United Kingdom, Germany, Iceland, Norway, and Sweden, from the years 1758 to 1910. (This web app nicely illustrates what the package does.) Here, for instance, is what you get when you load the package and infer the likely gender of the name George for someone born in 1819:


Source: local data frame [1 x 6]

    name proportion_male proportion_female gender year_min year_max
1 george               1                 0   male     1819     1819

This sort of inference based on first names is obviously imperfect; there are cases where, for a variety of reasons, the prediction will be wrong. The suggestion that in 1819 the name George belongs to a man may be very wrong indeed if that particular George is the author of Middlemarch. For certain purposes, however, that misattribution may be exactly what we’re interested in. The simplicity of the approach can be a strength. The package, in most cases, will make the same inference (even the same incorrect inference) about the gender of a name as a reader would. This makes it ideal if you’re interested in how readers understood the authorship of the books they were reading, or how (perceived) authorial gender shaped the market for literature. (Before we start taking this too seriously: Francis Beaumont’s name is detected as female (understandable, but wrong), as is Oliver Goldsmith’s (?!?!) in some years, e.g. 1792.)

Such a summary of the dataset is very different than, say, using the gender information inferred about a volume based on its author’s first name to train a classifier on the volume-level word counts. Such a classifier could be used on texts from outside the dataset, or on texts within the dataset where an author’s gender is unknown. One might try to use it to test Virginia Woolf’s “guess that Anon, who wrote so many poems without signing them, was often a woman” (49). That sort of project, however, which would attempt to link “gender” (and in this case what exactly that word means becomes rather pressing) not simply to a name in a metadata field, but to a vocabulary (or some other representation of language use), would begin to encounter the thornier theoretical/methodological questions that I am so happy to skirt past here.

Hewing to this more modest, and (I think) less theoretically fraught, goal of understanding the makeup of the dataset, I used the gender package to infer the gender of the author of each volume in the three HathiTrust datasets. To maximize recognition I used a somewhat heterodox method. Since the package expects birthdates, I subtracted 50 and 30 years from the publication date and passed the resulting range of plausible birth years to the package (this was Lincoln’s, I think very reasonable, suggestion). I queried first against the napp data; if this returned no result I tried the ipums census data. In both cases, I massaged the dates so that if they were out of range, I checked against the earliest available date (a historically imprecise result strikes me as better than no result at all). So to every row in the metadata summary files I added a column for gender, which represented the result of applying this idiosyncratic use of the gender function to the author’s first name. (For my purposes a “first name” is the first word after the comma in the author field; there would be better ways to do this.) (This process was actually rather time consuming. Hint: mclapply is your friend, as are virtualized servers you can let hum away for hours. Lincoln notes that the package allows you to pass a vector of names to the package; this makes the process more efficient for large datasets, particularly when names are repeated. I nevertheless did it the (seriously) less efficient way, in part because I had written the code for an earlier version of the gender package, and in part because my odd use of two methods to try to find data complicates matters.)
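The date handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the post (the inference itself ran in R): the function names are hypothetical, and the way a partly out-of-range span is trimmed is my guess at what "massaging" the dates might involve.

```python
def birth_year_range(pub_year, min_age=30, max_age=50):
    """Approximate an author's birth years from a publication year."""
    return (pub_year - max_age, pub_year - min_age)


def clamp_to_earliest(years, earliest):
    """Move any year that precedes a source's coverage up to its first year.

    Assumption: a span lying wholly before the coverage collapses to the
    earliest available year; a span that only partly precedes it is trimmed.
    """
    low, high = years
    return (max(low, earliest), max(high, earliest))


# First year of the napp data, as stated in the post.
NAPP_EARLIEST = 1758

print(clamp_to_earliest(birth_year_range(1872), NAPP_EARLIEST))  # (1822, 1842)
print(clamp_to_earliest(birth_year_range(1780), NAPP_EARLIEST))  # (1758, 1758)
```

The second call shows the trade-off named in the text: a volume published in 1780 is checked against 1758 names, which is historically imprecise but returns something.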

I then tallied up the number of works by men and women for each genre. (I did the tallying with Python; you can find the results of those tallies as CSVs here.) In addition to male and female, there are two other categories here: missing means a name was not provided (or, more precisely, was not detected by my script) in the HathiTrust metadata; undetected means that the gender package had no value for the “name” it was given (or, more precisely, whatever string it received from how I parsed the name). That is, missing means no name and undetected means that gender had no association for the name. (There are also columns for each of these values normalized by the number of volumes in the dataset for that year.) Without further ado, three area graphs representing the gender breakdown of authorship in each of the HathiTrust datasets (fiction, poetry, drama).
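A minimal sketch of what such a tally might look like, assuming each volume has already been reduced to a (year, category) pair. The function name and the sample rows are made up for illustration and are not taken from the tallying script or the dataset.

```python
from collections import Counter, defaultdict

CATEGORIES = ("male", "female", "missing", "undetected")


def tally_by_year(rows):
    """Count volumes per year and category, with per-year proportions."""
    counts = defaultdict(Counter)
    for year, category in rows:
        counts[year][category] += 1

    table = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        # Each cell holds (raw count, count normalized by that year's volumes).
        table[year] = {c: (counter[c], counter[c] / total) for c in CATEGORIES}
    return table


# Made-up rows for illustration only.
rows = [(1810, "female"), (1810, "female"), (1810, "male"), (1810, "missing")]
print(tally_by_year(rows)[1810]["female"])  # (2, 0.5)
```

Keeping missing and undetected as separate categories, rather than folding them together, is what makes it possible to ask later what is in each slice.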

If you right click and open each graph in another tab, they should be a bit bigger.


The data before 1800 is sparse and so these graphs look a little volatile. The prevalence of missing and undetected in the fiction data before 1800, however, may reflect the lack of attribution common in the late eighteenth century. “Over 80 per cent of all novel titles published in the 1770s and 1780s were published anonymously,” James Raven claims in the introduction to the first volume of the two-volume The English Novel 1770–1829: A Bibliographical Survey of Prose Fiction Published in the British Isles (41). (I’ll abbreviate that BSPF for the rest of the post.)

In a tweet, Heather Froehlich asks, “What’s in those slices of undetected and missing texts.” Looking at the amended metadata file, it looks like there are ~11,000 records with either missing or undetected gender (that’s ~10% of the dataset). The most frequently occurring titles in the missing data are:

The New British novelist;
The British novelists,
Stories by American authors,
The Harvard classics shelf of fiction,
The German classics of the nineteenth and twentieth centuries,
The International library of famous literature,
The lady of the manor,
The book of the thousand nights and one night,
The book of the thousand nights and a night,
The thousand and one nights,
The Odyssey of Homer,
Stories by English authors,
The Bibliophile library of literature, art and rare manuscripts,
The masterpiece library of short stories,
Florence Macarthy,

So why is the author missing from these? Checking the full records, the most frequently occurring items in this list are multivolume collections of other works. The New British Novelists lists no author in the dataset’s author metadata; the title appears in the fiction dataset 50 times. Checking the original page images, we see that this is a series, published starting in 1820, which collects different novels by major British novelists. It includes a range of major novels, many themselves multivolume works: Clarissa, Robinson Crusoe, Humphrey Clinker, and so on. The Harvard Classics Shelf of Fiction appears to be a similar case. Is there an existing literature on these sorts of collections and their role in reputation creation/maintenance? In the earlier period, there are titles (like The Infernal Wanderer) which simply lack an author; others (like Turkish Tales) lack an author in the dataset, but currently have one in HathiTrust (perhaps because this record has been updated since the dataset was exported); and quite a few don’t meet my naming convention. Works by Phalaris, [Madame d’] Aulnoy, [Mssr.] Scarron, [Mrs.] Manley, Voltaire, Virgil, and many others are “missing” because when I try to splice ‘em up (relying on a comma to separate first and last names), we get nothing. Some of these authors were referred to simply by a last name and title (Mrs. Manley) and this has entered the dataset as simply Manley.
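To make the failure concrete, here is a small Python sketch of the comma-based convention (the first word after the comma in the author field). It is a reconstruction from that description rather than the original script; the function name, the lowercasing, and the punctuation stripping are my assumptions.

```python
def first_name(author_field):
    """Return the first word after the comma, lowercased, or None.

    None corresponds to the "missing" category: no comma, no first name.
    """
    if not author_field or "," not in author_field:
        return None
    words = author_field.split(",", 1)[1].split()
    if not words:
        return None
    # Assumption: trailing punctuation is stripped before lookup.
    name = words[0].strip(".,;").lower()
    return name or None


for author in ["Eliot, George", "Manley", "Orczy, Emmuska Orczy", ""]:
    print(repr(author), "->", first_name(author))
```

An author field of just Manley has no comma to split on, so it comes back empty and lands in missing, while Orczy, Emmuska Orczy yields a name that the lookup then has to recognize.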

In the undetected data, the most frequently occurring names are:

Bjørnson, Bjørnstjerne
Dostoyevsky, Fyodor
Orczy, Emmuska Orczy
Burgess, Gelett
Cullum, Ridgwell
Hearn, Lafcadio
Watanna, Onoto
MacManus, Seumas
Tagore, Rabindranath
Hemyng, Bracebridge
Gordon-Cumming, Roualeyn
Ritchie, Leitch

A look at the names is enough to guess why gender likely had a problem with them. (There are sufficiently few names here (321 unique individual, undetected names) that I am half tempted to put together a manual reconciliation for names and genders.) It also provides a clear illustration of the implicit cultural construction of “data.” These “undetected” names are largely non-Anglophone names, and so the attempt to infer one culturally mediated category (gender) gets complicated by the complexities of another one (nationality). Names that are undetected are not randomly distributed through the data but are disproportionately non-Anglophone.

To more clearly see the trends, let’s look at works published under names that we have identified as female across genres; first raw counts and then as a proportion of all works published per year.


The second graph is the interesting one. Among the genres, female authors are best represented in fiction, and least well-represented in drama. The trend in fiction, however, is odd—while poetry and drama show upward trends (poetry’s is slow and steady across the 19th century; drama’s rather sudden after 1900), fiction has a high point in the early nineteenth century where women represent a larger proportion of fiction writers than anywhere else in this data. At times, early in the data, half of the works of fiction in the dataset are written by a woman (more on this figure below). Yet, over the course of the nineteenth century this proportion diminishes. When the graph ends in 1922, women represent about a quarter of the authors of each of the three genres.

On Twitter, I suggested that in the normalized data for fiction by women above, one sees a decline in works by women. This may be consistent with the BSPF data (which, in its admittedly narrower slice, shows a decline from 1815 to 1830). To get some sense of whether that’s a fair description, let’s isolate the fiction by women data, and add a rolling mean with a window of 5.


At some point, one is reading Rorschach plots; but this plot seems to suggest two periods of downward trends, from 1805–1830 and then again from about 1885 to 1900. (That precipitous drop at the end is a function of the rolling average running out of data.)
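For readers who want to see why a rolling mean misbehaves at the ends of a series, here is a minimal Python sketch of a centered window of 5. It is an illustration, not the code behind the plot, and it makes one choice of its own: positions where the window would run past the data are left empty (None) rather than filled in.

```python
def rolling_mean(values, window=5):
    """Centered rolling mean; None where the window runs off the data."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window sizes")
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            out.append(None)  # not enough neighbors on one side
        else:
            chunk = values[i - half : i + half + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up values for illustration only.
print(rolling_mean([1, 2, 3, 4, 5, 6, 7]))
# [None, None, 3.0, 4.0, 5.0, None, None]
```

With a window of 5, the first two and last two years have no complete window. Any method that pads those positions instead of leaving them empty can produce an artificial rise or drop at the edges.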

Data from The English Novel: A Bibliographical Survey of Prose Fiction, 1770-1830

To get some sense of how reasonable these trendlines look, we might try to compare them to another source. I’ve already quoted the BSPF, which offers a portrait of the authorship of novels between 1770 and 1830. The BSPF has totals based on both what is stated on title pages and in prefaces, as well as more comprehensive totals based on what the editors were able to infer about the authorship of works from other sources. (For instance, if a work states that it is “By the author of Waverley,” one can make additional inferences about the author’s gender.) There turns out to be a significant discrepancy between what a title page, or preface, states, and what one may be able to infer about the gender of an author with just a little more knowledge. The majority of novels in this period were published without a clear statement of authorship. But if we look at the more comprehensive portrait of authorship that the BSPF offers, the story is a little different.


The graph above summarizes the trends in the inferred data. It has three distinct moments—a predominance of “anonymous” or unattributed works until around 1800; the predominance of women writers during the first decades of the nineteenth century, and concluding with what Peter Garside calls “the male invasion of mainstream fiction” (2:63). Garside notes, for instance, “the publication of Jane Austen’s novels was achieved not against the grain but during a period of female ascendancy” (2:75). This data suggests that authors of novels were most likely to be, in this order, anonymous, women, and then men.

The three waves visible in the graph above, however, are based on the inferences that the editors of the BSPF made to ascertain the gender of the authors in their bibliography. The metadata available on title pages (of the sort that’s compiled in the HT metadata) often lacks information that might otherwise be available to most readers.

Occasionally, full author names are found within a novel—as in a signed Preface, or through the inclusion of an engraved portrait or additional title-page—when the main title-page offers no direct authorial description. Augusta Ann Hirst’s Helen; or Domestic Occurences (1807:28), for example, carries only the bare title on its title-page, though the full author’s name appears immediately afterwards in a Dedication to the Countess Fitzwilliam, and the author’s name later featured directly on the title-page in the Minerva reissue of 1808. (2:68)

HathiTrust has a copy of Helen, or, Domestic Occurrences: A Tale (though it is not included in the fiction dataset). And indeed its title page lacks the author’s name, though one can discover it in the dedication.

Title Page, and End of Dedication from *Helen*

Through the magic of librarians, the HathiTrust record, however, includes the correct author and even notes that its “Dedication signed.”

Looking only at what one can infer about the authorial gender of works from the information available on the title page, most works would be “anonymous,” even if (some) contemporary readers may have been able to see through that anonymity. Note the difference between the trends in authorship when we look only at information available from examining “proper names from title-pages and prefaces only” and the inferred conclusion (all this data is taken from the wonderfully comprehensive BSPF).


The inferred trends for both male and female authorship are significantly higher than their stated counterparts (these terms, inferred and stated are my clumsy language; for anyone interested, The English Novel, 1770–1830 really is an invaluable, if imposingly weighty, resource). There are perhaps two interesting trends here. The decrease in anonymous authorship at the start of the nineteenth century coincides with a rise in female authorship; female authorship is more public than its male counterpart. After 1820 one sees a sharp rise in male authorship—which is itself a rise in anonymous male authorship.

Comparing HathiTrust and BSPF

Using the method described above to infer authorship in the HathiTrust dataset should produce results similar to the raw, stated data in the BSPF. There are, though, a few differences to account for first. For one, James Raven’s and Peter Garside’s introductions to the two volumes of the Bibliographical Survey of Prose Fiction offer summary counts of “New Novels” but the HathiTrust data represents books owned by libraries. To be able to compare the BSPF data with the HT data, we need to eliminate reprints (we only want new novels) and we need to count works, not books (so, multivolume works should be counted as a single work). I’ve tried to do this rather crudely by creating for each work in the HT fiction dataset an “ID” which consists only of a work’s title and its author. (Using title alone as an ID could, in theory, lead to a problem if two works have the same title, which is actually quite common for multivolume sets like The Novels of Walter Scott, The Novels of Charles Dickens, and similar.) My script loops over the works in the metadata summary, counting a work as “new” only if we haven’t seen its ID before. Because we look only at title and author (and not enumcron), we also only count one volume from a multivolume work (though, as I mention above, this problem is quite a bit thornier than I’m allowing here).
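The loop described above might look something like the following Python sketch. It is a reconstruction under assumptions, not the script itself: the function name is hypothetical, the records are processed in date order so that the earliest appearance counts as the "new" one, and titles and authors are lowercased before comparison. The sample records are illustrative only.

```python
def count_new_works(records):
    """Count each (title, author) ID once, in the year it first appears."""
    seen = set()
    new_per_year = {}
    # Assumption: process in date order so the earliest copy is the "new" one.
    for rec in sorted(records, key=lambda r: r["date"]):
        work_id = (rec["title"].strip().lower(), rec["author"].strip().lower())
        if work_id in seen:
            continue  # a reprint, or another volume of a work already counted
        seen.add(work_id)
        new_per_year[rec["date"]] = new_per_year.get(rec["date"], 0) + 1
    return new_per_year


# Illustrative records, not rows from the dataset.
records = [
    {"title": "Waverley", "author": "Scott, Walter", "date": 1814},
    {"title": "Waverley", "author": "Scott, Walter", "date": 1814},  # second volume
    {"title": "Waverley", "author": "Scott, Walter", "date": 1829},  # reprint
    {"title": "Emma", "author": "Austen, Jane", "date": 1816},
]
print(count_new_works(records))  # {1814: 1, 1816: 1}
```

The crudeness is visible here too: a collected edition whose title and author fields differ slightly from the first edition would be counted as a new work.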

Second complication: geography: the HT dataset is culled from American libraries, whereas the BSPF data is focused on works published in “the British Isles.” Well, that raises an interesting question (digression ahead!): where were fiction volumes in the HathiTrust dataset published?


As this graph makes clear, most of the works in the HathiTrust dataset were published in 5 places (heck, many were published in one place). Those labels along the x-axis are MARC country codes; so the top publication locations are: New York State (nyu), England (enk), Massachusetts (mau), No place/Unknown (xx), Pennsylvania (pau), Illinois (ilu), Scotland (stk), Germany (gw). This summary, however, represents the entire HT fiction dataset, from 1700 to 1922. Let’s look at just the portion covered by the BSPF, 1770 to 1830:

For this period the top two locations are England and Scotland. It seems unlikely, therefore, that any differences between the BSPF and the HT datasets could be attributed to the different geographical coverage of the two datasets. But, just to be sure, let’s extract only the works from the fiction dataset published in England, Scotland, and Ireland between 1770 and 1830, and compare the gender breakdown one last time.

To create this subset of the HT summary metadata, I’ve used some Python that tries to more closely match the parameters of the BSPF data: it covers only works published between 1770 and 1830, published in England, Scotland, or Ireland, and it tries to represent only “new works.” The Python that did this is here; the summary of the data is here.
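The date and place part of that filter could be sketched as follows. This is not the linked script, only a guess at its shape: the field name place, the helper name, and the use of ie as the MARC country code for Ireland are assumptions (enk and stk are the codes for England and Scotland given above). The sample records are illustrative.

```python
# MARC country codes: enk and stk appear in the post; "ie" for Ireland is
# an assumption worth checking against the MARC code list.
BRITISH_ISLES = {"enk", "stk", "ie"}


def in_bspf_scope(rec, first=1770, last=1830, places=BRITISH_ISLES):
    """True if a record falls inside the BSPF's years and places."""
    return first <= rec["date"] <= last and rec["place"].strip() in places


# Illustrative records, not rows from the dataset.
records = [
    {"title": "Emma", "date": 1816, "place": "enk"},
    {"title": "The Spy", "date": 1821, "place": "nyu"},
    {"title": "Middlemarch", "date": 1872, "place": "stk"},
]
subset = [r for r in records if in_bspf_scope(r)]
print([r["title"] for r in subset])  # ['Emma']
```

Reducing the result to "new works" would then be a separate pass over this subset.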


We can get some sense of how the HT data compares to BSPF by plotting them together.


The data for female authorship in the two datasets (or rather, in the BSPF data and my weird manipulation of the HathiTrust data) seems, to my layman’s eye, surprisingly consistent. Of course, recalling the difference (often of between 10 and 20 percentage points) between authorial gender as determined by consulting title pages/prefaces and what the BSPF editors were able to infer, one might suggest (at least for the period 1770–1830) that the summary I offered above significantly underrepresents female authorship.

The data for male and anonymous authorship is much less consistent: BSPF reports more anonymous texts than my analysis of the HT metadata, while the HT data reports more male writers. I basically don’t understand why this would be so; I would have expected, if anything, the opposite. The anonymous line for the HT data in the above graph combines both missing authors and undetected, treating as anonymous anything that couldn’t be coaxed into another category; if anything, it should overrepresent anonymous writers. Perhaps this reflects something about the underlying data; or perhaps something about the way I carved up first names. For now, I just don’t know. So, here ends our amble through the data.

Works Cited

Woolf, Virginia. A Room of One’s Own.

Raven, James et al. The English Novel 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles. 2 vols. New York: Oxford University Press, 2000. Print.

I’ve been trying to think intelligently about the place of quantitative data in literary studies, especially in light of two excellent posts, one by Andrew Goldstone, the other by Tressie McMillan Cottom, both responding to this review by Ben Merriman.

But before I could even try to say something interesting in response, Ted Underwood announced that he was making available “a dataset for distant-reading literature in English, 1700-1922” (here is a link to the data). This post is a look at that data, mostly using R. I have, essentially, nothing thoughtful to offer in this post; instead, this is an exploration of this dataset (many, many thanks to Ted Underwood and HathiTrust for this fascinating bounty), studded with some anticlimaxes in the form of graphs that do little beyond give a sense of how one could begin to think about this dataset.

With the exception of a bash script (which may, though, be the most repurposable bit of code), everything here is done in R. I don’t like R, and I’m not very good with it (I think R’s datatypes are what make it a challenge; lists in particular seem to materialize out of nowhere and are frustrating to use), but it is great for making pretty graphs and getting an initial handle on a bunch of data. I try to comment on, and explain, the code below (often in comments), though if you’ve never looked at R, this may seem really weird. I also may have made some horrible mistakes; if so, please let me know.

The New HathiTrust Data Set

Underwood calls this dataset “an easier place to start with English-language literature” within the HathiTrust dataset. I had poked around the HathiTrust data before, and it really is a very complicated undertaking. This dataset that Underwood has provided makes this much much easier.

The data can be downloaded here. In this post I’ll look at the fiction metadata, and take a peek at the fiction word counts for the years 1915–1919. Those files look something like this:

  • fiction_metadata.csv: 17 megabytes, containing author, title, date, and place for each work of fiction. It also includes subjects, an id for HathiTrust (htid), and other fields.

  • fiction_yearly_summary.csv: 35 megabytes, containing token frequencies. The first 20 lines look like this.

  • In a directory I uncompressed fiction_1915-1919.tar.gz. The result is 8656 files, each representing a single work, and totalling 827 megabytes. (827 megabytes of text is not “big data,” but it is enough to make toying with it on your laptop at times a little tricky.)

Examining the Metadata: Volumes of Fiction Per Year

So, let’s begin by loading our plotting library (ggplot2) and the fiction metadata from fiction_metadata.csv.

# Load our plotting library
library(ggplot2)

# Load the metadata from the CSV file
fiction.metadata <- read.csv('fiction_metadata.csv', header=TRUE)

# Let's look at how many items we have for each date.
ggplot(fiction.metadata) +
  geom_histogram(aes(date), binwidth=1) +
  ggtitle('Books per Year in Fiction Dataset') +
  xlab('Year') +
  ylab('Number of Books Per Year in Fiction Data')

Bar Plot of Works of Fiction Per Year in HathiTrust Dataset

This gives a sense of just how few books from before 1800 are in this dataset.

nrow(fiction.metadata)
   [1] 101948

nrow(subset(fiction.metadata, date < 1800))
   [1] 1129

That is, 101948 volumes total, 1129 of which were published prior to 1800, or about 1%. The number of volumes appearing in the dataset per year tends to increase steadily, with a few exceptions. That dip around 1861-1864 may be a result of particularly American factors influencing the dataset; and perhaps it is war again that accounts for some of the dip at this period’s end, though that dip seems to begin prior to 1914.

Examining the Metadata: Change in Length of Volumes Over Time

The length of each volume is contained in the totalpages field in the metadata file. Let’s plot the length of works of fiction over time (so, plot date by totalpages).

ggplot(fiction.metadata, aes(date, totalpages)) +
  geom_point(pch='.', alpha=0.1, color='blue') +
  ggtitle('Length of Books by Year') +
  xlab('Year') +
  ylab('Length of Book, in Pages')

Not Especially Legible Plot of Length of Works of Fiction Over Time in HathiTrust Dataset

Interesting. It seems that, in the mid-eighteenth century near the dawn of the novel, works of fiction were around 300 pages long. Their length diversified over the course of the novel’s history, as novels grew both longer and shorter as the possibilities for fiction widened, perhaps as a function of increased readership stemming from both the decreasing cost of books and the increasing rate of literacy.

Well, not really. Matthew Lincoln has a very nice post about the dangers of constructing a “just-so” story (often to insist that this graph tells us “nothing new”). But there are at least two problems with the interpretation offered above: one broad and one more specific. Broadly, it is worth reiterating the danger of mistaking this data for an unproblematic representation of any particular historical phenomenon (say, especially, readership of novels). Underwood describes the dataset carefully as representing works held by “American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).” And, of course, lots of other things which would be relevant to an investigation of fiction (think of pulp paperbacks and similar forms) will not be in that sample, because they were often not collected by libraries. (Likewise, as Underwood notes, pre-1800 books are more likely to be held in Special Collections, and therefore not digitized.)

The second point is specific to the graph above. That scatter plot is sparse in the early half of this period and very dense in the latter half. The translucency of each point (set by alpha=0.1) captures some of this, but nevertheless the graph as a whole overemphasizes the increased spread of data, when really what is happening is an increase in the amount of data. If we plot things differently, I think this becomes evident. Let’s break down our data by decade, and then do a box plot per decade of fiction length:

# This helper function will convert a year into a "decade"
# through some simple division and then return the decade
# as a "factor" (an R data-type).
as.Decade <- function(year) {
  decade <- (as.numeric(year)%/%10)*10
  return(as.factor(decade))
}

# Add a "decade" column by applying our as.Decade function 
# to the data. (The unlist function... is because lapply returns
# a list, and I'm not very good at R, so that's how I got it to work.)
fiction.metadata$decade <- unlist(lapply(fiction.metadata$date, as.Decade))

# Box plot of our length data, grouped by decade
ggplot(fiction.metadata, aes(decade, totalpages)) +
  geom_boxplot() +
  ggtitle('Length of Books, Grouped by Decades') +
  xlab('Decade') +
  ylab('Length of Books, in Pages')

Less Misleading Plot of Length Across Time in HathiTrust Fiction Dataset

This plot confirms that, indeed, we see a greater range in the lengths of works of fiction (so my inference from the previous graph is not completely wrong). But a box plot clarifies what is, to me, a surprising constancy in the length of the works collected in this dataset. The apparent increase in variability in length is real—but it is not the most, or the only, salient feature of this data; this fact is better captured in the second graph (the box plot).
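As a rough illustration of what the box plot summarizes, here is a minimal sketch in base R. The page counts are made up for the example (they are not from the dataset), and to_decade is a hypothetical, vectorized stand-in for the helper above, so no lapply or unlist is needed.

```r
# Toy data standing in for the date and totalpages columns
# (these values are invented for illustration only).
books <- data.frame(
  date       = c(1791, 1799, 1803, 1808, 1915, 1917, 1919),
  totalpages = c(310, 290, 250, 350, 120, 300, 480)
)

# Integer division works on whole vectors at once.
to_decade <- function(year) (as.numeric(year) %/% 10) * 10

# Store the decade as a factor so it can be used as a grouping variable.
books$decade <- factor(to_decade(books$date))

# Median length per decade: the middle line of each box in a box plot.
median_pages <- tapply(books$totalpages, books$decade, median)

# Spread per decade: the range grows even when the median does not move.
range_pages <- tapply(books$totalpages, books$decade,
                      function(p) max(p) - min(p))

print(median_pages)
print(range_pages)
```

In this toy example the median stays put while the range widens, which is the same pattern the box plot makes visible and the scatter plot obscures.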

Summary: Frequently Occurring Terms

The file fiction_yearly_summary.csv contains the per-year frequencies of the top 10,000 most frequently occurring tokens in the fiction dataset. We can chart the fluctuations of a term’s use, for instance, across the period.

# Load our data
yearly.summary <- read.csv('fiction_yearly_summary.csv')

# Extract some meaningful bit, say, occurrences of `love`
love <- subset(yearly.summary, yearly.summary$word=='love')

# Plot it (year on the x axis, raw count on the y axis)
ggplot(love, aes(x=year, y=termfreq)) +
  geom_line() +
  xlab('Year') + ylab("Occurrences of token 'love'") +
  ggtitle('"Love" in the Dataset')

Unnormalized Occurrences of the Term 'Love' in Dataset

Yet, of course, looking at that sharp rise, we quickly realize (yet again) the importance of normalization. We are not witnessing the explosion of love at the dawn of the twentieth century (and its nearly as rapid declension). We could normalize by adding all the words together, but we only have counts for the top 10,000 words. Thankfully, the dataset offers “three special tokens for each year: #ALLTOKENS counts all the tokens in each year, including numbers and punctuation; #ALPHABETIC only counts alphabetic tokens; and #DICTIONARYWORD counts all the tokens that were found in an English dictionary.”

So, let’s normalize by using #DICTIONARYWORD.

# Let's extract the DICTIONARYWORD tokens into a data frame
yearly.total <- subset(yearly.summary, yearly.summary$word=='#DICTIONARYWORD')

# Let's simplify this dataframe to just what we're interested in.
yearly.total <- yearly.total[c('year','termfreq')]

# And rename the termfreq column to "total"
colnames(yearly.total) <- c('year','total')

# Now we can use merge to combine this data, giving each row 
# a column that contains the total number of (dictionary words)
# for that year. 
love.normalized <- merge(love, yearly.total, by=c('year'))

# This method profligately repeats data; but it makes things 
# easier. The result looks like this:
>   year word termfreq correctionapplied  total
> 1 1701 love      222                 0  37234
> 2 1702 love        1                 0   7036
> 3 1703 love      524                 0 416126
> 4 1706 love       12                 0  36501
> 5 1708 love      578                 0 482779
> 6 1709 love      361                 0 133847

# Now, graph the data (count divided by that year's total)
ggplot(love.normalized, aes(x=year, y=termfreq/total)) +
  geom_line() +
  xlab('Year') +
  ylab('Normalized Frequency of "love"') +
  ggtitle('The Fate of Love')

Normalized Plot of 'Love' in the Dataset
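For anyone who wants to see the normalization step in isolation, here is a minimal sketch in base R. It uses the first three rows of the output shown above; the names totals and relfreq are my own, introduced only for this illustration.

```r
# The first three rows of the merged output above, split back
# into the two tables that merge() combines.
love <- data.frame(year     = c(1701, 1702, 1703),
                   word     = "love",
                   termfreq = c(222, 1, 524))
totals <- data.frame(year  = c(1701, 1702, 1703),
                     total = c(37234, 7036, 416126))

# merge() matches rows on the shared 'year' column.
love.norm <- merge(love, totals, by = "year")

# Relative frequency: occurrences of the term per dictionary word.
love.norm$relfreq <- love.norm$termfreq / love.norm$total

print(love.norm)
```

Note that 1703 has more raw occurrences of "love" than 1701 (524 against 222) but a lower relative frequency, because the 1703 total is so much larger. That reversal is the whole reason to normalize.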

Well, that looks about right. Just for fun, let’s try a different term, one that is something less of an ever-fixed mark, but which perhaps alters its relative frequency when it historical alteration finds.

# We subset the term we're interested in.
america <- subset(yearly.summary, yearly.summary$word=='america')
# And normalize using our already-constructed 
# data frame.
america.normalized <- merge(america, yearly.total, by=c('year'))

# Plot as before, though this time we'll use geom_smooth() 
# as well to add a quick "smooth" fit line to get a sense of 
# the trend. Minor digression: things like geom_smooth() are one 
# of the things that make R great (if very dangerous) for an 
# utter amateur.
ggplot(america.normalized, aes(x=year, y=termfreq/total)) +
  geom_line() +
  geom_smooth() +
  xlab('Year') +
  ylab('Normalized Frequency of "america"') +
  ggtitle("Occurrences of 'america' in the Dataset")

Occurrences of 'america' in the Dataset

Not sure there’s much surprising here, but okay, seems reasonablish.

Extracting Counts from Individual Volume Files

Now, what if you want to look at terms that don’t occur in the top 10,000? Then you need to dig into the files for individual volumes. For simplicity’s sake, I’ll look only at one set of those files, representing volumes of fiction between 1915 and 1919, which I’ve uncompressed in a subdirectory called fiction_1915-1919.

I’ve been using R for everything so far, and I imagine you could use R to loop over the files in the directory, open them up and look for a specified term. As someone who finds R idiosyncratic to the point of excruciation, this doesn’t sound particularly fun. R is great when you’re manipulating/plotting data frames, less so when doing more complicated tasks on the filesystem. So, to extract the information we want, I’ll use a simple bash script.


#!/bin/bash

# Our input directory
INPUTDIRECTORY=fiction_1915-1919

# Let's take a single command line argument ($1) and store it
# as the value we're looking for (the proverbial needle in our
# data haystack).
NEEDLE=$1

# We use this convention, with find and while read 
# because a simple for loop, or ls, might have a problem
# with ~10000 files. (-type f skips the directory itself.)
find "$INPUTDIRECTORY" -type f | while read -r file
do
    # For each file, we use grep to search for our term,
    # storing just the number of occurrences in result.
    result=$(grep -w -m 1 -- "$NEEDLE" "$file" | awk '{ print $2 }')
    # Get the htid of the file we're looking at from the filename
    id=$(basename "$file" .tsv)
    # And then print the result to the screen
    echo "$id,$result"
done

I’m assuming some familiarity with bash scripts. Save this script to a file (say, wordcounter.bash), make it executable (it’s enough to type chmod +x wordcounter.bash), and then run it with an argument: ./wordcounter.bash positivism and it will output to the screen; redirect that to a csv (type ./wordcounter.bash positivism > positivism.csv) and you can use it in R. Here is what the results look like when they start appearing on the screen:


Those gibberish-looking strings (bc.ark+=13950=tk19k4r10s) are HathiTrust IDs. Then you get a comma, and after the comma the number of times the term appeared in the file… unless it didn’t appear, in which case you just get a blank.

Some Notes

This will only work on unixy systems—Linux, OSX, or (I assume) cygwin on Windows.

When a token does not appear in a file, this script outputs the htid, a comma, and then nothing. That’s fine: it’s easier to handle this after we’ve imported the resulting csv (to, say, R) than it would have been to write some logic in this script to output 0. Also, this crude method is probably faster than doing it within R or Python and is certainly not slower. It could be sped up by doing something fancy, like parallelization. To search through the 8656 files of fiction_1915-1919 for one term took 1 minute and 12 seconds, a totally manageable timeframe. Assuming that rate (processing, say, 120 files/second) is roughly constant across the dataset of roughly 180,000 volumes, it should be possible to use this method to search for a term across all the volumes in the dataset in roughly 25 minutes, give or take. That is, of course, based on doing this on my laptop (with a 1.8GHz Core i5 CPU), with no parallelization (though this should be an eminently parallelizable task, like really). Not fast, but totally manageable.
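The back-of-the-envelope estimate above can be checked in a few lines of R. The figures are the ones quoted in the paragraph; the variable names are just labels for this sketch, and the assumption that the rate holds across the whole dataset is exactly that, an assumption.

```r
# Figures quoted above.
files_searched <- 8656      # files in fiction_1915-1919
seconds_taken  <- 72        # 1 minute and 12 seconds
total_volumes  <- 180000    # rough size of the whole dataset

# Throughput on one laptop core, with no parallelization.
files_per_sec <- files_searched / seconds_taken

# Projected time for the whole dataset, assuming a constant rate.
est_minutes <- total_volumes / files_per_sec / 60

print(round(files_per_sec))   # files per second
print(round(est_minutes))     # minutes for the full dataset
```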

Plotting Our Extracted Counts from Individual Volume Files

So, assuming the script works… back to R.

# Input the data culled by our custom bash script
gramophone <- read.csv('gramophone.csv')
film <- read.csv('film.csv')
typewriter <- read.csv('typewriter.csv')

# Remember all those spots where a token doesn't occur, 
# which appear as blanks? Those get read by R as NA 
# values. Here we replace them with zeros.
gramophone[is.na(gramophone)] <- 0
film[is.na(film)] <- 0
typewriter[is.na(typewriter)] <- 0

# Let's rename our columns
colnames(gramophone) <- c('htid','gramophone')
colnames(film) <- c('htid','film')
colnames(typewriter) <- c('htid','typewriter')

# We'll put this data together into one data frame
# for convenience sake.
gft <- merge(gramophone,film,by=c('htid'))
gft <- merge(gft,typewriter,by=c('htid'))
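To make the blank-to-zero step concrete, here is a small self-contained sketch. The htid values are invented for the example, and it reads from a string rather than a file. One assumption worth flagging: the sketch passes header = FALSE, on the theory that the bash script's output has no header line; with read.csv's default (header = TRUE) the first volume would be read as column names rather than as data.

```r
# Three lines in the shape the bash script prints: id, comma, count.
# The second volume never uses the term, so its count is blank.
# (These htids are hypothetical.)
csv_text <- "uc1.aaa,3\nuc1.bbb,\nuc1.ccc,1\n"

# header = FALSE because the script output has no header row;
# col.names supplies the names directly.
film <- read.csv(text = csv_text, header = FALSE,
                 col.names = c("htid", "film"))

# Blank numeric fields are read as NA; replace them with zero.
film$film[is.na(film$film)] <- 0

print(film)
```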

Right now, though, all we have is HathiTrust IDs and frequencies of our term (or terms). We have no information about date, or title. So let’s get that information from the metadata files we’ve worked with earlier.

# From our custom culled data
gramophone <- read.csv('gramophone.csv')
film <- read.csv('film.csv')
typewriter <- read.csv('typewriter.csv')

# All those spots where a token doesn't occur, which produce blank lines
gramophone[is.na(gramophone)] <- 0
film[is.na(film)] <- 0
typewriter[is.na(typewriter)] <- 0

colnames(gramophone) <- c('htid','gramophone')
colnames(film) <- c('htid','film')
colnames(typewriter) <- c('htid','typewriter')

# put it all together with our main metadata data frame
gft <- merge(gramophone,film,by=c('htid'))
gft <- merge(gft,typewriter,by=c('htid'))

# Now get the metadata from fiction_metadata.csv and
# merge based on htid.
fiction.metadata <- read.csv('fiction_metadata.csv', header=TRUE)
gft <- merge(gft, fiction.metadata, by=c('htid'))

# To normalize let's load our annual totals as well. We can
# merge those with our dataframe based on date.

# Get Yearly Totals
yearly.summary <- read.csv('fiction_yearly_summary.csv')
yearly.total <- subset(yearly.summary, yearly.summary$word=='#DICTIONARYWORD')
yearly.total <- yearly.total[c('year','termfreq')]
colnames(yearly.total) <- c('date','total')

# Merge yearly totals with our main dataframe based on date.
gft <- merge(gft, yearly.total, by=c('date'))

# Our dataframe is now 23 columns:
> [1] "date"          "htid"          "gramophone"    "film"         
> [5] "typewriter"    "recordid"      "oclc"          "locnum"       
> [9] "author"        "imprint"       "place"         "enumcron"     
>[13] "subjects"      "title"         "prob80precise" "genrepages"   
>[17] "totalpages"    "englishpct"    "datetype"      "startdate"    
>[21] "enddate"       "imprintdate"   "total"        

# That's not crazy, but to make things easier to understand, 
# let's subset just the data we're interested in right now---say,
# the occurrence of our terms and their date.
gft.simple <- gft[,c('date','gramophone','film','typewriter','total')]

>   date gramophone film typewriter     total
> 1 1915          0    0          0 106553905
> 2 1915          0    1          0 106553905
> 3 1915          0    0          0 106553905
> 4 1915          0    0          0 106553905
> 5 1915          0    0          0 106553905
> 6 1915          0    0          0 106553905
> [1] 8655

Okay, looks good: there are our 8655 volumes, each with date of publication, the occurrences of our three search terms (gramophone, film, and typewriter), and the total number of #DICTIONARYWORD tokens for that year. Note that each row still represents a single volume, but we’ve discarded author, title, htid, etc. We’ve also added the total dictionary words for a volume’s year to each row (note the repeated totals in those first 1915 volumes), which is grossly inefficient. All this, however, is in the interest of simplicity, so that we can easily plot the relative occurrences of our selected terms (here, gramophone, film, and typewriter).

In order to make this data easily plottable, we need some additional R tricks: we need to reformat our data from a “data frame” to a long “data matrix” (using the melt function). Then we can create a stacked bar graph of terms per year. Let’s start by plotting our raw counts.

# Our libraries
library(reshape2) # For melting data.
library(ggplot2)  # For graphing data.

# This next is necessary b/c R throws an error otherwise. 
# Not totally sure why...
gft.simple$date <- as.factor(gft.simple$date)

# Create a "long" format matrix, from our raw counts data.
gft.m <- melt(gft.simple[,c('date','gramophone','film','typewriter')],id.vars='date')

# Create a bar plot of all our values, coded by variable
ggplot(gft.m, aes(factor(date), y=value, fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Year') +
  ylab('Raw Word Occurrence') +
  ggtitle("Raw Counts for 'gramophone,' 'film,' and 'typewriter'")

Stacked Bar Chart of 'gramophone', 'film', 'typewriter' occurrences in Dataset, 1915-1919
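If the melt step feels like magic, here is a rough sketch of the same wide-to-long reshaping using only base R, on invented counts. This is not how reshape2 does it internally; it just builds a table of the same shape (date, variable, value) by hand, and then sums per year and term, which is what the stacked bars effectively display.

```r
# A tiny "wide" table: one row per volume, one column per term.
# (Counts are made up for illustration.)
wide <- data.frame(date       = c(1915, 1915, 1916),
                   gramophone = c(0, 2, 1),
                   film       = c(1, 0, 4),
                   typewriter = c(0, 3, 0))
terms <- c("gramophone", "film", "typewriter")

# "Long" format: one row per (volume, term) pair.
# unlist() stacks the term columns end to end, so the dates are
# repeated once per term and each term name once per volume.
long <- data.frame(
  date     = rep(wide$date, times = length(terms)),
  variable = rep(terms, each = nrow(wide)),
  value    = unlist(wide[terms], use.names = FALSE)
)

# Sum counts for each year and term.
per_year <- aggregate(value ~ date + variable, data = long, FUN = sum)

print(long)
print(per_year)
```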

These are, though, raw counts. To normalize, we can divide the counts for our terms by the total and plot the result.

# We'll create a new data frame for our normalized data, 
# beginning with our simplified data.
gft.normalized <- gft.simple

# In this new dataframe, normalize our scores by dividing 
# the raw count in each row by the total in each row.
gft.normalized$gramophone <- gft.normalized$gramophone/gft.normalized$total
gft.normalized$film <- gft.normalized$film/gft.normalized$total
gft.normalized$typewriter <- gft.normalized$typewriter/gft.normalized$total

# How does it look?
> date gramophone         film typewriter     total
> 1 1915          0 0.000000e+00          0 106553905
> 2 1915          0 9.384921e-09          0 106553905
> 3 1915          0 0.000000e+00          0 106553905
> 4 1915          0 0.000000e+00          0 106553905
> 5 1915          0 0.000000e+00          0 106553905
> 6 1915          0 0.000000e+00          0 106553905

# Well, that looks about right. Let's begin our melt/plot 
# process again by creating a matrix.
gft.norm.m <- melt(gft.normalized[,c('date','gramophone','film','typewriter')],id.vars='date')
ggplot(gft.norm.m,aes(factor(date),y=value,fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Year') +
  ylab('Normalized Word Frequency (by Year)') +
  ggtitle("Normalized Scores for 'gramophone,' 'film,' and 'typewriter'")

Normalized, Stacked Bar Chart

Normalization makes some minor adjustments, but the picture is pretty similar. Not sure I would want to make any claims as to the importance or meaning of these graphs. They’re over a short historical span, and so far lack any richer contextualization. Like I said, for now, anticlimaxes.

Don’t get me wrong, Markdown’s great. Indeed, nearly all the writing I do now is in Markdown (or at least starts that way). There has been a good amount of writing about the virtues of Markdown for academic writing in particular, so I’ll just link to them here:

But Markdown, as it stands, has some drawbacks, which become acute when you are trying to extend it to cover the needs of academic writing (or, say, as a transcription format for texts).

The Problem

What I will describe as “problems” all stem from the fact that Markdown remains essentially a simplified syntax for HTML. A tool like Pandoc, which has a special (and especially powerful) flavor of Markdown all its own, helps reduce the borders between document formats. With Pandoc it becomes easy to convert HTML to LaTeX, or Rich Text Format to Word’s .docx. It could easily feel like Markdown is a universal document format: write it in Markdown, and publish as whatever.

That is a lovely dream—an easy-to-write plaintext format that can easily be output to any desired format. In reality, though, Markdown (even Pandoc’s Markdown) remains yoked to HTML, and so it suffers from some of its problems.

The problem I encounter most frequently in HTML (and in Markdown) concerns nesting a block quote within a paragraph. In short, can you have a block quote within a paragraph? If you’re writing HTML (or Markdown), the answer is no. HTML treats “block quotes” as block elements; this means that one cannot be contained within a paragraph (this restriction does not exist in LaTeX or TEI). Yet, what could be more common in writing on works of literature? Representing poetry presents its own problems for HTML and Markdown. (By contrast to the challenge presented by the mere fact of poetry, note the many syntaxes/tools available for fenced code blocks, syntax highlighting, and so on; Markdown, for now, remains of greatest interest to software developers and so reflects their habits and needs.) (Note: If you’re looking for practical advice, you can easily represent poetry in Pandoc’s Markdown using “line blocks”; this is not a perfect solution, but it will do for many needs.)

Perversely, Markdown also represents something of a step backward with regard to semantics. If you’ve spent some time with HTML, you may have noticed how HTML5 cements a model of HTML as a semantic markup language (with, implicitly, matters of presentation controlled by CSS). That means that the <i> tag, which long ago meant italics, has since acquired semantic meaning. According to the W3C, it should be used to “represent[] a span of text offset from its surrounding content without conveying any extra emphasis or importance, and for which the conventional typographic presentation is italic text; for example, a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, or a ship name.” In those instances where one wishes to express emphasis, use the <em> tag. If you need to mark a title, don’t simply italicize it, use <cite>. (But hold up, that cite element obscures the distinctions we normally make between italicizing certain titles and putting others in quotation marks.) In practice, of course, I doubt these distinctions are widely respected across the web; but all those at least potentially useful distinctions are lost in Markdown, whose syntax marks them all with * or _. Markdown is, in fact, rather unsemantic. (To a lesser degree, one might detect this tendency as well in the way headings, rather than divs, are Markdown’s primary way of structuring a document, but I’ll stop now.) So, two points: Markdown inherits HTML’s document model, which includes an inability to nest block-level elements within paragraphs; and in simplifying HTML, it produces a less semantically clear and rich format. (Technically, of course, one could simply include any HTML element for which Markdown offers no shortened syntax, like <cite> for example.)

A Solution

On the CommonMark forum, some folks have proposed additional syntax to fix the latter problem, and capture some of the semantic distinctions mentioned above (indeed, following the discussions over there has helped sensitize me to some of the challenges and limitations of Markdown as a sort of universal format donor). So, some of these issues could be resolved through extensions or modifications of Markdown.

Yet, given these deficits in Markdown, I wonder if it isn’t worth asking a more basic question—whether the plaintext format for “academic” writing should be so tightly yoked to HTML? If Markdown is, fundamentally, a simplified, plaintext syntax for HTML, could we imagine a similar, easy-to-write, plaintext format that wouldn’t be tied to HTML? Could we imagine, say, a format that would represent a simplification of syntax, not of HTML, but of a format better suited to the needs of representing more complex documents? Could we imagine a plaintext format that would be to TEI, say, what markdown is to HTML?

Such a format would not need to look particularly different from Markdown. Its syntax could overlap significantly; as in Pandoc’s Markdown format, file metadata (things like title, author, and so on) could appear (perhaps as YAML) at the front of the file (and be converted into elements within teiHeader). You could still use *, **, and []() as your chief tools; footnotes and references could be marked the same way (you could preserve Pandoc’s wonderful citation system, with such things represented as <refs> in TEI).

The most substantive difference would not be in syntax, but in the document model. Any Markdown file can contain HTML (all HTML is valid Markdown); this ensures that Markdown is never less powerful than HTML. But are the burdens of HTML worth bearing if one wishes to do scholarly/academic, or similar types of writing, in plaintext? Projects exist to repurpose Pandoc Markdown for scholarly writing: Tim T. Y. Lin’s ScholarlyMarkdown, or Martin Fenner’s similar project, or the workflow linked to above, by Dennis Tenen and Grant Wythoff at the Programming Historian. What I’m imagining, though, is entirely less practical than any of these projects at the moment because it would necessitate a change in the document model into which Markdown is converted. Pandoc works its magic by reading documents from a source format (through a “reader”) into an intermediary format (a format of its own that you can view by outputting -t native), which it can then output (through a “writer”). Could TEI (or some representation of it), essentially, fulfill that role as intermediary format? (A Pandoc car with a TEI engine swapped in?)

I like writing in plaintext, but I don’t love being bound by the peculiarities that Markdown has inherited from HTML. So, it is worth considering what it is that people like about Markdown. I suspect that most of the things people like about Markdown (free, easy to write, nonproprietary, easily usable with version control, and so on), have little to do with its HTML-based document model but stem from its being a plaintext format (and the existing infrastructure of scripts/apps/workflows around markdown). TEI provides an alternative document model—indeed, a richer document model. Imagine a version of Pandoc that uses TEI (or a simplified TEI subset) behind the scenes as its native format. Folks often complain about the complexity and verbosity of TEI (and XML more generally), and not without reason. I would certainly never want to write TEI; but a simplified TEI syntax that could then take advantage of all the virtues of TEI, that would be something.

[Closing Note: At one point I wondered how easy it would be to convert markdown to TEI with Pandoc… I’ve managed to finagle a set of scripts to do that; it’s janky, but for anyone interested, it’s here.]

Recall these lines from Clement C. Moore’s “A Visit from Saint Nicholas” (alternately titled “The Night Before Christmas” or “’Twas the Night Before Christmas”), first published in 1823. (See the Wikipedia page for some notes on contentions with regard to its authorship.)

When what to my wondering eyes should appear,
But a miniature sleigh and eight tiny reindeer…

But exactly how miniature is this sleigh, and how tiny are these reindeer? While Moore’s poem did a lot to consolidate the mythology of Santa Claus, one thing that has not remained of Moore’s Saint Nicholas is his height. Recalling this insistence on the tininess of Santa, much of the confusion around his movement through chimney flues is eliminated. But it also lends a different stress to the description of the elf’s nose as “like a cherry” or of his “little round belly” that shakes “like a bowl full of jelly.” At stake here is not simply nose complexion nor belly texture, but size.

If today our Santa is bigger, it was not always so. And many earlier illustrations are consistent with Moore’s text. Consider these from a 1912 edition of the poem, by Jessie Willcox Smith:


Santa Filling Stockings

Likewise, look at this svelte Santa, by Arthur Rackham from this undated edition [HathiTrust], who is clearly small enough to easily slip down that chimney:

Santa Emerging from Chimney

You can find more Santas at the Public Domain Review, including a gun-toting, WWII Santa, or listen to the poem on wax cylinder [1914].

Editions of the poem:
