Corpus Analysis

To set the scene for our genre classification experiments, we looked at the correlations between genres and topics and their evolution over time in the New York Times Annotated Corpus (NYTAC). The results motivate the search for genre classifiers that are robust to changes in the genre-topic distribution. The way we infer the genre and topic of a document in the corpus is described in our article. We identified 19 genres in the corpus with more than 1,000 documents each, namely: News Reports, Letters, Paid Death Notices, Statistics, Reviews, Editorials, Corrections, Obituaries, Summaries, Captions, Biographies, Schedules, Lists, Questions, Analyses, Texts, Interviews, Series, and Chronologies.

Genres and Topics

Next, we recorded the topic tags for the documents belonging to these genres. The tables below show the five most frequent topic tags for each genre; the frequencies are computed over only those documents that have at least one topic tag. As documents can have multiple topic tags, the percentages do not add up to 100%. Paid Death Notices were excluded from this analysis, as only 69 of the 132,026 documents in this category had an assigned topic tag.

News Reports
(935,516 texts, 92.3% with topic tags)
Politics and Government: 16.2%
U.S. Politics and Government: 7.5%
Medicine and Health: 5.9%

Letters
(138,000 texts, 87.4% with topic tags)
Politics and Government: 16.9%
U.S. Politics and Government: 11.2%
Medicine and Health: 9.7%
Education and Schools: 7.3%
International Relations: 6.4%

Statistics
(111,982 texts, 66.9% with topic tags)
Company Reports: 94.4%
Oil and Gasoline: 0.5%

Reviews
(110,533 texts, 95.4% with topic tags)
Books and Literature: 27.0%
Motion Pictures: 12.1%

Editorials
(53,518 texts, 94.0% with topic tags)
Politics and Government: 45.9%
U.S. Politics and Government: 28.9%
International Relations: 18.2%
U.S. International Relations: 15.7%

Corrections
(47,707 texts, 74.2% with topic tags)
Politics and Government: 12.5%
U.S. Politics and Government: 6.5%

Obituaries
(37,127 texts, 42.0% with topic tags)
Books and Literature: 6.0%
Politics and Government: 5.2%
Medicine and Health: 3.8%

Summaries
(31,005 texts, 5.0% with topic tags)
News and News Media: 31.4%
Politics and Government: 11.1%
U.S. Politics and Government: 6.6%

Captions
(30,571 texts, 78.8% with topic tags)
Politics and Government: 12.3%
Defense and Military Forces: 6.2%
U.S. Politics and Government: 6.0%
Demonstrations and Riots: 5.3%

Biographies
(17,740 texts, 94.0% with topic tags)
Politics and Government: 19.1%
U.S. Politics and Government: 6.7%

Schedules
(16,117 texts, 26.3% with topic tags)
Stocks and Bonds: 20.5%

Lists
(14,083 texts, 46.5% with topic tags)
Athletics and Sports: 16.3%
Politics and Government: 14.9%

Questions
(6,863 texts, 58.5% with topic tags)
Gardens and Gardening: 11.2%
Home Repairs: 10.8%
Computers and the Internet: 7.8%
Travel and Vacations: 7.7%
Science and Technology: 7.7%

Analyses
(6,346 texts, 96.3% with topic tags)
Politics and Government: 55.4%
U.S. Politics and Government: 32.5%
International Relations: 25.4%
U.S. International Relations: 21.3%

Texts
(3,841 texts, 92.2% with topic tags)
Politics and Government: 59.6%
U.S. Politics and Government: 43.2%
International Relations: 22.7%
U.S. International Relations: 20.4%

Interviews
(3,537 texts, 93.4% with topic tags)
Medicine and Health: 9.7%
Books and Literature: 9.0%
Stocks and Bonds: 8.0%

Series
(3,281 texts, 96.8% with topic tags)
Surveys and Series: 72.0%
Politics and Government: 23.1%
Medicine and Health: 14.5%

Chronologies
(1,063 texts, 88.4% with topic tags)
Politics and Government: 28.0%
U.S. Politics and Government: 13.2%
International Relations: 10.2%
Defense and Military Forces: 9.5%
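
The per-genre figures in the tables above (number of texts, share of texts with topic tags, and most frequent tags) can be computed along the following lines. This is only a minimal sketch: the function name `top_topics` and the input layout, a sequence of `(genre, topic_tags)` pairs, are assumptions made for illustration and are not taken from the corpus tooling or from our article.

```python
from collections import Counter


def top_topics(docs, genre, k=5):
    """Summarise the topic tags of one genre.

    docs: iterable of (genre, topic_tags) pairs (hypothetical layout).
    Returns (n_docs, share_tagged, [(topic, share), ...]) where each
    topic share is relative to the documents with at least one tag.
    """
    n_docs = 0
    n_tagged = 0
    counts = Counter()
    for doc_genre, tags in docs:
        if doc_genre != genre:
            continue
        n_docs += 1
        if tags:
            n_tagged += 1
            counts.update(set(tags))  # count each tag once per document
    if n_tagged == 0:
        return n_docs, 0.0, []
    top = [(tag, n / n_tagged) for tag, n in counts.most_common(k)]
    return n_docs, n_tagged / n_docs, top


# Toy data, invented for illustration only.
toy_docs = [
    ("Reviews", ["Books and Literature"]),
    ("Reviews", ["Motion Pictures", "Books and Literature"]),
    ("Reviews", []),
    ("Editorials", ["Politics and Government"]),
]
n_docs, share_tagged, top = top_topics(toy_docs, "Reviews")
```

In the toy data, the two tag shares for reviews sum to more than 100% because one document carries both tags, which is why the percentages in the tables are not additive.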

Strong genre-topic correlations are present throughout. Not surprisingly, death is by far the most common topic for obituaries, but not for editorials or reviews. Less obviously, politics is a frequent topic in many genres, but not in questions or interviews. We also observe that some genres are concentrated on a very small range of topics (e.g. editorials are very often about politics or related topics), whereas others, such as questions or news reports, are spread more evenly over a much larger range of topics.

While these particular correlations are not surprising, they do confirm that genre classification is closely linked to topic classification. This could either benefit or harm automated systems, depending on whether the genre-topic distribution remains stable over time.

Changing Genre-Topic Correlations in the NYTAC

To get a feeling for stability over time within the NYTAC, we analyzed how the distribution of genres has evolved for documents of a given topic and, conversely, how the distribution of topics has evolved for documents of a given genre. The figures below illustrate this for two genres and two topics.

Each point in the upper charts is the percentage of documents about music or literature among the news reports or reviews published in a given year. These values can be read as the probability of a topic given a genre and year. The upper left chart shows that both topics occurred more often in news reports after 1996 than before. The upper right chart shows that, for reviews, the two topics were about equally probable in 1987, but only intermittently so after that.

Each point in the lower charts is the percentage of news reports or reviews among the documents on music or literature published in a given year. These values can be read as the probability of a genre given a topic and year. Articles about music (lower left chart) were more likely to be reviews than news reports before 1999, but less likely thereafter. A similar trend can be observed for articles about literature (lower right chart), although reviews remain more probable than news reports for this topic throughout the whole period from 1987 to 2007.
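
The two kinds of yearly percentages described above can be sketched as follows. The function name `yearly_conditionals`, the `(year, genre, topic_tags)` record layout, and the tag label "Music" are assumptions for illustration. The sketch also assumes that the denominator for the topic-given-genre values is all documents of that genre in that year (tagged or not), and that the denominator for the genre-given-topic values is all documents with that topic tag in that year, regardless of genre; the charts may have been normalised differently.

```python
from collections import defaultdict


def yearly_conditionals(docs, genres, topics):
    """Estimate P(topic | genre, year) and P(genre | topic, year).

    docs: iterable of (year, genre, topic_tags) records (hypothetical layout).
    genres, topics: sets of genre names and topic tags of interest.
    Returns two dicts keyed by (year, genre, topic); combinations that
    never co-occur are simply absent (probability zero).
    """
    genre_total = defaultdict(int)  # (year, genre) -> number of documents
    topic_total = defaultdict(int)  # (year, topic) -> number of documents
    joint = defaultdict(int)        # (year, genre, topic) -> number of documents
    for year, genre, tags in docs:
        if genre in genres:
            genre_total[(year, genre)] += 1
        for topic in set(tags) & topics:
            topic_total[(year, topic)] += 1
            if genre in genres:
                joint[(year, genre, topic)] += 1
    p_topic = {key: n / genre_total[(key[0], key[1])] for key, n in joint.items()}
    p_genre = {key: n / topic_total[(key[0], key[2])] for key, n in joint.items()}
    return p_topic, p_genre


# Toy data, invented for illustration only.
toy_docs = [
    (1987, "Reviews", ["Music"]),
    (1987, "Reviews", ["Books and Literature"]),
    (1987, "News Reports", ["Music"]),
    (1987, "News Reports", ["Politics and Government"]),
    (1987, "News Reports", []),
    (1987, "Letters", ["Music"]),
]
p_topic, p_genre = yearly_conditionals(
    toy_docs,
    genres={"Reviews", "News Reports"},
    topics={"Music", "Books and Literature"},
)
```

Plotting `p_topic` per genre against the year would correspond to the upper charts, and plotting `p_genre` per topic to the lower charts.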

These examples show that the genre-topic distribution in the NYTAC has changed over time. The degree of change certainly varies across the genres and topics in the corpus. However, these examples are by no means exceptional, as similar tendencies can be observed for other genre-topic pairs as well. What they demonstrate is that one should not generally assume that genre-topic correlations remain fixed in an actively growing text corpus.

While genre classifiers typically do not use the topic of a text as an explicit feature, they may do so implicitly through features that correlate strongly with topic. As such, they may accidentally rely on genre-topic correlations to achieve good results. Because such results could suffer from any change in the distribution (unless the classifiers are re-trained regularly), one might prefer a classification method that uses text features that are indicative of genres, but not of topics. This may hold even when the learned model is applied in the same domain, as the example of the NYTAC shows. This finding motivates the search for stable genre classifiers.