CS4973 · Northeastern University

TopicDrift

How research themes rise, fall, and reinvent themselves across five decades of computer-science conferences.

Michael Maaseide, Deen Khan, Lucas Oberwager, Reece DiGiacomo, Brian Looney

What is TopicDrift?

TopicDrift is a data-driven map of how computer-science research has changed over time. We take decades of conference papers, discover the topics they cover, and watch those topics drift — emerging, peaking, merging, and fading across the years.

It started as a study of one venue, the International Conference on Software Engineering (ICSE), and grew into a single shared topic model fit across every indexed conference. The same topics are then viewed through three lenses (ICSE alone, ten flagship venues, and every qualifying conference), so the themes stay directly comparable no matter how wide you zoom.

  • 2.3M papers with abstracts, clustered
  • 2,000+ DBLP conferences in the corpus
  • 174 discovered topics
  • 10 overarching research themes
  • ~11K ICSE papers (1976–2025)
  • 50 years of research traced

How we went about it

Every paper flows through the same pipeline. We assemble metadata, recover the abstracts that make topic modeling possible, cluster the text into topics, and group those topics into human-readable themes — then slice the result by venue.

1. Collect

Parse the full DBLP dump for titles, authors, years, and DOIs across every indexed conference.
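As a rough illustration of this step, the sketch below streams conference records out of DBLP-style XML using only the standard library. The sample record, the function name `iter_conference_papers`, and the output field names are made up for the example; the real dump is far larger and relies on DTD-defined character entities, which need extra handling that is not shown here.

```python
import io
import xml.etree.ElementTree as ET

# Tiny invented stand-in for the DBLP dump (same element names, fake content).
SAMPLE = b"""<dblp>
<inproceedings key="conf/icse/Doe99">
  <author>Jane Doe</author>
  <author>John Roe</author>
  <title>An Example Paper.</title>
  <year>1999</year>
  <booktitle>ICSE</booktitle>
  <ee>https://doi.org/10.0000/example</ee>
</inproceedings>
<article key="journals/x/Roe00">
  <title>Not a conference paper.</title>
  <year>2000</year>
</article>
</dblp>"""

DOI_PREFIX = "https://doi.org/"


def iter_conference_papers(source):
    """Yield one dict per <inproceedings> record, streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "inproceedings":
            continue  # skip journal articles and child elements
        title_el = elem.find("title")
        links = [ee.text or "" for ee in elem.findall("ee")]
        # Keep the first electronic-edition link that is a DOI, if any.
        doi = next(
            (u[len(DOI_PREFIX):] for u in links if u.startswith(DOI_PREFIX)),
            None,
        )
        year = elem.findtext("year")
        yield {
            "key": elem.get("key"),
            # Titles can contain inline markup, so join all text nodes.
            "title": "".join(title_el.itertext()) if title_el is not None else None,
            "authors": [a.text for a in elem.findall("author")],
            "year": int(year) if year else None,
            "venue": elem.findtext("booktitle"),
            "doi": doi,
        }
        elem.clear()  # release the record once it has been consumed


papers = list(iter_conference_papers(io.BytesIO(SAMPLE)))
print(papers[0]["title"], papers[0]["doi"])
```

Streaming with `iterparse` keeps memory bounded, which matters for a multi-gigabyte dump; loading the whole tree at once would not.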

2. Enrich

Match papers to OpenAlex to pull in abstracts, concept tags, and citation counts — the text the model actually reads.
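One detail worth illustrating: OpenAlex work records carry abstracts as an `abstract_inverted_index` (each word mapped to the positions where it occurs) rather than as plain text. A minimal sketch of turning that back into a readable abstract follows; the helper name `rebuild_abstract` and the example index are invented here, and this is not necessarily how the project does it.

```python
def rebuild_abstract(inverted_index):
    """Turn {word: [positions]} into plain text; None when there is no abstract."""
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then read the words back in order.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _pos, word in sorted(positioned))


# Invented example index; a word that occurs twice lists both positions.
example = {"Topic": [0], "models": [1], "drift": [2, 4], "and": [3]}
print(rebuild_abstract(example))  # prints "Topic models drift and drift"
```

Papers that come back with no index have no text for the model to read, which is why the headline corpus counts only papers with abstracts.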

3. Cluster

Fit BERTopic across the full corpus — embed → UMAP → HDBSCAN — discovering 174 topics and assigning every paper to one.
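For readers unfamiliar with BERTopic, the embed → UMAP → HDBSCAN chain can be wired up roughly as below. Every hyperparameter, the embedding model name, and the function name `fit_topics` are assumptions made for illustration, not the project's actual settings. The outlier-reduction call is likewise one possible way to end up with every paper assigned to a topic; the page does not say how that is achieved.

```python
def fit_topics(abstracts, min_cluster_size=50, seed=42):
    """Fit BERTopic on a list of abstract strings; returns (model, topics)."""
    # Imported lazily so this sketch loads even without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Assumed embedding model; any sentence-embedding model could stand in.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Reduce the embeddings to a few dimensions before clustering.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )

    # Density-based clustering; min_cluster_size controls topic granularity.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )

    topic_model = BERTopic(
        embedding_model=embedder,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )
    topics, _probs = topic_model.fit_transform(abstracts)

    # HDBSCAN marks unclustered documents as topic -1; this tries to
    # reassign them to one of the discovered topics.
    topics = topic_model.reduce_outliers(abstracts, topics)
    return topic_model, topics
```

The number of topics is not set directly in this sketch; it falls out of `min_cluster_size` and the data, so reaching a specific count such as 174 would take tuning or BERTopic's topic-reduction options.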

4. Label & group

A local instruct LLM (Qwen2.5-3B) names each topic; we curate them into 10 overarching themes.
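The grouping half of this step amounts to a lookup table from topic id to theme. The sketch below rolls per-topic paper counts up into themes and fails loudly if a topic was never assigned one; the topic ids, counts, and the names `roll_up` and `theme_of` are hypothetical, with only the theme names taken from this page.

```python
from collections import Counter


def roll_up(topic_counts, theme_of):
    """Sum per-topic paper counts into per-theme counts."""
    missing = sorted(set(topic_counts) - set(theme_of))
    if missing:
        # A curated mapping should cover every discovered topic.
        raise KeyError(f"topics without a theme: {missing}")
    themes = Counter()
    for topic_id, n_papers in topic_counts.items():
        themes[theme_of[topic_id]] += n_papers
    return dict(themes)


# Hypothetical topic ids and paper counts.
theme_of = {0: "Software Testing", 1: "Software Testing", 2: "Artificial Intelligence"}
topic_counts = {0: 120, 1: 80, 2: 300}
theme_counts = roll_up(topic_counts, theme_of)
print(theme_counts)  # {'Software Testing': 200, 'Artificial Intelligence': 300}
```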

5. Scope & visualize

Filter the one shared model to ICSE, a top-10 set, or all venues, and render streamgraphs, treemaps, and a drift search.
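Because all three scopes share one model, scoping reduces to filtering the assigned papers by venue before aggregating. The sketch below computes per-year theme shares, the kind of series a streamgraph could be drawn from. The record layout, the function name `theme_shares_by_year`, and the choice of shares over raw counts are all assumptions for the example.

```python
from collections import defaultdict


def theme_shares_by_year(papers, venues=None):
    """Return {year: {theme: share}} for papers in scope (venues=None means all)."""
    counts = defaultdict(lambda: defaultdict(int))
    for p in papers:
        if venues is not None and p["venue"] not in venues:
            continue  # outside the requested scope
        counts[p["year"]][p["theme"]] += 1

    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {theme: n / total for theme, n in counts[year].items()}
    return shares


# Invented records; "OTHER" stands in for any non-ICSE venue.
papers = [
    {"venue": "ICSE", "year": 2020, "theme": "Software Testing"},
    {"venue": "ICSE", "year": 2020, "theme": "Artificial Intelligence"},
    {"venue": "OTHER", "year": 2020, "theme": "Artificial Intelligence"},
    {"venue": "ICSE", "year": 2021, "theme": "Artificial Intelligence"},
]

icse_only = theme_shares_by_year(papers, venues={"ICSE"})
all_venues = theme_shares_by_year(papers)
print(icse_only[2020])
print(all_venues[2020])
```

Since topic and theme assignments never change between scopes, a theme's curve in the ICSE view and in the all-venues view is measuring the same thing over different paper sets.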

The ten themes

The 174 discovered topics roll up into ten overarching themes. The colours below are the same ones used throughout every visualization.

Artificial Intelligence
System Design
Emerging Platforms
Program Correctness
Defect Management
Human Factors in SE
Developer Tooling
Requirements Engineering
Software Testing
Software Process

Explore the data

Three interactive scopes, all built on the same topic model. Each pairs a theme-drift streamgraph with a corpus-composition treemap; ICSE adds a keyword-driven drift search.

Where we've left it

The pipeline runs end to end, from raw DBLP and OpenAlex data through a topic model fit on the full corpus to the three published scopes you can explore above. The ICSE scope is the most reliable, since it is the venue we have cleaned and validated most. The topics and themes are not yet as accurate as they need to be, especially at the full "All Conferences" scale, and improving that accuracy is the headline of the remaining work.

This is an in-progress course project. The biggest open items:

  • Accuracy needs to improve. Topic and theme assignments are usable for spotting broad trends, but a meaningful fraction are mislabeled or coarse — particularly across the All Conferences scope. Treat the wide scopes as directional, not authoritative.
  • The corpus is too broad right now. Our ingest is pulling in more than just computing-related papers, so non-CS work leaks into the corpus and distorts the themes. Tighter venue and content filtering is needed.
  • More data cleaning required. Duplicates, missing or mismatched abstracts, and inconsistent venue metadata still need to be scrubbed before the topic space can be trusted at scale.
  • Better topic & theme fitting. The clustering and the 174→10 theme grouping need another pass — finer-grained, better-separated topics and a more principled, less hand-curated theme mapping.
  • Topic labels. Several LLM-generated topic names are noisy or overlap and would benefit from a human editing pass.
  • Beyond conferences. Journals and arXiv preprints aren't yet included, so the corpus may lag behind very recent shifts in research.
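The duplicate scrubbing listed among the open items could start from something like the sketch below, which keeps the first record per DOI and falls back to a normalized title plus year when no DOI is present. The record layout and the names `title_key` and `dedupe` are hypothetical. Note one limitation: a duplicate pair where only one copy has a DOI would not be caught by this approach.

```python
import re
import unicodedata


def title_key(title):
    """Normalize a title: strip accents, lowercase, collapse punctuation and spaces."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", " ", ascii_title.lower()).strip()


def dedupe(papers):
    """Keep the first record per DOI, or per (normalized title, year) without one."""
    seen = set()
    unique = []
    for p in papers:
        doi = (p.get("doi") or "").lower()  # DOIs are case-insensitive
        key = ("doi", doi) if doi else ("title", title_key(p["title"]), p["year"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(p)
    return unique


# Invented records with two duplicate pairs.
records = [
    {"doi": "10.0000/X", "title": "A Study.", "year": 2001},
    {"doi": "10.0000/x", "title": "A study", "year": 2001},
    {"doi": None, "title": "Café Models!", "year": 2002},
    {"doi": None, "title": "cafe  models", "year": 2002},
]
cleaned = dedupe(records)
print(len(cleaned))  # prints 2
```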