TopicDrift — How Research Themes Evolve Across CS Conferences

The Project

What is TopicDrift?

TopicDrift is a data-driven map of how computer-science research has changed over time. We take decades of conference papers, discover the topics they cover, and watch those topics drift — emerging, peaking, merging, and fading across the years.

It started as a study of a single venue, the International Conference on Software Engineering (ICSE), and grew into one shared topic model fit across the DBLP-indexed conference corpus. The same topics are then viewed through three lenses (ICSE alone, ten flagship venues, and every qualifying conference), so the themes stay directly comparable no matter how far you zoom out.

By the Numbers

2.3M papers with abstracts, clustered

2,000+ DBLP conferences in the corpus

174 discovered topics

10 overarching research themes

~11K ICSE papers (1976–2025)

50 years of research traced

Methodology

How we went about it

Every paper flows through the same pipeline. We assemble metadata, recover the abstracts that make topic modeling possible, cluster the text into topics, and group those topics into human-readable themes — then slice the result by venue.

1. Collect

Parse the full DBLP dump for titles, authors, years, and DOIs across every indexed conference.

2. Enrich

Match papers to OpenAlex to pull in abstracts, concept tags, and citation counts. The abstracts are the text the model actually reads.

3. Cluster

Fit BERTopic across the full corpus (embed → UMAP → HDBSCAN), discovering 174 topics and assigning every paper to one.

4. Label & group

A local instruction-tuned LLM (Qwen2.5-3B) names each topic; we then curate the topics into 10 overarching themes.

5. Scope & visualize

Filter the one shared model to ICSE, a top-10 set, or all venues, and render streamgraphs, treemaps, and a drift search.
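
As a rough illustration of the Collect step, the sketch below streams conference records out of DBLP-style XML using only the Python standard library. The sample records, the `iter_conference_papers` name, and the output fields are assumptions made for illustration, not the project's actual code. The real dump also relies on a DTD for character entities, which this sketch does not handle.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample shaped like DBLP <inproceedings> records (illustrative only).
SAMPLE = b"""<dblp>
<inproceedings key="conf/icse/Example01">
<author>A. Author</author>
<author>B. Author</author>
<title>An Example Paper.</title>
<year>2001</year>
<booktitle>ICSE</booktitle>
<ee>https://doi.org/10.1000/example.1</ee>
</inproceedings>
<article key="journals/x/Example02">
<title>A Journal Paper.</title>
<year>2002</year>
</article>
</dblp>"""

DOI_PREFIX = "https://doi.org/"


def iter_conference_papers(source):
    """Stream <inproceedings> records from a DBLP-style XML file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "inproceedings":
            continue  # skip child tags, journal articles, and the root
        links = [ee.text or "" for ee in elem.findall("ee")]
        # Keep the first link that is a DOI URL, with the resolver prefix removed.
        doi = next((u[len(DOI_PREFIX):] for u in links if u.startswith(DOI_PREFIX)), None)
        title = elem.find("title")
        yield {
            "key": elem.get("key"),
            "title": "".join(title.itertext()) if title is not None else "",
            "authors": [a.text for a in elem.findall("author")],
            "year": int(elem.findtext("year", default="0")),
            "venue": elem.findtext("booktitle", default=""),
            "doi": doi,
        }
        elem.clear()  # release the record's children once it has been consumed


papers = list(iter_conference_papers(io.BytesIO(SAMPLE)))
print(papers[0]["title"], papers[0]["doi"])
```

The DOI extracted here is the natural join key for the Enrich step, although papers without a DOI would need a fallback match such as title plus year.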

Taxonomy

The ten themes

The 174 discovered topics roll up into ten overarching themes. Each theme keeps the same color in every visualization.

Artificial Intelligence

System Design

Emerging Platforms

Program Correctness

Defect Management

Human Factors in SE

Developer Tooling

Requirements Engineering

Software Testing

Software Process
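
The roll-up from topics to themes can be pictured as a simple lookup table. The sketch below assumes integer topic ids and a hand-curated mapping; the ids and the assignments shown are hypothetical, and the real table would cover all 174 topics.

```python
from collections import Counter

# Hypothetical topic-id -> theme mapping (only a few entries shown).
TOPIC_TO_THEME = {
    0: "Software Testing",
    1: "Software Testing",
    2: "Artificial Intelligence",
    3: "Requirements Engineering",
}


def theme_counts(paper_topics, topic_to_theme, fallback="Unmapped"):
    """Count papers per theme, given the topic id assigned to each paper."""
    return Counter(topic_to_theme.get(topic, fallback) for topic in paper_topics)


# Six papers; topic 99 has no theme yet, so it lands in the fallback bucket.
counts = theme_counts([0, 1, 1, 2, 3, 99], TOPIC_TO_THEME)
print(counts.most_common())
```

Keeping an explicit fallback bucket makes unmapped topics visible instead of silently dropping their papers.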

What's Here to Use

Explore the data

Three interactive scopes, all built on the same topic model. Each pairs a theme-drift streamgraph with a corpus-composition treemap; ICSE adds a keyword-driven drift search.
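
To make the scoping concrete, here is a minimal sketch of how one shared set of assignments could feed every scope. It assumes each paper is reduced to a (venue, year, theme) row and that the streamgraph plots each theme's share of papers per year; the project may plot raw counts instead. The rows, the `SCOPES` table, and the function name are illustrative, not the project's actual data or code.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (venue, year, theme).
PAPERS = [
    ("ICSE", 2000, "Software Testing"),
    ("ICSE", 2000, "Software Process"),
    ("ICSE", 2001, "Software Testing"),
    ("FSE", 2000, "Artificial Intelligence"),
    ("NeurIPS", 2001, "Artificial Intelligence"),
]

SCOPES = {
    "icse": {"ICSE"},
    "top10": {"ICSE", "FSE"},  # placeholder set, not the project's real top-10 list
    "all": None,               # None means no venue filter
}


def theme_shares_by_year(papers, venues=None):
    """Return {year: {theme: share}} for the papers inside the given scope."""
    counts = defaultdict(Counter)
    for venue, year, theme in papers:
        if venues is None or venue in venues:
            counts[year][theme] += 1
    shares = {}
    for year, per_theme in sorted(counts.items()):
        total = sum(per_theme.values())
        shares[year] = {theme: n / total for theme, n in per_theme.items()}
    return shares


icse_shares = theme_shares_by_year(PAPERS, SCOPES["icse"])
all_shares = theme_shares_by_year(PAPERS, SCOPES["all"])
print(icse_shares[2000])
```

Because every scope filters the same assignments rather than refitting the model, a theme means the same thing in each view.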

Single venue

ICSE

Where it began: about 11K ICSE papers, plus a keyword drift search to track any topic's rise and fall.

Open ICSE →

Curated set

Top 10 venues

Ten flagship software engineering and programming languages (PL) conferences, compared in the same shared theme space.

Open Top 10 →

Full corpus

All conferences

2,000+ venues across all of computer science — watch fields like AI and security swell over the decades.

Open All Conferences →
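
One plausible reading of the keyword drift search is a per-year match rate: for each year, the fraction of papers whose text mentions the keyword. The sketch below assumes exactly that, with made-up paper texts and a hypothetical `keyword_drift` function; the search on the ICSE page may work differently.

```python
import re
from collections import Counter

# Hypothetical (year, text) pairs standing in for titles or abstracts.
PAPERS = [
    (1995, "A formal approach to requirements"),
    (2015, "Deep learning for defect prediction"),
    (2015, "Testing mobile apps"),
    (2020, "Deep Learning models of code"),
    (2020, "Learning-based program repair"),
]


def keyword_drift(papers, keyword):
    """Return {year: fraction of that year's papers mentioning the keyword}."""
    # Whole-phrase, case-insensitive match.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    totals, hits = Counter(), Counter()
    for year, text in papers:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


drift = keyword_drift(PAPERS, "deep learning")
print(drift)
```

Normalizing by the number of papers per year keeps a keyword from appearing to rise simply because the venue published more papers.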

Status

Where we've left it

The pipeline runs end to end: from the raw DBLP and OpenAlex data, through a topic model fit on the full corpus, to the three published scopes above. The ICSE scope is the most reliable because it is the venue we have cleaned and validated most thoroughly. The topics and themes are not yet as accurate as they need to be, especially at the full "All Conferences" scale, and improving that accuracy is the main focus of the remaining work.

This is an in-progress course project. The biggest open items:

Accuracy needs to improve. Topic and theme assignments are usable for spotting broad trends, but a meaningful fraction are mislabeled or coarse — particularly across the All Conferences scope. Treat the wide scopes as directional, not authoritative.
The corpus is too broad right now. Our ingest is pulling in more than just computing-related papers, so non-CS work leaks into the corpus and distorts the themes. Tighter venue and content filtering is needed.
More data cleaning is required. Duplicates, missing or mismatched abstracts, and inconsistent venue metadata still need to be scrubbed before the topic space can be trusted at scale.
Better topic and theme fitting. The clustering and the 174-to-10 theme grouping need another pass: finer-grained, better-separated topics and a more principled, less hand-curated theme mapping.
Topic labels. Several LLM-generated topic names are noisy or overlap and would benefit from a human editing pass.
Beyond conferences. Journals and arXiv preprints aren't yet included, so very recent shifts may lag.
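
For the duplicate-scrubbing item above, a first pass could key each record on its DOI and fall back to a normalized title plus year when no DOI is present. This is a minimal sketch under that assumption; the record fields and helper names are hypothetical, and it is not the cleaning step the project currently runs.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedupe(records):
    """Keep the first record per DOI, or per (normalized title, year) without one."""
    seen = set()
    kept = []
    for rec in records:
        doi = (rec.get("doi") or "").lower()  # DOIs are case-insensitive
        if doi:
            key = ("doi", doi)
        else:
            key = ("title", normalize_title(rec["title"]), rec["year"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


records = [
    {"doi": "10.1000/X.1", "title": "Étude of Testing.", "year": 2001},
    {"doi": "10.1000/x.1", "title": "Etude of testing", "year": 2001},
    {"doi": None, "title": "A Second Paper", "year": 2002},
    {"doi": "", "title": "a second  paper!", "year": 2002},
    {"doi": None, "title": "A Second Paper", "year": 2003},
]
unique = dedupe(records)
print(len(unique))
```

A known gap in this sketch: a record with a DOI and a copy of the same paper without one are treated as different, so a second title-based pass would still be needed.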