The Project
What is TopicDrift?
TopicDrift is a data-driven map of how computer-science research has changed over time. We take decades of conference papers, discover the topics they cover, and watch those topics drift — emerging, peaking, merging, and fading across the years.
It started as a study of one venue, the International Conference on Software Engineering (ICSE), and grew into a single shared topic model fit across the entire indexed conference landscape. The same topics are then viewed through three lenses (ICSE alone, ten flagship venues, and every qualifying conference), so the themes stay directly comparable no matter how far you zoom out.
Methodology
How we went about it
Every paper flows through the same pipeline. We assemble metadata, recover the abstracts that make topic modeling possible, cluster the text into topics, and group those topics into human-readable themes — then slice the result by venue.
Collect
Parse the full DBLP dump for titles, authors, years, and DOIs across every indexed conference.
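As a rough illustration of the Collect step, the sketch below streams conference records out of DBLP-style XML with the standard library. The `iter_conference_papers` helper, the output field names, and the sample record are hypothetical, not the project's actual code. The real dump also relies on character entities declared in its DTD, so a DTD-aware parser may be needed in practice.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of DBLP records (not real data).
SAMPLE = b"""<dblp>
<inproceedings key="conf/icse/Doe20">
<author>Jane Doe</author>
<author>John Roe</author>
<title>A Hypothetical Paper.</title>
<year>2020</year>
<booktitle>ICSE</booktitle>
<ee>https://doi.org/10.0000/example</ee>
</inproceedings>
<article key="journals/x/Y19"><title>A journal article.</title><year>2019</year></article>
</dblp>"""

DOI_PREFIX = "https://doi.org/"


def iter_conference_papers(source):
    """Yield one dict per <inproceedings> record in a DBLP-style XML stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "inproceedings":
            continue  # skip child elements and non-conference records
        title_el = elem.find("title")
        year_text = elem.findtext("year")
        links = [e.text or "" for e in elem.findall("ee")]
        doi = next(
            (link[len(DOI_PREFIX):] for link in links if link.startswith(DOI_PREFIX)),
            None,
        )
        yield {
            "key": elem.get("key"),
            # Titles may contain inline markup, so join all nested text.
            "title": "".join(title_el.itertext()).strip() if title_el is not None else "",
            "authors": [a.text for a in elem.findall("author")],
            "year": int(year_text) if year_text and year_text.isdigit() else None,
            "venue": elem.findtext("booktitle"),
            "doi": doi,
        }
        elem.clear()  # drop this record's children once it has been consumed


papers = list(iter_conference_papers(io.BytesIO(SAMPLE)))
```

Streaming with `iterparse` avoids building the whole tree at once, which matters for a file the size of the full dump.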
Enrich
Match papers to OpenAlex to pull in abstracts, concept tags, and citation counts — the text the model actually reads.
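OpenAlex serves abstracts as an inverted index (`abstract_inverted_index`, mapping each word to its positions) rather than as plain text, so the Enrich step has to rebuild the text. The sketch below shows one way to do that. The `rebuild_abstract` and `enrich` helpers and the sample record are hypothetical, and how a work record is fetched and matched to a paper is left out.

```python
def rebuild_abstract(inverted_index):
    """Turn an OpenAlex-style {word: [positions]} mapping back into text."""
    if not inverted_index:
        return ""  # many works have no abstract at all
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _pos, word in sorted(positioned))


def enrich(paper, work):
    """Merge selected OpenAlex fields into a paper dict (hypothetical schema)."""
    return {
        **paper,
        "abstract": rebuild_abstract(work.get("abstract_inverted_index")),
        "citations": work.get("cited_by_count", 0),
        "concepts": [c["display_name"] for c in work.get("concepts", [])],
    }


# Hypothetical work record in the shape OpenAlex returns.
work = {
    "abstract_inverted_index": {"Topics": [0], "drift": [1, 4], "and": [2], "topics": [3]},
    "cited_by_count": 7,
    "concepts": [{"display_name": "Software engineering", "score": 0.8}],
}
enriched = enrich({"title": "A Hypothetical Paper."}, work)
```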
Cluster
Fit BERTopic across the full corpus — embed → UMAP → HDBSCAN — discovering 174 topics and assigning every paper to one.
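A minimal sketch of the embed, UMAP, HDBSCAN chain in BERTopic might look like the following. The embedding model name, every hyperparameter value, and the outlier-reduction step are assumptions for illustration, not the settings that produced the 174 topics. The heavy libraries are imported inside `fit_topics` so the settings can be inspected without installing them.

```python
def topic_model_settings(min_cluster_size=50, seed=42):
    """Hypothetical hyperparameters; the project's real values are not documented here."""
    return {
        "embedding_model": "all-MiniLM-L6-v2",
        "umap": {
            "n_neighbors": 15,
            "n_components": 5,
            "min_dist": 0.0,
            "metric": "cosine",
            "random_state": seed,  # fixed seed for repeatable layouts
        },
        "hdbscan": {
            "min_cluster_size": min_cluster_size,
            "metric": "euclidean",
            "cluster_selection_method": "eom",
            "prediction_data": True,
        },
    }


def fit_topics(abstracts, settings=None):
    """Fit BERTopic on a list of abstract strings and return (model, topic ids)."""
    # Imported lazily: these packages are large and optional for readers.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    settings = settings or topic_model_settings()
    model = BERTopic(
        embedding_model=SentenceTransformer(settings["embedding_model"]),
        umap_model=UMAP(**settings["umap"]),
        hdbscan_model=HDBSCAN(**settings["hdbscan"]),
    )
    topics, _probs = model.fit_transform(abstracts)
    # HDBSCAN marks outliers as topic -1. Reassigning them is one way to give
    # more papers a topic; whether the project does this is an assumption.
    topics = model.reduce_outliers(abstracts, topics)
    return model, topics
```

The number of topics is not set directly in this sketch. It falls out of the clustering, so `min_cluster_size` is the main lever for how fine-grained the topics are.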
Label & group
A local instruct LLM (Qwen2.5-3B) names each topic; we curate them into 10 overarching themes.
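The labelling and grouping step can be pictured as two small pieces: a prompt built from each topic's keywords, and a hand-maintained topic-to-theme table. The sketch below assumes both. The prompt wording, topic ids, labels, and theme names are all hypothetical, and the call to the local model is omitted.

```python
from collections import Counter


def build_label_prompt(keywords, sample_titles):
    """Assemble a naming prompt for one topic (hypothetical wording)."""
    titles = "\n".join(f"- {t}" for t in sample_titles)
    return (
        "Give a short, specific name for the research topic described by "
        f"these keywords: {', '.join(keywords)}.\n"
        f"Representative paper titles:\n{titles}\n"
        "Answer with the name only."
    )


# Hypothetical curated mapping from topic id to overarching theme.
THEME_OF_TOPIC = {0: "Testing", 1: "Testing", 2: "Human factors"}


def theme_counts(topic_ids):
    """Roll per-paper topic ids up into per-theme paper counts."""
    counts = Counter(THEME_OF_TOPIC.get(t, "Unassigned") for t in topic_ids)
    return dict(counts)


prompt = build_label_prompt(["fuzzing", "coverage"], ["A Hypothetical Title"])
rolled_up = theme_counts([0, 1, 1, 2, 99])
```

Keeping the mapping as plain data makes the curation easy to review, and the explicit "Unassigned" bucket exposes topics that were never mapped.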
Scope & visualize
Filter the one shared model to ICSE, a top-10 set, or all venues, and render streamgraphs, treemaps, and a drift search.
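Because every scope reuses the same topic assignments, scoping reduces to a venue filter applied before aggregation. The sketch below assumes each paper carries `venue`, `year`, and `theme` fields and that the streamgraph plots each theme's share per year. The field names, the share-based aggregation, and the placeholder venue set are assumptions, since the actual top-10 list is not given here.

```python
from collections import defaultdict

# Placeholder venue sets; None means "no venue filter".
SCOPES = {
    "icse": {"ICSE"},
    "top10": {"ICSE", "VENUE_B"},  # hypothetical stand-in for the ten venues
    "all": None,
}


def in_scope(paper, scope):
    venues = SCOPES[scope]
    return venues is None or paper["venue"] in venues


def theme_shares_by_year(papers, scope):
    """Return {year: {theme: share of that year's in-scope papers}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for paper in papers:
        if in_scope(paper, scope):
            counts[paper["year"]][paper["theme"]] += 1
    shares = {}
    for year, by_theme in sorted(counts.items()):
        total = sum(by_theme.values())
        shares[year] = {theme: n / total for theme, n in by_theme.items()}
    return shares


# Hypothetical records for illustration.
PAPERS = [
    {"venue": "ICSE", "year": 2020, "theme": "Testing"},
    {"venue": "ICSE", "year": 2020, "theme": "AI"},
    {"venue": "VENUE_B", "year": 2020, "theme": "AI"},
    {"venue": "OTHER", "year": 2021, "theme": "AI"},
]
icse_shares = theme_shares_by_year(PAPERS, "icse")
```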
Taxonomy
The ten themes
The 174 discovered topics roll up into ten overarching themes. Each theme keeps the same colour in every visualization.
What's Here to Use
Explore the data
Three interactive scopes, all built on the same topic model. Each pairs a theme-drift streamgraph with a corpus-composition treemap; ICSE adds a keyword-driven drift search.
ICSE
Where it began — ~11K ICSE papers, plus a keyword drift search to track any topic's rise and fall.
Top 10 venues
Ten flagship software-engineering & PL conferences, compared on the same shared theme space.
All conferences
2,000+ venues across all of computer science — watch fields like AI and security swell over the decades.
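For a sense of what the ICSE keyword drift search computes, one simple reading is the share of each year's papers that mention the keyword. The sketch below takes that reading. The matching rule (case-insensitive substring over title and abstract) and the field names are assumptions, and the deployed search may work differently.

```python
from collections import Counter


def keyword_drift(papers, keyword):
    """Return {year: fraction of that year's papers mentioning the keyword}."""
    needle = keyword.casefold()
    totals, hits = Counter(), Counter()
    for paper in papers:
        totals[paper["year"]] += 1
        text = f'{paper["title"]} {paper.get("abstract", "")}'.casefold()
        if needle in text:
            hits[paper["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical records for illustration.
SAMPLE_PAPERS = [
    {"year": 2019, "title": "Code review at scale"},
    {"year": 2020, "title": "Greybox fuzzing", "abstract": "We study fuzzing."},
    {"year": 2020, "title": "Refactoring", "abstract": "No match here."},
]
drift = keyword_drift(SAMPLE_PAPERS, "Fuzzing")
```

Normalising by each year's paper count keeps growth in the venue itself from looking like growth in the topic.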
Status
Where we've left it
The pipeline runs end to end: from raw DBLP and OpenAlex data, through a topic model fit on the full corpus, to the three published scopes you can explore above. The ICSE scope is the most reliable because it is the venue we have cleaned and validated most thoroughly. The topics and themes are not yet as accurate as they need to be, especially at the full "All Conferences" scale, and improving that accuracy is the main goal of the remaining work.
This is an in-progress course project. The biggest open items:
- Accuracy needs to improve. Topic and theme assignments are usable for spotting broad trends, but a meaningful fraction are mislabeled or coarse — particularly across the All Conferences scope. Treat the wide scopes as directional, not authoritative.
- The corpus is too broad right now. Our ingest is pulling in more than just computing-related papers, so non-CS work leaks into the corpus and distorts the themes. Tighter venue and content filtering is needed.
- More data cleaning required. Duplicates, missing or mismatched abstracts, and inconsistent venue metadata still need to be scrubbed before the topic space can be trusted at scale.
- Better topic & theme fitting. The clustering and the 174→10 theme grouping need another pass — finer-grained, better-separated topics and a more principled, less hand-curated theme mapping.
- Topic labels. Several LLM-generated topic names are noisy or overlap and would benefit from a human editing pass.
- Beyond conferences. Journals and arXiv preprints aren't yet included, so very recent shifts may lag.
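As a starting point for the data-cleaning item above, duplicates can be collapsed on DOI where one exists and on a normalised title plus year otherwise. This is a sketch of one possible rule, not the project's cleaning code. The `dedupe_key` and `dedupe` helpers and the field names are hypothetical.

```python
import re


def dedupe_key(paper):
    """Prefer the DOI as identity; fall back to normalised title plus year."""
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return ("title", title, paper.get("year"))


def dedupe(papers):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    kept = []
    for paper in papers:
        key = dedupe_key(paper)
        if key not in seen:
            seen.add(key)
            kept.append(paper)
    return kept


# Hypothetical records for illustration.
RECORDS = [
    {"doi": "10.0000/Example", "title": "A Paper", "year": 2020},
    {"doi": "10.0000/example", "title": "A paper.", "year": 2020},
    {"doi": None, "title": "Another  Paper!", "year": 2021},
    {"doi": None, "title": "another paper", "year": 2021},
]
cleaned = dedupe(RECORDS)
```

A record with a DOI and a copy of the same paper without one will not be merged by this rule, so a second pass on titles would still be needed.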