How It Works · 12 min read

Inside Our Selection Algorithm: The Technical Details

By The Academic Digest Team

This is the technical version of our selection pipeline walkthrough. For a higher-level overview of how the algorithm works, see Inside Our Selection Engine. This post goes deeper into the specific scoring functions, the weight calibration, the cross-field discovery math, and the feedback loop that lets the system improve over time.

It is written for researchers who want to understand exactly what is happening when they set up a project, and for the technically curious who want to see how a multi-signal ranking system is built.

The architecture

The Academic Digest selection algorithm runs once per week against the corpus of new papers indexed from 290+ peer-reviewed journals and preprint servers. The pipeline has five stages:

  1. Collection. Pull new papers from RSS feeds and from the Crossref and OpenAlex APIs. More than 100,000 papers per week enter the candidate pool.
  2. Filtering. Apply topic-level filters to discard papers that are obviously irrelevant (e.g., a mathematics paper for a cancer biology project). This narrows the candidate pool from more than 100,000 papers to typically between 500 and 5,000 per project.
  3. Multi-signal ranking. Score each remaining candidate against the five signals described below.
  4. Selection and diversity constraints. Pick the top N papers (5 on the free plan; up to 10 per project on Premium, across 4 projects), subject to diversity constraints that stop any single journal from dominating and preserve sub-topic variety.
  5. AI summarisation. For each selected paper, extract 3 to 5 key findings from the abstract.

The interesting part is stage 3 — the multi-signal ranking.

The five signals

Each candidate paper is scored against five independent signals, normalised to a 0-to-1 scale, and combined into a single composite score.

Signal 1: Semantic relevance (weight 0.40)

The semantic relevance signal measures how topically relevant the paper is to the researcher's declared interests. It is computed in two parts.

First, a lexical BM25 score is computed between the paper's title and abstract and the researcher's keywords. BM25 is the standard ranking function used in information retrieval; it scores a document by how often the query terms occur in it, gives more weight to terms that are rare across the corpus, and normalises for document length. This catches exact keyword matches.
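As a rough illustration of the lexical part, here is a minimal BM25 sketch. It assumes whitespace tokenisation and the commonly used parameter values k1 = 1.5 and b = 0.75; the tokeniser and parameters used in production are not specified here, and `bm25_score` is a hypothetical name.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against the query terms with BM25.

    corpus is a list of tokenised documents, used for IDF and average length.
    k1 and b are common defaults, assumed for illustration only.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    # Longer-than-average documents are penalised, shorter ones favoured
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score

# Toy "title + abstract" texts, invented for the example
corpus = [
    "tumour suppressor gene expression in lung cancer".split(),
    "graph neural networks for molecule property prediction".split(),
    "cancer immunotherapy response and gene signatures".split(),
]
keywords = ["cancer", "gene"]
scores = [bm25_score(keywords, doc, corpus) for doc in corpus]
```

The second document shares no keyword with the query and scores zero, which is exactly the gap the embedding-based score is meant to cover.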

Second, a semantic similarity score is computed using sentence embeddings from a fine-tuned transformer model. The paper's abstract is encoded into a vector, and the cosine similarity is computed against the centroid of the researcher's topic and keyword embeddings. This catches synonyms and conceptually related work that does not share lexical terms.
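The centroid comparison can be sketched in a few lines. The vectors below are toy 3-dimensional stand-ins for real sentence embeddings, and `centroid` and `cosine_similarity` are hypothetical helper names, not the production code.

```python
import math

def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy embeddings of the researcher's topics and keywords
interest_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
interest_centroid = centroid(interest_embeddings)

# An abstract that blends both interests, and one that touches neither
blended_abstract = [1.0, 1.0, 0.0]
unrelated_abstract = [0.0, 0.0, 1.0]

on_topic = cosine_similarity(blended_abstract, interest_centroid)
off_topic = cosine_similarity(unrelated_abstract, interest_centroid)
```

Because cosine similarity ignores vector length, an abstract pointing in the same direction as the centroid scores close to 1 regardless of how long the abstract is.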

The two scores are combined:

`` semantic_relevance = 0.4 * normalised_bm25 + 0.6 * normalised_cosine_similarity ``

This weighting biases toward semantic similarity, because lexical matches alone miss synonyms. The transformer model used is fine-tuned on scientific text from the same domains the system serves, which improves the quality of the embeddings for research-paper language.
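A small sketch of the combination step follows. How the two raw scores are normalised is not spelled out above, so this example assumes min-max scaling across the candidate pool; that choice and the helper names are assumptions for illustration.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def semantic_relevance(bm25_scores, cosine_scores):
    """Blend the two normalised scores with the 0.4 / 0.6 weights from the formula."""
    norm_bm25 = min_max(bm25_scores)
    norm_cosine = min_max(cosine_scores)
    return [0.4 * b + 0.6 * c for b, c in zip(norm_bm25, norm_cosine)]

# Three candidate papers: raw BM25 and raw cosine similarity (toy numbers)
bm25_scores = [0.0, 2.0, 4.0]
cosine_scores = [0.2, 0.8, 0.5]
relevance = semantic_relevance(bm25_scores, cosine_scores)
```

In this toy pool the second paper (about 0.8) outranks the third (about 0.7) even though the third has the strongest keyword match, which is the intended effect of weighting semantic similarity more heavily.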

Signal 2: Topic alignment (weight 0.20)

The topic alignment signal measures how well the paper fits the broader research area defined by the researcher's topic selections.

This is computed using a document classifier — a model trained on labelled research-paper abstracts that predicts the paper's topic from a controlled vocabulary. The classifier outputs a probability distribution over topics, and the score is the probability mass on the researcher's declared topics.

`` topic_alignment = sum(topic_probability[topic] for topic in researcher_topics) ``

This catches papers that may not match the specific keywords but are clearly within the researcher's broader research area.
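For a concrete feel, here is a minimal sketch of the sum. It assumes the classifier returns a single probability distribution over topics (values summing to 1), so the score stays between 0 and 1; the topic labels and numbers are invented for the example.

```python
def topic_alignment(topic_probability, researcher_topics):
    """Probability mass the classifier places on the researcher's declared topics."""
    # set() guards against a topic being counted twice if it is declared twice
    return sum(topic_probability.get(topic, 0.0) for topic in set(researcher_topics))

# Hypothetical classifier output for one abstract
classifier_output = {
    "oncology": 0.55,
    "immunology": 0.25,
    "genomics": 0.15,
    "mathematics": 0.05,
}

aligned = topic_alignment(classifier_output, ["oncology", "immunology"])
off_field = topic_alignment(classifier_output, ["mathematics"])
```

Here the paper scores about 0.8 for an oncology and immunology project and 0.05 for a mathematics project.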

Signal 3: Scientific impact (weight 0.15)

The scientific impact signal measures the journal tier and the citation-weighted influence of the authors.

The journal tier is determined by a curated mapping maintained internally. Elite journals (Nature, Science, Cell, NEJM, PNAS, BMJ, Lancet, JAMA) score 1.0. Top-tier journals score 0.85. Mid-tier journals score 0.6. Lower-tier journals score 0.4.

The author influence signal is computed as the average citation count per paper for the paper's authors, weighted by recency.

`` scientific_impact = 0.7 * normalised_journal_tier + 0.3 * normalised_author_influence ``

The weights intentionally bias toward journal tier because it is the more reliable signal of broad scientific impact.
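The effect of that bias is easy to see with two extreme cases. This sketch treats the tier scores listed above as already normalised and takes the author influence as a ready-made value between 0 and 1, since its normalisation is not described here; the tier labels used as dictionary keys are hypothetical.

```python
# Tier scores from the mapping described above; the keys are illustrative labels
JOURNAL_TIER_SCORE = {"elite": 1.0, "top": 0.85, "mid": 0.6, "lower": 0.4}

def scientific_impact(journal_tier, normalised_author_influence):
    """Blend journal tier and author influence with the 0.7 / 0.3 weights."""
    tier_score = JOURNAL_TIER_SCORE[journal_tier]
    return 0.7 * tier_score + 0.3 * normalised_author_influence

# Elite journal, authors with no citation record yet
elite_new_authors = scientific_impact("elite", 0.0)

# Lower-tier journal, maximally influential authors
lower_star_authors = scientific_impact("lower", 1.0)
```

The first case comes out at about 0.70 and the second at about 0.58, so an elite venue with unknown authors still outscores a lower-tier venue with highly cited ones.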

Signal 4: Author h-index in field (weight 0.10)

The h-index in field signal is a precision-improvement signal: papers whose authors have a strong h-index in the subscriber's specific field receive a small boost.

The h-index is computed per author per field. We use author disambiguation by name and affiliation, with manual verification for high-impact authors. The h-index in field is the standard h-index restricted to papers in the relevant field.

`` h_index_in_field = min(author_h_index_in_field, 100) / 100 ``

The cap at 100 is to prevent extreme outliers from dominating. The signal is intentionally small (weight 0.10) because it is a tie-breaker, not a primary relevance signal.
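For readers who have not computed an h-index by hand, the sketch below shows the standard definition applied to a list of in-field citation counts, followed by the cap and scaling from the formula. How the value is aggregated across several co-authors is not specified here, so the example scores a single author.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h

def h_index_signal(citation_counts_in_field, cap=100):
    """Cap the in-field h-index at 100 and scale it to the 0-to-1 range."""
    return min(h_index(citation_counts_in_field), cap) / cap

# Citation counts for one author's papers in the relevant field (toy numbers)
in_field_citations = [50, 18, 7, 4, 4, 1]
author_h = h_index(in_field_citations)
signal = h_index_signal(in_field_citations)
```

This author has four papers with at least four citations each, so the in-field h-index is 4 and the signal is 0.04.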

Signal 5: Cross-field discovery bonus (weight 0.15)

The cross-field discovery bonus is the signal that surfaces relevant papers from outside the researcher's primary field. This is the most distinctive feature of the ranking algorithm.

The bonus is computed as follows:

``
if topic_alignment < 0.3 and semantic_relevance > 0.5:
    cross_field_bonus = min((semantic_relevance - 0.5) * 2, 0.5)
else:
    cross_field_bonus = 0
``

In plain language: if the paper is not in the researcher's declared topics (low topic alignment) but is semantically relevant (high semantic relevance), it receives a bonus proportional to its semantic relevance. This catches papers from adjacent fields that are conceptually relevant but not on-topic.

The bonus is capped at 0.5 to prevent cross-field papers from dominating the digest. The goal is to include cross-field work without overwhelming the digest with it.
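A few worked cases make the thresholds concrete. This sketch applies the rule together with the 0.5 cap described above; the function name is hypothetical and the input values are invented.

```python
def cross_field_bonus(topic_alignment, semantic_relevance, cap=0.5):
    """Bonus for off-topic but semantically relevant papers, capped at 0.5."""
    if topic_alignment < 0.3 and semantic_relevance > 0.5:
        return min((semantic_relevance - 0.5) * 2, cap)
    return 0.0

# Off-topic and moderately relevant: small bonus
adjacent_field = cross_field_bonus(topic_alignment=0.1, semantic_relevance=0.6)

# Off-topic and highly relevant: the raw value 0.8 is clipped by the cap
strongly_relevant = cross_field_bonus(topic_alignment=0.1, semantic_relevance=0.9)

# On-topic paper: no bonus, it already scores well on topic alignment
on_topic = cross_field_bonus(topic_alignment=0.5, semantic_relevance=0.9)

# Off-topic and not relevant enough: no bonus
weak_match = cross_field_bonus(topic_alignment=0.1, semantic_relevance=0.4)
```

With these numbers the cap is reached once semantic relevance is 0.75 or higher, so every strongly relevant cross-field paper receives the same maximum bonus.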

The composite score

The five signals are combined into a single composite score:

``
composite_score = (
    0.40 * semantic_relevance
    + 0.20 * topic_alignment
    + 0.15 * scientific_impact
    + 0.10 * h_index_in_field
    + 0.15 * cross_field_bonus
)
``

The weights are initialised based on prior research on literature recommendation systems, then calibrated against a held-out set of papers labelled as relevant or not relevant by a panel of researchers. After launch, the weights are further tuned based on subscriber feedback (likes and skips) over time.

The composite score is a number between 0 and 1. Papers are sorted by composite score in descending order, and the top N (5 for free, up to 10 per project for Premium) are selected.
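Putting the pieces together, the following sketch scores three invented papers and keeps the top two. The signal values are made up, and `composite_score` and `top_n` are hypothetical helpers rather than the production code.

```python
WEIGHTS = {
    "semantic_relevance": 0.40,
    "topic_alignment": 0.20,
    "scientific_impact": 0.15,
    "h_index_in_field": 0.10,
    "cross_field_bonus": 0.15,
}

def composite_score(signals):
    """Weighted sum of the five signals, each already on a 0-to-1 scale."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())

def top_n(papers, n):
    """Sort by composite score, best first, and keep the first n papers."""
    ranked = sorted(papers, key=lambda p: composite_score(p["signals"]), reverse=True)
    return ranked[:n]

papers = [
    {"id": "A", "signals": {"semantic_relevance": 0.9, "topic_alignment": 0.8,
                            "scientific_impact": 0.6, "h_index_in_field": 0.2,
                            "cross_field_bonus": 0.0}},
    {"id": "B", "signals": {"semantic_relevance": 0.8, "topic_alignment": 0.1,
                            "scientific_impact": 0.85, "h_index_in_field": 0.1,
                            "cross_field_bonus": 0.5}},
    {"id": "C", "signals": {"semantic_relevance": 0.3, "topic_alignment": 0.9,
                            "scientific_impact": 1.0, "h_index_in_field": 0.5,
                            "cross_field_bonus": 0.0}},
]

selected_ids = [p["id"] for p in top_n(papers, 2)]
```

Paper B is a cross-field paper with low topic alignment. Its bonus lifts it above paper C, which is on-topic and in an elite journal but only weakly relevant semantically.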

Diversity constraints

Pure top-N selection can produce homogeneous digests — all papers from the same journal, or all papers on the same sub-topic. The selection stage applies diversity constraints to ensure the digest is broad:

  • Journal diversity. No single journal accounts for more than 30% of the digest.
  • Sub-topic diversity. The digest must cover at least 2 sub-topics if the researcher has declared multiple sub-interests.
  • Preprint vs published balance. If preprints are available in the field, at least 10% of the digest comes from preprints.
  • Recency preference. Papers published in the last 14 days are preferred over older papers, all else equal.

The diversity constraints are implemented as a constrained optimisation: select the top N papers subject to the constraints above.
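One simple way to approximate such a selection is a greedy pass over the ranked list. The sketch below is a greedy stand-in, not the constrained optimiser itself, and it covers only the journal cap and the preprint minimum; sub-topic diversity and recency are left out for brevity. The field names `journal` and `is_preprint` are assumptions.

```python
import math

def select_with_diversity(ranked_papers, n, max_journal_share=0.30,
                          min_preprint_share=0.10):
    """Greedy selection from papers already sorted by composite score (best first)."""
    max_per_journal = max(1, math.floor(max_journal_share * n))
    min_preprints = math.ceil(min_preprint_share * n)
    chosen = []        # indices into ranked_papers
    per_journal = {}   # journal name -> number of papers chosen so far

    def try_add(index):
        journal = ranked_papers[index]["journal"]
        if index in chosen or per_journal.get(journal, 0) >= max_per_journal:
            return
        chosen.append(index)
        per_journal[journal] = per_journal.get(journal, 0) + 1

    # First reserve the preprint quota with the best-scoring preprints, if any exist
    for i, paper in enumerate(ranked_papers):
        if len(chosen) >= min_preprints:
            break
        if paper["is_preprint"]:
            try_add(i)

    # Then fill the remaining slots in score order, skipping journals at their cap
    for i in range(len(ranked_papers)):
        if len(chosen) >= n:
            break
        try_add(i)

    # Return the selection in its original score order
    return [ranked_papers[i] for i in sorted(chosen)]

ranked = [
    {"id": "p1", "journal": "Nature", "is_preprint": False},
    {"id": "p2", "journal": "Nature", "is_preprint": False},
    {"id": "p3", "journal": "Cell", "is_preprint": False},
    {"id": "p4", "journal": "Nature", "is_preprint": False},
    {"id": "p5", "journal": "Lancet", "is_preprint": False},
    {"id": "p6", "journal": "bioRxiv", "is_preprint": True},
    {"id": "p7", "journal": "BMJ", "is_preprint": False},
]
digest_ids = [p["id"] for p in select_with_diversity(ranked, n=5)]
```

For a five-paper digest, the 30% cap works out to one paper per journal and the 10% preprint minimum to one preprint, so the plain top five (three of them from Nature) becomes p1, p3, p5, p6, p7.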

The like button as a feedback signal

The like button on every paper card is the most important feedback signal. When a researcher likes a paper, that signal is fed back into the ranking model for future digests.

The feedback loop works as follows:

  1. Like events are aggregated per researcher over a rolling 30-day window.
  2. The aggregated likes are used to compute a "researcher preference vector" — an embedding that captures the topics, methods, and authors of papers the researcher has liked.
  3. The preference vector is added as a sixth signal in the ranking algorithm, with a small initial weight (0.05) that grows as more likes are accumulated.
  4. Papers with high similarity to the preference vector receive a ranking boost.

The like button is intentionally a tie-breaker, not a filter. The semantic relevance and topic alignment signals still drive the selection. Likes just nudge the algorithm toward the researcher's revealed taste.
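The loop can be sketched as follows, assuming the preference vector is the mean embedding of papers liked in the last 30 days. The growth schedule in `preference_weight` is purely hypothetical: only the starting weight of 0.05 comes from the description above, while the step size and ceiling are invented for illustration.

```python
from datetime import datetime, timedelta

def preference_vector(like_events, now, window_days=30):
    """Mean embedding of papers liked within the rolling window, or None."""
    cutoff = now - timedelta(days=window_days)
    recent = [e["embedding"] for e in like_events if e["liked_at"] >= cutoff]
    if not recent:
        return None  # cold start: no sixth signal yet
    return [sum(column) / len(recent) for column in zip(*recent)]

def preference_weight(n_likes, initial=0.05, step=0.005, ceiling=0.15):
    """Hypothetical schedule: start at 0.05 and grow linearly up to a ceiling."""
    return min(initial + step * n_likes, ceiling)

now = datetime(2024, 6, 30)
likes = [
    {"liked_at": datetime(2024, 6, 25), "embedding": [1.0, 0.0]},
    {"liked_at": datetime(2024, 6, 10), "embedding": [0.0, 1.0]},
    {"liked_at": datetime(2024, 4, 1), "embedding": [5.0, 5.0]},  # outside the window
]
vector = preference_vector(likes, now)
```

Only the two June likes fall inside the window, so the resulting vector is `[0.5, 0.5]`. The similarity between this vector and a candidate paper's embedding would then feed the sixth signal.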

This feedback loop is what makes the system adaptive. A static ranking algorithm cannot learn from individual researchers; the like button lets the system personalise within a stable multi-signal framework.

Performance characteristics

The system has been tuned for the following performance targets:

  • End-to-end weekly run time. Under 4 hours for the entire pipeline (collection, filtering, ranking, selection, summarisation) across all subscribers.
  • Ranking latency per paper. Under 10 milliseconds per paper per researcher. With 5,000 candidate papers and 4 projects per Premium subscriber, this is 20,000 scoring operations per subscriber, or at most 200 seconds (under 4 minutes) of scoring time per subscriber. Because each subscriber's ranking is independent, this work can be spread across machines.
  • Storage. Each subscriber's preference vector, project configuration, and like history are stored in a single row in the subscribers table. Total storage is roughly 10 KB per subscriber.

The architecture is designed to scale horizontally: each subscriber's ranking can be computed independently, and the candidate paper set can be cached and shared across subscribers.
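The scoring budget quoted above can be checked with a few lines of arithmetic. The function name is hypothetical, and the inputs are the upper bounds given in the list (5,000 candidates, 4 projects, 10 ms per score).

```python
def scoring_budget(candidates_per_project, projects, ms_per_score):
    """Return (scoring operations, worst-case seconds) for one subscriber."""
    operations = candidates_per_project * projects
    seconds = operations * ms_per_score / 1000
    return operations, seconds

operations, seconds = scoring_budget(5_000, 4, 10)
minutes = seconds / 60
```

This gives 20,000 operations and 200 seconds, a little over 3 minutes, for a single subscriber at the 10 ms bound. Wall-clock time for the whole subscriber base then depends on the number of subscribers and how many workers score them in parallel, neither of which is stated here.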

Limitations and honest trade-offs

The algorithm has known limitations:

  • Cold start. New subscribers with no likes receive generic rankings based only on the five signals. The first few digests may be less accurate than later ones, because the preference vector cannot contribute until likes have accumulated.
  • Field coverage. The algorithm depends on having curated journal lists per field. Fields with thin coverage in our journal database receive less accurate rankings.
  • Language. Currently optimised for English-language papers. Non-English papers are indexed but ranked less accurately due to the English-fine-tuned transformer model.
  • Concept drift. The meaning of a research topic changes over time. The algorithm is recalibrated quarterly but may lag behind rapid shifts in terminology.

These limitations are why we publish the algorithm details openly and invite feedback from researchers who use the system. The goal is continuous improvement, not a fixed black box.

Trying it

The free plan of The Academic Digest gives you 5 curated papers per week using the full multi-signal ranking algorithm. Set up a project, declare your research interests, and compare the rankings to your existing keyword alerts or RSS feeds. The differences in coverage — particularly the cross-field papers you would have missed — usually become obvious within two to three weeks.

For more detail on the multi-signal ranking concept, see How Multi-Signal Paper Ranking Beats Keyword Alerts. For a higher-level overview of the selection pipeline, see Inside Our Selection Engine.
