
How We Source 100,000+ Papers Every Week: OpenAlex, Crossref, and RSS

By The Academic Digest Team

Every Monday morning, The Academic Digest delivers 5 to 40 papers per subscriber, selected from a pool of 100,000+ candidates that arrive in our pipeline each week. The pipeline pulls from four primary data sources: OpenAlex, Crossref, direct RSS feeds from journals, and preprint servers (bioRxiv, medRxiv, arXiv).

This post explains how each source works, what kinds of papers each one surfaces, and how we ensure comprehensive coverage of 290+ peer-reviewed journals.

Why multiple sources

No single data source covers all of the scientific literature. The big indexes (Web of Science, Scopus) are paywalled and expensive to access at scale. The open indexes (OpenAlex, Crossref, Semantic Scholar) are free but have different coverage characteristics. RSS feeds from journals are direct but limited to journals that publish them.

The practical strategy for open research tooling in 2026 is to combine several sources and deduplicate the results. Each source has strengths and gaps, and the combination covers gaps that any single source leaves open.

OpenAlex

OpenAlex is an open index of the world's scholarly outputs, maintained by the non-profit OurResearch. It catalogues over 200 million works and updates daily. OpenAlex is our primary source for metadata — title, authors, abstract, journal, publication date, citations, references.

The strengths of OpenAlex:

  • Breadth. Covers papers from tens of thousands of journals across all disciplines.
  • Rich metadata. Includes abstracts for most papers (where available), full author lists with affiliations, references, citation counts, and concept tags (topical classifications).
  • Open API. Free to use with generous rate limits, well-documented, and stable.
  • Daily updates. New papers appear within hours of publication for most journals.

The gaps:

  • Coverage of very new papers. OpenAlex is fast but not instant. Some papers appear 1 to 3 days after publication.
  • Coverage of non-English papers. OpenAlex is comprehensive for English-language papers; non-English papers are indexed but with less rich metadata.
  • Coverage of preprints. OpenAlex includes preprints but with less complete metadata than peer-reviewed papers.

We pull new papers from OpenAlex daily, deduplicate against other sources, and store the metadata for ranking.
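To make the daily pull concrete, here is a minimal Python sketch under stated assumptions, not our production code. It builds a query against the public OpenAlex works endpoint for a single publication day and rebuilds a plain-text abstract from the abstract_inverted_index field, which OpenAlex returns in place of raw abstract text. The function names and the one-day window are illustrative assumptions.

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_daily_query(day: str, cursor: str = "*") -> str:
    """Build a works query for one publication day (YYYY-MM-DD).

    Uses cursor paging: the first request sends cursor=*, and later
    requests send the next_cursor value from the previous response.
    """
    params = {
        "filter": f"from_publication_date:{day},to_publication_date:{day}",
        "per-page": 200,  # largest page size the API accepts
        "cursor": cursor,
    }
    # Keep ':' ',' and '*' unescaped so the URL stays readable.
    return f"{OPENALEX_WORKS_URL}?{urlencode(params, safe=':,*')}"


def rebuild_abstract(inverted_index: Optional[dict]) -> str:
    """Turn {"word": [positions, ...]} back into running text."""
    if not inverted_index:
        return ""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


if __name__ == "__main__":
    print(build_daily_query("2026-01-05"))
    print(rebuild_abstract({"Sparse": [0], "coding": [1, 4], "beats": [2], "dense": [3]}))
```

Fetching the URL and following the cursor is left out on purpose; any HTTP client would do for that step.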

Crossref

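Crossref is the largest Digital Object Identifier (DOI) registration agency for scholarly publications. It is the authoritative source for the metadata of the DOIs it registers: the publisher deposits each record with Crossref, usually at the time of publication. It is not the only agency, though. DataCite, for example, also registers DOIs, so a DOI does not guarantee a Crossref record.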

The strengths of Crossref:

  • Authoritative metadata. When a paper has a Crossref-registered DOI, Crossref holds the canonical metadata: title, authors, journal, volume, issue, page numbers, publication date.
  • Near-real-time updates. New papers appear in Crossref within hours of publication, often faster than OpenAlex.
  • Comprehensive for DOI-registered papers. Crossref covers essentially every DOI-registered journal, including most peer-reviewed journals.

The gaps:

  • Coverage of preprints. Crossref covers preprints whose DOIs are registered with it (most of bioRxiv and medRxiv), but not arXiv preprints, whose DOIs are registered through DataCite.
  • Coverage of older papers. Crossref's metadata quality is best for papers published after 2000. Older papers have less complete metadata.

We use Crossref as the authoritative metadata source when a Crossref DOI is available. For papers without one (mostly older preprints), we fall back to OpenAlex or the RSS feed.
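The precedence rule can be sketched as a small merge function. This is an illustration under assumptions, not the actual pipeline code: the field names, the flat dictionary shape of each record, and the exact list of fields where Crossref wins are all hypothetical.

```python
from typing import Optional

# Hypothetical list of fields for which a Crossref value, when present,
# overrides values from the other sources.
CROSSREF_FIELDS = ("doi", "title", "authors", "journal", "volume",
                   "issue", "pages", "published")


def _present(value) -> bool:
    """Treat None, empty strings, and empty lists as missing."""
    return value not in (None, "", [])


def merge_metadata(crossref: Optional[dict],
                   openalex: Optional[dict],
                   rss: Optional[dict]) -> dict:
    merged = {}
    # Lowest priority first, so later sources overwrite earlier ones.
    for source in (rss, openalex):
        if source:
            merged.update({k: v for k, v in source.items() if _present(v)})
    # Crossref wins for bibliographic fields when it has a value.
    if crossref:
        for name in CROSSREF_FIELDS:
            if _present(crossref.get(name)):
                merged[name] = crossref[name]
    return merged


if __name__ == "__main__":
    cr = {"doi": "10.1000/x", "title": "Canonical Title"}
    oa = {"doi": "10.1000/x", "title": "canonical title", "abstract": "Text"}
    feed = {"title": "Canonical title (online first)", "link": "https://example.org/x"}
    print(merge_metadata(cr, oa, feed))
```

With no Crossref record, the same function returns the OpenAlex values layered over the RSS values, which matches the fallback described above.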

RSS feeds

Many peer-reviewed journals publish RSS feeds for their tables of contents. The feeds are typically updated within 24 hours of a new paper appearing online.

The strengths of RSS:

  • Direct from the publisher. The feed is generated by the journal's own publishing system, so metadata is accurate and timely.
  • Covers specific journals. RSS allows us to track specific journals precisely — we know exactly which journals we are following.

The gaps:

  • Coverage varies. Some journals publish RSS feeds; others do not. Coverage is better for open-access journals and large commercial publishers.
  • Metadata is minimal. RSS feeds typically include title, authors, and link. Abstracts are not always included.
  • No citation context. RSS feeds do not include references or citation counts.

We use RSS feeds as a complement to OpenAlex and Crossref. For journals we track specifically (Nature, Science, Cell, NEJM, PNAS, BMJ, Lancet, JAMA, and ~280 others), the RSS feed provides a real-time signal of new publications. We cross-reference with OpenAlex and Crossref to enrich the metadata.
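As a rough illustration of how little a feed usually carries, the sketch below parses a table-of-contents feed with the Python standard library. It assumes an RSS 2.0 feed with Dublin Core creator elements; real journal feeds vary (some use RSS 1.0/RDF or Atom), and the sample feed and function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; not taken from any real journal.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Journal: Table of Contents</title>
    <item>
      <title>A hypothetical paper</title>
      <link>https://example.org/articles/1</link>
      <dc:creator>Jane Doe</dc:creator>
      <dc:creator>John Roe</dc:creator>
      <description>Short abstract text.</description>
    </item>
    <item>
      <title>Another hypothetical paper</title>
      <link>https://example.org/articles/2</link>
    </item>
  </channel>
</rss>"""

DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def parse_feed(xml_text: str) -> list:
    """Extract title, link, authors, and abstract (if any) per item."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "authors": [c.text.strip()
                        for c in item.findall("dc:creator", DC_NS) if c.text],
            # Abstracts are often absent, so this may be None.
            "abstract": (item.findtext("description") or "").strip() or None,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_feed(SAMPLE_FEED):
        print(entry)
```

The second item has no authors and no abstract, which is the case where enrichment from OpenAlex or Crossref matters.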

Preprint servers

bioRxiv, medRxiv, and arXiv are three of the major preprint servers used by researchers in 2026. Each has its own API and RSS feeds.

The strengths of preprint servers:

  • Speed. Preprints appear weeks to months before the peer-reviewed publication. For fast-moving fields, this is the only way to see the latest work.
  • Open access. Preprints are free to read and free to index.
  • Comprehensive within their domains. bioRxiv covers the biological sciences; medRxiv covers clinical and epidemiological research; arXiv covers physics, mathematics, computer science, and quantitative biology, among other fields.

The gaps:

  • No peer review. Preprints have not been peer reviewed. The selection algorithm must surface preprints that are likely to be reliable, even without formal review.
  • Variable quality. Some preprints are early drafts that change substantially before journal publication. Others are near-final versions.

We index all three preprint servers in our pipeline. The selection algorithm scores preprints and peer-reviewed papers with the same framework: the signals (including semantic relevance, topic alignment, and scientific impact) are computed identically. The like button helps the system learn which preprints each researcher finds valuable.
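For readers who want to query the preprint servers themselves, the sketch below builds request URLs for the public bioRxiv/medRxiv details endpoint and the arXiv query API. The URL shapes reflect our reading of the public documentation and should be checked against it before use; the helper names and the default page size are assumptions for this example.

```python
from urllib.parse import urlencode


def biorxiv_details_url(server: str, start: str, end: str, cursor: int = 0) -> str:
    """URL for preprints posted between two dates (YYYY-MM-DD).

    `server` is "biorxiv" or "medrxiv"; `cursor` is the paging offset.
    """
    if server not in ("biorxiv", "medrxiv"):
        raise ValueError(f"unknown server: {server}")
    return f"https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}"


def arxiv_query_url(category: str, start: int = 0, max_results: int = 100) -> str:
    """URL for the newest submissions in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(biorxiv_details_url("medrxiv", "2026-01-05", "2026-01-11"))
    print(arxiv_query_url("q-bio.NC"))
```

The arXiv endpoint returns an Atom feed and the bioRxiv endpoint returns JSON, so each needs its own parser before the records can be merged with the other sources.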

Deduplication and the canonical paper

The same paper often appears in multiple sources — OpenAlex, Crossref, the journal's RSS feed, and sometimes the preprint server. We deduplicate using the DOI as the canonical identifier. For papers without DOIs (mostly older papers and some preprints), we deduplicate using a combination of title similarity, author list, and publication date.
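The matching logic can be sketched as follows. This is a simplified illustration, not the production matcher: the similarity threshold, the 30-day window, the surname heuristic, and the record layout are all assumptions chosen for the example.

```python
import re
import unicodedata
from datetime import date
from difflib import SequenceMatcher
from typing import Optional


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip resolver or 'doi:' prefixes."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def surnames(authors: list) -> set:
    """Crude heuristic: the last token of each name is the surname."""
    return {a.split()[-1].lower() for a in authors if a.split()}


def same_paper(a: dict, b: dict,
               title_threshold: float = 0.9,   # assumed value
               max_days_apart: int = 30) -> bool:  # assumed value
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_b:
        return doi_a == doi_b  # the DOI decides when both records have one
    ratio = SequenceMatcher(None, normalize_title(a["title"]),
                            normalize_title(b["title"])).ratio()
    if ratio < title_threshold:
        return False
    if not (surnames(a["authors"]) & surnames(b["authors"])):
        return False
    return abs((a["published"] - b["published"]).days) <= max_days_apart


if __name__ == "__main__":
    x = {"doi": None, "title": "Sparse Coding in the Visual Cortex: A Review",
         "authors": ["Jane Doe"], "published": date(2026, 1, 5)}
    y = {"doi": None, "title": "Sparse coding in the visual cortex - a review",
         "authors": ["J. Doe"], "published": date(2026, 1, 7)}
    print(same_paper(x, y))
```

Comparing every pair of records does not scale to 100,000+ papers a week, so a real system would first group candidates by a cheap key (for example the normalized title prefix) and only run the pairwise check inside each group.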

Each canonical paper in our system has:

  • A unique internal ID
  • The DOI (if available)
  • Title, authors, abstract, journal, publication date
  • References and citations (where available)
  • Source(s) it was indexed from (for auditability)
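One way to picture this record is as a small data class. The sketch below is illustrative only: the attribute names, the types, and the use of a random UUID as the internal ID are assumptions, not the actual schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CanonicalPaper:
    """Hypothetical shape of one deduplicated paper record."""
    title: str
    authors: list
    journal: Optional[str] = None
    published: Optional[date] = None
    abstract: Optional[str] = None
    doi: Optional[str] = None                      # None when no DOI exists
    references: list = field(default_factory=list)  # where available
    cited_by_count: Optional[int] = None            # where available
    sources: set = field(default_factory=set)       # e.g. {"crossref", "rss"}
    paper_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def record_source(self, name: str) -> None:
        """Note that this paper was seen in `name` (kept for auditability)."""
        self.sources.add(name)


if __name__ == "__main__":
    paper = CanonicalPaper(title="A hypothetical paper", authors=["Jane Doe"],
                           doi="10.1000/x")
    for source in ("crossref", "openalex", "crossref"):
        paper.record_source(source)
    print(paper.paper_id, sorted(paper.sources))
```

Storing sources as a set means a paper seen twice in the same source is still listed once, while the full list of distinct sources remains available for auditing.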

This canonical representation feeds the ranking algorithm. The five signals (semantic relevance, topic alignment, scientific impact, author h-index in field, cross-field discovery bonus) are computed once per paper per researcher, regardless of how many sources the paper appeared in.
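The "once per paper per researcher" property amounts to caching on the pair of IDs. The sketch below shows that idea only. It is not the ranking algorithm: the equal weights, the weighted-sum combination, and the compute_signals callback are placeholders invented for the example.

```python
# Signal names follow the list above; how they are combined is an assumption.
SIGNALS = ("semantic_relevance", "topic_alignment", "scientific_impact",
           "author_h_index_in_field", "cross_field_bonus")


class ScoreCache:
    """Compute a combined score at most once per (paper, researcher) pair."""

    def __init__(self, compute_signals, weights=None):
        # compute_signals(paper_id, researcher_id) -> {signal_name: float}
        self._compute = compute_signals
        # Placeholder: equal weights unless the caller supplies others.
        self._weights = weights or {name: 1.0 / len(SIGNALS) for name in SIGNALS}
        self._scores = {}
        self.computations = 0  # counts real computations, for inspection

    def score(self, paper_id: str, researcher_id: str) -> float:
        key = (paper_id, researcher_id)
        if key not in self._scores:
            signals = self._compute(paper_id, researcher_id)
            self.computations += 1
            self._scores[key] = sum(self._weights[n] * signals[n] for n in SIGNALS)
        return self._scores[key]


if __name__ == "__main__":
    def fake_signals(paper_id, researcher_id):
        return {name: 0.5 for name in SIGNALS}

    cache = ScoreCache(fake_signals)
    cache.score("paper-1", "researcher-1")
    cache.score("paper-1", "researcher-1")  # served from the cache
    print(cache.computations)
```

Because the key is the canonical paper ID rather than a source-specific record, a paper that arrived via three sources is still scored a single time for each researcher.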

The journal list

We maintain a curated list of 290+ journals across 35 research fields, from elite (Nature, Science, Cell, NEJM, PNAS) to mid-tier and high-quality regional journals. The list is maintained in data/journals.json and is publicly visible on the tracked journals page.

Adding a journal to the list is a manual process. We evaluate the journal on:

  • Scientific reputation and impact factor.
  • Editorial rigor (peer review process, retraction history).
  • Coverage in our existing pipeline (some journals are already indexed via OpenAlex or Crossref even without explicit addition).

If you would like to suggest a journal for inclusion, the contact page has a form for that.

What this means for you

The practical effect of the multi-source pipeline:

  • Comprehensive coverage. 100,000+ papers per week means most fields see 50 to 200 new papers per week. The selection algorithm picks the 5 to 40 most relevant per subscriber.
  • Fast preprint visibility. Preprints from bioRxiv, medRxiv, and arXiv are indexed daily. New preprints appear in the digest within a week.
  • Reliable metadata. Crossref is the canonical metadata source. OpenAlex enriches with abstracts and citations. RSS confirms the journal's own publication record.
  • Transparent sourcing. The journals we track are publicly listed. The selection algorithm is documented. The data sources are open.

The pipeline is not a black box. Every component — the sources, the deduplication, the canonical representation, the journal list, the ranking algorithm — is documented and reviewable. Researchers who use The Academic Digest can verify what is in their digest and how it got there.

For more detail on the selection pipeline, see the how-it-works page and Inside Our Selection Algorithm.
