Open Research Information on BigQuery

Sharing the load: Building an open research information collective

A growing, daily updated collection of publicly available datasets on Google BigQuery. Independent groups have come together to share the load of storage, preprocessing, and documentation, making open research information combinable and actionable at scale.

Featured Collections

Please also check the participating collections websites for more information and for many more available datasets.

Crossref

A major DOI registration agency for publisher-provided open metadata. Used in many downstream sources.

SUB Göttingen provides the most recent monthly dump and yearly snapshots. Sesame Open Science gives access to the latest public data dump.

OpenAIRE

A free and open resource that interlinks hundreds of millions of metadata records from over 100k data sources.

Sesame Open Science and InSySPo Campinas provide the full OpenAIRE Graph Dataset in different schemas.

OpenAlex

A free, open, and comprehensive bibliographic database and knowledge graph for scholarly research.

Many collections provide time-specfic snapshots of OpenAlex data, including CWTS, InSySPo Campinas, MultiObs and SUB Göttingen.

CWTS Leiden Ranking

An open edition of the CWTS Leiden Ranking for universities worldwide.

CWTS Leiden provides data underlying the Leiden Ranking Open Edition along with time-specific OpenAlex versions.

FAPESP Virtual Library

The referential information source for research supported by FAPESP from Brazil.

MultiObs provides the FAPESP Virtual Library dataset about research grants in science and technology funded by the São Paulo Research Foundation (FAPESP).

Get Started

Browse Collections – Explore the available datasets from our contributing projects
Contribute – Want to share your data? Check our contributing guide

First steps

Note

To query datasets, you need a Google Cloud project. You can get started with the BigQuery sandbox, which requires no billing account and includes 1 TB of free queries per month.

This example joins data from two different collections – article metadata associated with transformative agreements as provided by SUB Göttingen and a Crossref truth table from Sesame Open Science – to check how many articles from the DEAL Wiley agreement (2019–23) have funder metadata in Crossref.

-- Articles from the DEAL Wiley agreement (2019-23) with funder metadata in Crossref
SELECT cr_truth.has_funders, COUNT(DISTINCT jct_articles.doi) AS articles
FROM `subugoe-collaborative.openbib.jct_articles` AS jct_articles
LEFT JOIN sos-datasources.truthtables.crossref_truthtable_20260131 AS cr_truth
  ON lower(jct_articles.doi) = lower(cr_truth.doi)
WHERE jct_articles.esac_id = "wiley2019deal"
GROUP BY has_funders

library(bigrquery)
library(DBI)

# Replace `your-gcp-project` in the examples below with your own project ID.
my_con <- dbConnect(bigquery(), project = "your-gcp-project")

my_sql <- "
SELECT cr_truth.has_funders, COUNT(DISTINCT jct_articles.doi) AS articles
FROM `subugoe-collaborative.openbib.jct_articles` AS jct_articles
LEFT JOIN sos-datasources.truthtables.crossref_truthtable_20260131 AS cr_truth
  ON lower(jct_articles.doi) = lower(cr_truth.doi)
WHERE jct_articles.esac_id = 'wiley2019deal'
GROUP BY has_funders
"

my_df <- dbGetQuery(my_con, my_sql)

from google.cloud import bigquery

# Replace `your-gcp-project` in the examples below with your own project ID.
client = bigquery.Client(project="your-gcp-project")

sql = """
SELECT cr_truth.has_funders, COUNT(DISTINCT jct_articles.doi) AS articles
FROM `subugoe-collaborative.openbib.jct_articles` AS jct_articles
LEFT JOIN sos-datasources.truthtables.crossref_truthtable_20260131 AS cr_truth
  ON lower(jct_articles.doi) = lower(cr_truth.doi)
WHERE jct_articles.esac_id = 'wiley2019deal'
GROUP BY has_funders
"""

result = client.query(sql).to_dataframe()