Alternatives for what?
Public infrastructure alternatives to Google Big Query - an overview of features
There are pragmatic reasons why, as both data providers and users, we independently arrived at Google Big Query as a useful tool for sharing open data sets. Google solves a bunch of the hard problems, including authentication without the need for institutional affiliation, systems provisioning and a highly performant database system.
At the same time, Google Big Query is certainly not open scholarly infrastructure, nor is it fully equitable and accessible in all parts of the world, and Google is not an organisation many of us feel able to trust. Thus, reliance on Google is not a desirable long-term solution.
There are emerging alternatives both in the cloud and for local computing, and organisations exploring these alternatives or interested in doing so. With ORION-DBs, we hope to spur on these developments and their application for opening up use of scholarly data sets.
As a first contribution, in this blog we discuss some features, or lack thereof, of Google Big Query infrastructure, what they mean for how data are made available and how public infrastructure could strive to emulate or improve these features. Considering these features separately can help to ensure conscious and justified choices are made as to which to prioritise when thinking about local server or cloud deployment.
Having the underlying data available (stored) in publicly owned/governed infrastructure/data storage
This is not the case currently in the Google Big Query setup. However, none of the data sources are made available exclusively through ORION-DBs or any of the subsidiary providers. A condition for inclusion in ORION-DBs is that the data sets are openly available somewhere else with an open license. So at the level of the original data sources, there is no lock-in effect. For the harmonised format of the data sources (table structure after ingest) that’s different: an agreed-on archival format (e.g. Parquet), storage frequency and location could be of added value here.
Having the underlying data available (stored) in Europe or other non-US localities
Data in Google Big Query can be stored on (Google-owned) servers in Europe rather than the US (other locations are also available). A point of attention is that combining data sets is only possible if they are stored in the same locality. In Google Big Query, this can be solved by mirroring data sets across two or more localities (or hosting them separately in two or more localities). This is a barrier that an open alternative might be able to circumvent entirely.
Having the code of the infrastructure open source
Core to our goals is helping people build new kinds of analysis and applications. Investing in new platforms will only work if people trust them to remain accessible. This is why open source is one of the Principles of Open Scholarly Infrastructure, a requirement Google Big Query clearly does not meet. Open source doesn’t guarantee continued accessibility, but it helps provide insurance against risks. BigQuery is a major product for Google and is unlikely to disappear quickly, but pricing could increase or other restrictions could make it difficult to use. An alternative should both provide access and enable others to support that access if needed. Open source (and well documented) systems substantially mitigate the risk of infrastructure disappearing from underneath people.
Access to the infrastructure (for computing) with no or minimal gate-keeping authentication/access control
Access to Google Big Query requires a Google account, which is in itself a barrier (dependency on US tech, personal information sharing, etc.). However, beyond this, there is no specific authentication (institutional or otherwise) required with the provider(s) of the ORION databases, minimising friction and social barriers. Risks of misuse/overuse are mitigated by a) all data sets being openly available with no or minimal re-use restrictions, b) all compute costs being borne by the user (see below) and c) the infrastructure being robust enough to not break from excessive use.
Cost for storage (incl. pre-processing and ingest) kept separate from cost for compute
This removes administrative burden and risk for providers (keeping their costs contained and predictable), gives users agency over their usage (no externally imposed usage limits) and could be considered an interesting sustainability model.
Ability to combine data sources from different providers that all use the same infrastructure
This removes the necessity for any provider to host ‘all’ data sets, reducing costs for providers and increasing options and flexibility for both users and providers. It also makes it possible for different providers to offer the same data set with a different structure (e.g. flat or nested, fully relational or not), rather than enforcing one structure for all hosted databases. As an extension, users can add their local data sets to include in analysis (without publicly exposing them) and providers can also contribute with a single public data set.
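To make the idea of combining sources concrete, here is a minimal local sketch. It does not use Google Big Query: it uses SQLite from Python's standard library as a stand-in, with two attached in-memory databases playing the role of two providers and a temporary table playing the role of a user's private data. All table names, column names and DOIs are made up for illustration.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Set up two hypothetical 'provider' databases plus a private user table."""
    conn = sqlite3.connect(":memory:")

    # Each ATTACH creates a separate in-memory database, standing in for
    # a data set hosted by a different provider on shared infrastructure.
    conn.execute("ATTACH DATABASE ':memory:' AS provider_a")
    conn.execute("ATTACH DATABASE ':memory:' AS provider_b")

    # Provider A offers bibliographic records (sample data, not real DOIs).
    conn.execute("CREATE TABLE provider_a.works (doi TEXT PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO provider_a.works VALUES (?, ?)",
        [
            ("10.0000/example-a", "Work A"),
            ("10.0000/example-b", "Work B"),
            ("10.0000/example-c", "Work C"),
        ],
    )

    # Provider B offers open access status for some of the same works.
    conn.execute(
        "CREATE TABLE provider_b.open_access (doi TEXT PRIMARY KEY, oa_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO provider_b.open_access VALUES (?, ?)",
        [("10.0000/example-a", "gold"), ("10.0000/example-b", "closed")],
    )

    # The user's own list lives in a TEMP table, visible only to this connection.
    conn.execute("CREATE TEMP TABLE my_dois (doi TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO my_dois VALUES (?)",
        [("10.0000/example-a",), ("10.0000/example-c",)],
    )
    return conn


def oa_status_for_my_works(conn: sqlite3.Connection) -> list:
    """Join the private list against both providers in a single query."""
    query = """
        SELECT w.doi, w.title, oa.oa_status
        FROM temp.my_dois AS m
        JOIN provider_a.works AS w ON w.doi = m.doi
        LEFT JOIN provider_b.open_access AS oa ON oa.doi = m.doi
        ORDER BY w.doi
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in oa_status_for_my_works(build_demo_connection()):
        print(row)
```

The point of the sketch is only that neither provider needs to host the other's data, and the user's list never leaves their own session. A public infrastructure would need an equivalent mechanism at much larger scale.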
The infrastructure providing both access to the data and an analysis interface
This means that, in principle, end users do not have to install specific software to use the data for analysis. At the same time, for Google Big Query, integration with other platforms/software is well documented, extending usage options to, for example, Python and other languages and workflows.
All this is not to uncritically defend our current choice for Google Big Query, but to explain what we, at least, see as some of its separate characteristics that are relevant both for data providers and users. We recommend considering all elements in this list (and potentially others) when scoping the development of public infrastructure for sharing and facilitating usage of public scholarly metadata sources.