James Hare

Your own Wikidata Query Service, with no limits, in 2023

The Wikidata Query Service gives you programmatic access to Wikidata, the knowledge graph that underlies Wikipedia. The Wikidata project was started in 2012 to create shared data storage for the hundreds of language editions of Wikipedia, and for all the different Wikimedia projects like Wikisource and Wikimedia Commons. For example, to track the different language Wikipedia articles about the same subject, each of those articles previously needed to link back to all the others. Wikidata replaced this with one database of page links that the different language editions of Wikipedia refer back to. This was Wikidata’s first major undertaking; Wikidata has expanded since then to provide basic structured facts about just about any topic you can think of, including those that do not have a Wikipedia article in any language.

The Wikidata Query Service’s underlying data structure, accessed through a query language called SPARQL, allows you to ask far more complex questions than you could of a more common relational database. Wikidata describes over 100 million concepts using over 11,000 properties. Designing a MySQL or Postgres database that captures all these possible attributes for all these possible things, including the indexes needed to facilitate your desired access patterns, is simply not feasible. Graph database structures avoid this problem by expressing each record as a triple: a subject, a property, and an object. This avoids the need to design schemas altogether, and you can build retrieval indexes around this simpler design.
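To make the triple idea concrete, here is a minimal sketch of a toy in-memory triple store in Python. It is not how Blazegraph or QLever are implemented; the class and method names are made up for illustration. It only shows that once every fact is a (subject, property, object) triple, a few fixed indexes can answer lookups without any per-topic schema.

```python
from collections import defaultdict


class TripleStore:
    """Toy triple store (illustrative only, not a real database engine)."""

    def __init__(self):
        # Two nested indexes over the same triples, each serving a
        # different access pattern.
        self._spo = defaultdict(lambda: defaultdict(set))  # subject -> property -> objects
        self._pos = defaultdict(lambda: defaultdict(set))  # property -> object -> subjects

    def add(self, subject, prop, obj):
        """Record one fact in every index."""
        self._spo[subject][prop].add(obj)
        self._pos[prop][obj].add(subject)

    def objects(self, subject, prop):
        """Answer: what does this subject have for this property?"""
        return set(self._spo.get(subject, {}).get(prop, set()))

    def subjects(self, prop, obj):
        """Answer: which subjects have this value for this property?"""
        return set(self._pos.get(prop, {}).get(obj, set()))


store = TripleStore()
# Wikidata identifiers: Q42 is Douglas Adams, Q80 is Tim Berners-Lee,
# P31 is "instance of", Q5 is "human".
store.add("Q42", "P31", "Q5")
store.add("Q80", "P31", "Q5")

print(sorted(store.subjects("P31", "Q5")))
```

Adding a new kind of fact here means adding more triples, not altering a table definition, which is the property the paragraph above describes.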

However, while relational database scaling is a thoroughly solved problem, graph database scaling is not. The clean, tabular structure of a relational database more readily lends itself to what is called sharding, where you split a dataset across multiple machines. Even if you sharded a graph database, successfully traversing the knowledge graph for a given query requires building that graph structure in memory, so you are nonetheless limited by the RAM of your machine. Suffice it to say, graph databases require a lot of RAM, and for that reason and others they are not trivial to run.

It is in this context, and in the context of Blazegraph’s lack of new development, that the Wikimedia Foundation operates the Wikidata Query Service. The linked blog post explains in more detail, but to summarize, the Wikidata Query Service often times out on queries that are more complex or return very large result sets, as each query needs to finish execution and return within sixty seconds. This is the kind of limitation you encounter using an overloaded free service.

Adam Shorland’s blog posts on building your own Wikidata Query Service have helped me deploy the free and open source Wikidata Query Service, developed by Wikimedia Deutschland and the Wikimedia Foundation, on my own hardware. In this post I will describe how to set up Blazegraph, used by the Wikidata Query Service, and QLever, a newer database that improves on Blazegraph but is not yet a full replacement. I also describe how you can avoid doing this work yourself by using available Wikidata Query Service alternatives.

Setting up the (virtual) hardware

Before you proceed, you should keep in mind that you will need a very large amount of RAM, likely more than is in your laptop. Blazegraph in particular needs a machine with at least 128 GB of RAM, and at this stage in Wikidata’s growth I now recommend 256 GB. QLever may work with significantly less, as I believe it relies less heavily on RAM, but I have not tested this.

Adam’s Blazegraph guide uses high-memory virtual machines in Google Cloud. Personally I prefer to rent dedicated servers, as this gives me more compute power for my money, albeit at the cost of doing slightly more work to maintain the server. I consistently find Hetzner has the best prices, but you should use whichever provider works best for you. The hardware itself matters more than where it is hosted.

In addition to a minimum of 128 GB of RAM, you will need at least 2 TB of SSD storage. I recommend getting a server with two 1 TB SSDs that are then formatted in RAID 0. This will give you a total of 2 TB of storage, split across two devices, allowing you to read and write at about twice the speed. (However, if one of the two SSDs fails, you lose all the data on both – be mindful of that.) I find that Blazegraph is not really optimized for CPU usage, so any CPU that can address 128 GB of RAM is probably going to be good enough. All told, you will likely be spending at least a couple hundred dollars per month on a rented server, and more if you want redundancy.

In terms of software, you should be able to use any operating system so long as you have Docker installed. Follow the installation instructions for your operating system. If you are using Blazegraph, you will also need docker-compose, which, depending on your operating system and Docker version, is either installed separately or included with Docker.

Option A: Using Blazegraph

Blazegraph is used by the current production Wikidata Query Service. It is no longer under development and is generally fussy to work with, but this will give you all the features of the Query Service, including the item label service. Critically, the Blazegraph deployment includes an updater that will ensure your query service stays in sync with Wikidata’s recent changes. (While the Wikimedia Foundation has since developed a new updater, this updater relies on infrastructure that is not available to the public, so we are still using the old one.)

To make setup easier, I have uploaded a simplified version of my setup to GitHub. Clone the private-wikidata-query repository and follow the instructions in the README file. You have the option to use a pre-built Blazegraph database file or to build from a Wikidata-supplied Turtle (TTL)-formatted dump if none of the Blazegraph database files are recent enough. (A Blazegraph instance cannot be more than 90 days behind the present or else it won’t be able to receive updates anymore.)
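The 90-day rule can be checked with a little date arithmetic before you commit to downloading a database file. The sketch below is only an illustration: the function name is hypothetical, and treating a lag of exactly 90 days as still acceptable is an assumption, so leave yourself a margin in practice.

```python
from datetime import date, timedelta

# Maximum lag stated above; whether exactly 90 days still works is an assumption.
MAX_LAG = timedelta(days=90)


def can_catch_up(data_date, today):
    """Return True if data from data_date is at most 90 days behind today."""
    return today - data_date <= MAX_LAG


# Hypothetical example dates, not real database file dates.
print(can_catch_up(date(2023, 1, 1), date(2023, 3, 1)))
print(can_catch_up(date(2023, 1, 1), date(2023, 6, 1)))
```

If the check fails for every available pre-built file, building from a fresh TTL dump is the remaining option.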

Assuming you are running this on your own machine, your query service’s frontend interface will be accessible at http://localhost:8099, with the SPARQL endpoint accessible at http://localhost:8099/proxy/wdqs/bigdata/namespace/wdq/sparql.
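Once the service is up, you can query the endpoint from a script as well as from the frontend. The following is a minimal sketch using only Python's standard library, assuming the endpoint URL above and the standard SPARQL-over-HTTP convention of passing the query in a `query` parameter. The function names, the sample query, and the User-Agent string are my own choices for illustration; adjust them to your setup.

```python
import json
import urllib.parse
import urllib.request

# Endpoint from the paragraph above; change the host if not running locally.
ENDPOINT = "http://localhost:8099/proxy/wdqs/bigdata/namespace/wdq/sparql"

# Sample query: five items that are an instance of (P31) house cat (Q146).
# Prefixes are declared explicitly so the query does not depend on defaults.
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for SPARQL results as JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "private-wdqs-example/0.1",  # hypothetical client name
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query, endpoint=ENDPOINT, timeout=None):
    """Send the query and return the list of result bindings.

    timeout=None means the client waits as long as the server needs.
    """
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


# With the service running, you could then do:
#     for row in run_query(QUERY):
#         print(row["item"]["value"])
```

Leaving the client-side timeout unset is deliberate here, since the point of running your own instance is to let long queries finish.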

Of course, just because you can run your own query service does not mean you should have to. To support the needs of Wikidata bot developers like myself and application development in the Wikidata ecosystem generally, I am building a graph query service called Orb Open Graph. At the moment it only includes data from Wikidata, but in the future it will include other datasets as well, building a linked data commons that can support and enhance apps like Scholia.

If you are interested, sign up for beta access. This will grant you access to a copy of the Wikidata Query Service with no query timeout, similar to the one documented in this post.

Option B: Using QLever

QLever is a newer SPARQL query server, actively under development. The process for building a dataset is significantly more optimized, but a dataset once built cannot be updated; it needs to be rebuilt. The developers are working on changing this, but as of writing this is still the case. This is the main reason I do not recommend it yet as a drop-in replacement for Blazegraph. I am also not sure that it has the item label service that Blazegraph has or other Wikidata-specific features.

QLever is operated using a utility called qlever-control. Download a copy of the repository:

git clone https://github.com/ad-freiburg/qlever-control

cd qlever-control

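From there, you will use the qlever script to build Wikidata. The first command begins with a period followed by a space, which sources the script into your current shell; the remaining commands are run without it. Run each of these commands in order: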

. qlever wikidata

qlever get-data

qlever index

qlever start

After you complete these steps you should have QLever running with Wikidata loaded. You can then run a test query:

qlever example-query

This should show you how to make subsequent queries to the service.

You can also use this publicly hosted QLever instance to run queries without setting up your own database.