Deploying Wikidata to different graph databases and what works best: Blazegraph, QLever, QEndpoint, Amazon Neptune
Wikidata is a sister project of Wikipedia that develops a crowdsourced graph database of facts about the world. To ask questions of Wikidata, you write queries in a query language called SPARQL and submit them to the Wikidata Query Service. This query endpoint has become a vital resource for those operating bots to maintain Wikidata. I previously used the query service to let my bots determine what data had already been included in Wikidata, avoiding unnecessary network traffic that slows the bot down.
As Wikidata has grown, largely as a result of mass imports of journal article metadata, queries have become increasingly likely to time out. No amount of query optimization is enough, as the endpoint can time out simply because a large result set takes over one minute to transmit. This immediately caused an issue for my bots; “give me every Wikidata item with a digital object identifier” is not a technically complex query, but it is one with tens of millions of results, which take a non-negligible amount of time to transmit over the Internet. Blazegraph does not paginate results, meaning you need to be able to download the entire result set within the one minute allotted by the proxy placed between you and Blazegraph. Getting this data from the Wikidata Query Service would not work for my bot.
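For concreteness, the query in question looks something like this (P356 is Wikidata’s DOI property, and the `wdt:` prefix is predeclared by the query service):

```sparql
# Every item with a DOI. Trivial to write and to plan, but the
# result set is tens of millions of rows.
SELECT ?item ?doi WHERE {
  ?item wdt:P356 ?doi .
}
```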
In this case it was a simple enough problem to work around. I knew what data I needed and what shape it needed to be in, so I could generate a relational database in MySQL from a Wikidata database dump. But would I have to do this kind of work for each new question I wanted to answer? Could I find new uses of the query service if I overcame this scaling challenge?
Figuring it would be as simple as deploying my own instance of Blazegraph, loading the Wikidata TTL dump, and setting the query timeout limit higher, I set out to build my own Wikidata Query Service. It took more work than I initially foresaw, but I did eventually succeed. From there, I started to look for alternatives. I describe in more detail below, but to summarize, Blazegraph is still the best option if you want to use existing tooling to sync your database with Wikidata’s recent changes stream, but QLever looks to be a promising alternative if you are fine with a one-time dataset. That said, this is an ongoing project, and I am interested in recommendations.
Blazegraph #
Developed by Systap under a free software license, Blazegraph promises high performance at high scale. To its credit, it has mostly kept up with Wikidata’s growth over the last decade. However, it has not done so with the help of Systap, which was acquired by Amazon at some point after its product had become ingrained in Wikidata’s infrastructure.
As Blazegraph is mostly abandonware at this point, it is basically a black box that you reboot when it falls over and rebuild when it fails completely. (I have had to do one such rebuild.) Loads from scratch can take several days; originally these loads hit timeouts, but the time limits were simply removed, which solved the issue only on the surface. As mentioned before, queries can take exceptionally long to execute. Certain queries with large intermediate result sets produce out-of-memory errors and don’t run at all, even on generously resourced workstations.
However, for all its flaws, it works. You can load a dataset and you can connect it to the Wikimedia-developed updater that hooks Blazegraph into Wikidata’s recent changes feed and allows anyone to run their own query service. This is my solution to having longer running queries in the short term, but I would prefer to use a database under active development.
If you would like to try this yourself, a resource I still consult regularly is Adam Shorland’s 2019 blog post, Your own Wikidata Query Service with no limits. His approach relies on setting up virtual machines in Google Cloud, while I deploy to either rented dedicated servers or my own hardware.
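To sketch the shape of that process: the script names below come from the wikidata-query-rdf service distribution as I remember using it, so treat exact names and flags as approximate rather than copy-paste ready.

```shell
# Fetch the full Wikidata TTL dump (on the order of 100 GB compressed).
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

# "Munge" the dump into the flattened form the query service expects.
./munge.sh -f latest-all.ttl.gz -d munged/

# Start Blazegraph, then load the munged chunks; this is the part
# that can take several days.
./runBlazegraph.sh &
./loadData.sh -n wdq -d munged/

# Finally, keep the database in sync with the recent changes feed.
./runUpdate.sh -n wdq
```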
Amazon Neptune #
Amazon Neptune is, as far as I can tell, mostly the same as Blazegraph. However, it is very expensive. To run a Neptune instance with enough resources to support Wikidata will set you back thousands of dollars per month. This may be an attractive option if you need a graph database server that simply works and do not have the resources in-house to make it happen. I otherwise don’t really recommend it.
QLever #
Developed by the University of Freiburg in Germany, QLever claims to support graphs with up to 100 billion triples on standard hardware. (Wikidata is currently at around 14 billion.) It also claims to better support the kinds of queries that tend to OOM on Blazegraph. A recent exchange I had on Telegram supports this; a query I attempted to run in Blazegraph resulted in an error, while someone with access to QLever was able to produce results.
I had to try out QLever after seeing this. After all the fumbling involved in setting up Blazegraph, QLever is refreshingly straightforward. It downloads a dump of Wikidata for you, processes it, and loads it. It simply gets the job done. The only problem is that once you load a dataset, you cannot update it; you can only re-load it from scratch. This precludes it as a drop-in replacement for wikis running Blazegraph, or for maintenance use cases requiring real-time data, but it seems promising for use cases that can tolerate week-old data.
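The whole pipeline is driven by the qlever control script; here is a sketch of the steps as I understand them (command names are from the qlever-control project and may change between releases):

```shell
# Install the control script, then let it fetch, index, and serve
# the prebuilt Wikidata configuration.
pip install qlever

qlever setup-config wikidata  # fetch the prebuilt Wikidata config
qlever get-data               # download the dumps
qlever index                  # build the index; the long step
qlever start                  # serve a SPARQL endpoint
```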
QEndpoint #
QEndpoint is another emerging Blazegraph competitor, developed by The QA Company, a graph database and Wikibase consultancy in France. Like QLever, it is designed to be easy to set up with Wikidata. However, it appears they have only succeeded with the “truthy” subset of Wikidata. This subset, which excludes historical, outdated, and deprecated statements, is good enough for most use cases, but my goal is to cater to all use cases of the Wikidata Query Service.
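Their documented truthy setup is essentially a single container; the image name below is what I recall from their README, so verify it against their current documentation before relying on it.

```shell
# Launch QEndpoint preconfigured for the truthy Wikidata subset;
# the container downloads and indexes the dump on first start.
docker run -p 1234:1234 --name qendpoint qacompany/qendpoint-wikidata
```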
If they hadn’t figured out the full Wikidata set, could I? The answer is no. If you try to load the entire thing, the database runs out of memory and the import crashes. If you try to break it up into smaller pieces, each load replaces the previous one rather than adding to it, so you end up with only the latest segment instead of an incrementally imported whole. If you run the full dataset through the “munge” process used in Blazegraph imports, which produces a slightly less expressive but more efficient dataset, it still crashes. The conclusion I have come to is that I cannot get QEndpoint to work for Wikidata, at least not with the full dataset.
Where this brings me today #
I believe that Wikidata’s improvement is slowed by the limits of its infrastructure. If the information discovery service that makes Wikidata useful cannot keep up with Wikidata’s growth, it threatens Wikidata’s viability. This is especially the case if this discovery service is used to facilitate Wikidata’s own quality assurance.
Wikidata’s ecosystem of volunteer developers requires useful resources for their work. Especially when working with large graphs like Wikidata, there are certain technical requirements that are beyond most people’s means, and even beyond what the free-of-charge Wikimedia Cloud Services offers. Therefore, once you are in a position to scale up your bot or service, you run into these limits and momentum comes to a grinding halt. When you are putting in this effort as an unpaid volunteer, it can be especially demoralizing.
In 2022 I soft-launched the Orb Open Graph, a data service built on open datasets like Wikidata. For now, it is simply a private version of the Wikidata Query Service with ten-minute query timeouts. Over time, I would like to incorporate more datasets into this graph, like OpenStreetMap. I am also looking into launching new Wikibases of my own that would expand on this dataset. In the long term, I would like to see Orb Open Graph become an open data backend for web apps such as Scholia. If you have long-running queries that time out or are at risk of doing so, feel free to request access to the closed beta.
Why SPARQL #
You may have noticed that I focused exclusively on databases that are queried with SPARQL. There are other graph query languages, such as GraphQL. So why SPARQL? The answer is simple: that is what Wikidata’s tool and bot ecosystem uses. As we already have plenty of volunteer-developed software that uses SPARQL, I would like to offer a service that maintains that kind of API compatibility.
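That compatibility is concrete: the SPARQL 1.1 Protocol standardizes the HTTP request shape, so pointing existing tooling at a different backend is mostly a matter of swapping the endpoint URL. A minimal request, using the public Wikidata endpoint as the example:

```shell
# The same request works against any conformant endpoint; only the
# URL changes when moving from Blazegraph to QLever or elsewhere.
curl -G 'https://query.wikidata.org/sparql' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1' \
  -H 'Accept: application/sparql-results+json'
```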
But what if we were not limited to SPARQL? What other query languages could emerge as alternatives? What derived datasets could take off as alternatives to current uses of the Wikidata Query Service?