
Falsehoods Programmers believe about DOIs

Joe Wass October 30, 2024

DOIs, or Digital Object Identifiers, are everywhere, for a given value of ‘everywhere’. They are the identifiers used to identify and link research outputs, and a lot more besides.

Humans are good at spotting patterns, and with something as ubiquitous as DOIs, there are plenty of patterns to spot. However, with hundreds of millions of DOIs and decades of history, it pays not to make generalisations.

These all cropped up in my 10 years at Crossref, either observed in the scholarly community using DOIs or encountered when writing software to find and handle DOIs.

This post is aimed at developers working with DOIs for the first time. The DOI Handbook is an excellent detailed document that anyone working with DOIs should read.

If you’re not familiar with the “Falsehoods developers believe about X” genre, check out the Awesome Falsehood list.

I think every developer working with DOIs should know the headlines. As to the details… it depends how curious you are.

All of these are wrong:

1. “A DOI always identifies a scholarly work”

This was true once, when Crossref was the only DOI registration agency. A lot has happened since, and there are now 12 DOI Registration Agencies. Most DOIs are still scholarly, but many are not.

For example, HAND assigns DOIs for “notable talent – in real & virtual worlds”.

Likewise, EIDR, the Entertainment Identifier Registry, has DOIs for films like Megalopolis https://doi.org/10.5240/3A34-A746-2EA5-56BF-56A2-5 . Interestingly, EIDR also creates shortDOIs. Here’s the shortDOI for the same entity: http://doi.org/10/gwg8ww .

Geoffrey Bilder’s blog post from 2013 goes into quite a lot more detail on the subject of what DOIs do and don’t identify.

2. “You can learn the owner / publisher from the DOI prefix”

Each registration agency specialises in a particular sector, whether that’s scholarly works, datasets, actors or films. The concept of a ‘publisher’ varies between these fields, and each has a different model for ownership.

In the case of Crossref, each member is assigned one or more DOI prefixes, which they use to create DOIs (and later update them). With tens of thousands of members, there is a wide range of organisational structures. There isn’t a one-to-one mapping between prefixes and publishers, especially for the more established publishers which comprise the bulk of DOIs.

More often than not you’re interested in the journal where a scholarly work was published, rather than the publisher. Both are considered metadata, and should be retrieved from the relevant API rather than guessed from the DOI structure.

3. “The owner of the DOI is still the same as when it was created”

Even if you could tell the publisher by looking at the DOI prefix, publishers come and go, transfer journals, and acquire each other. A DOI may end up owned by a different entity than the one that registered it. Elsevier has clocked up 32 prefixes at the time of writing. Some of them once belonged to a different organisation.

The metadata and ownership can change, but DOIs don’t.

4. “You can tell X from only looking at the DOI”

There is a history of people trying to embed information into identifiers.

A long time ago people tried to embed SICIs in DOIs. Here’s a DOI that has a ‘#’ character on the end:

10.1002/(sici)1099-050x(199823/24)37:3/4<197::aid-hrm2>3.0.co;2-#

The genius is that you can encode ISSN, date, location number, etc in the identifier. The downside is huge. The trailing ‘#’ could be erroneously removed (in URL handling it’s a fragment identifier that’s not sent to the server). New SICI DOIs aren’t allowed any more, but old ones still exist.

Quite aside from the risk of errors, what if some characteristics of the metadata ever changed? The whole point of separating out metadata from identifiers is that metadata can change (and, if necessary, be versioned).

The consensus is: create opaque identifiers. Don’t embed any information in the DOI. DataCite recommends using a random string of 6-10 characters from a limited set.

The corollary is: treat identifiers as opaque. Don’t try to guess what the structure of the DOI means. Even if one publisher embeds data a certain way, they might stop. And even if they keep doing it, the other publishers probably don’t do it that way.

5. “You can tell the owner from the DOI / Handle metadata”

The DOI service runs on top of the Handle service. Each DOI is a Handle, and each Handle record has a set of metadata associated with it. If we look at the Handle API for a given DOI:

https://hdl.handle.net/api/handles/10.2139/ssrn.2813111

{
  "responseCode": 1,
  "handle": "10.2139/ssrn.2813111",
  "values": [
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "https://www.ssrn.com/abstract=2813111"
      },
      "ttl": 86400,
      "timestamp": "2017-06-28T15:19:44Z"
    },
    // ...
    {
      "index": 100,
      "type": "HS_ADMIN",
      "data": {
        "format": "admin",
        "value": {
          "handle": "0.na/10.2139",
          "index": 200,
          "permissions": "111111110010"
        }
      },
      "ttl": 86400,
      "timestamp": "2017-06-28T15:19:44Z"
    }
  ]
}

It’s tempting to look at the HS_ADMIN value and conclude that it belongs to whoever owns 10.2139. That might even be true. But this value is used solely for access control, which is subtly different to ownership.

There are a number of ways a DOI RA can model access control, and looking at a DataCite record we don’t see the same fine-grained HS_ADMIN record:

https://hdl.handle.net/api/handles/10.5438/axvs-my78

{
  "responseCode": 1,
  "handle": "10.5438/axvs-my78",
  "values": [
    {
      "index": 100,
      "type": "HS_ADMIN",
      "data": {
        "format": "admin",
        "value": {
          "handle": "10.admin/codata",
          "index": 300,
          "permissions": "111111111111"
        }
      },
      "ttl": 86400,
      "timestamp": "2023-11-09T10:21:03Z"
    },
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "https://datacite.org/blog/datacite-dois-for-more-than-just-data/"
      },
      "ttl": 86400,
      "timestamp": "2023-11-09T10:21:03Z"
    }
  ]
}

The lesson is: Allow the registration agency to encapsulate DOI ownership information. Don’t rely on how they represent access control with Handle. If you really want to know the owner, get your data straight from the RA’s API.

6. “DOI URLs always start with ‘https://doi.org’”

Over the years there have been various official resolvers:

Let alone fallbacks including Crossref’s: https://dx.crossref.org/. This came in handy some years back, when the doi.org domain registration lapsed, and Crossref’s advice was to temporarily use the Crossref resolver.

All of the above are valid DOI proxy servers, and can be used to resolve DOIs. The DOI display guidelines nudge toward https://doi.org/ but there are still plenty expressed against previous formats. The old ones will all continue to work.

Some years back I was analyzing the referrer logs for doi.org to see where links came from. I realised that when Wikipedia switched from HTTP to HTTPS we would be unable to see DOI referrer traffic. So I got involved in the switchover of how DOIs are rendered in Wikipedia (more detail in the blog post). This was one case where being aware of which scheme was used for the DOI resolver made a big difference.

An obscure detail perhaps, but obscure details make for the weirdest bugs. Be sure not to encode incorrect assumptions into your code!

If you’re rendering a DOI in a web page, then use https://doi.org/. Simple.

But if you’re looking for a DOI in a given input, be aware that it might be expressed many different ways!
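
As a minimal sketch of what that can look like, the function below strips a leading resolver from an input string. The host list is an assumption drawn from the resolvers named in this post and is not exhaustive, and the `doi:` label is a further assumption about how DOIs get written in the wild. The name `strip_resolver` is made up for illustration.

```python
import re

# Assumed, non-exhaustive list of resolver hosts (the ones named in this post),
# plus a "doi:" label, which is an extra assumption about real-world inputs.
_RESOLVER = re.compile(
    r"^(?:https?://(?:dx\.doi\.org|doi\.org|doi\.crossref\.org|dx\.crossref\.org)/"
    r"|doi:\s*)",
    re.IGNORECASE,
)


def strip_resolver(text: str) -> str:
    """Remove a leading resolver URL or 'doi:' label, if one is present."""
    return _RESOLVER.sub("", text.strip(), count=1)
```

Anything that doesn't match is returned unchanged, so a bare DOI passes straight through. The result is still not safe to compare directly, for the reasons in the next two points.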

7. “You can compare DOI strings for equality”

Even if you reduce a URL to its logical form, e.g. https://doi.org/10.1016/j.chroma.2008.01.017 to 10.1016/j.chroma.2008.01.017, you still have to contend with the fact that DOIs must be compared case-insensitively. 10.1016/j.chroma.2008.01.017 is considered equal to 10.1016/J.CHROMA.2008.01.017.

It’s common to lower-case DOIs when storing and comparing them.
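
A small sketch of a comparison key follows. It folds only the ASCII letters A-Z, on the assumption (my reading of the DOI Handbook, worth checking) that case-insensitivity applies to ASCII characters only. If that assumption doesn't matter for your data, a plain `str.lower()` does the same job. The names `doi_key` and `same_doi` are illustrative.

```python
import string

# Map only ASCII upper-case letters to lower case; every other character
# (digits, punctuation, non-ASCII letters) is left untouched.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def doi_key(doi: str) -> str:
    """Return a form of the DOI suitable for storing and comparing."""
    return doi.translate(_ASCII_FOLD)


def same_doi(a: str, b: str) -> bool:
    """Compare two logical DOIs (not URLs) case-insensitively."""
    return doi_key(a) == doi_key(b)
```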

8. “You can compare DOI URLs for equality”

After the last three points, it should be no surprise that you can’t simply compare two DOI URLs. The following things might be different:

  • scheme (http: or https:)
  • resolver (dx.doi.org, doi.org, doi.crossref.org)
  • case (10.1016/j.chroma.2008.01.017 vs 10.1016/J.CHROMA.2008.01.017)

But there’s one more complication. DOIs may contain any printable Unicode character, which means we have to encode the non-URL-safe ones. The DOI handbook has recommended encodings. But there are mandatory and non-mandatory ones.

Mandatory: %, ", #, /, space

Recommended: <, >, {, }, ^, [, ], |, \, +, `

This means there is no canonical way of expressing a DOI URL in a way that can be compared to another. Parse the URL to a logical DOI first.
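
One way to do that parsing, sketched with the Python standard library: take the URL path, undo the percent-encoding, and lower-case the result. The function name `doi_from_url` is made up, and the sketch assumes the whole path is the DOI.

```python
from urllib.parse import unquote, urlsplit


def doi_from_url(url: str) -> str:
    """Reduce a DOI URL to a lower-cased logical DOI for comparison."""
    # urlsplit drops the scheme and resolver host. It also drops any query or
    # fragment, so an unencoded '#' or '?' in the DOI is lost at this step.
    path = urlsplit(url).path
    # Undo percent-encoding, whichever characters the author chose to encode.
    return unquote(path).lstrip("/").lower()
```

With this, the SICI DOI from point 4 comes out the same whether or not the `<` and `>` were encoded, as long as the trailing `#` was sent as `%23`.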

Oh, and if you’re using Java you shouldn’t be comparing URLs anyway.

The Handle record has a resource URL, to which you are redirected when you click on a DOI. But this isn’t necessarily where you end up. To take an example from Elsevier:

curl -L -v https://doi.org/10.1016/j.chroma.2008.01.017
> GET /10.1016/j.chroma.2008.01.017 HTTP/2
< location: https://linkinghub.elsevier.com/retrieve/pii/S0021967308000605

This ends up with some HTML and JS that does a further redirect in the browser:

function autoRedirectToURL() {
  var url =
    "/retrieve/" +
    document.getElementById("resultName").value +
    "?Redirect=" +
    document.getElementById("redirectURL").value +
    "&key=" +
    document.getElementById("key").value;
  window.location = url;
}
<input
  type="hidden"
  name="redirectURL"
  value="https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0021967308000605%3Fvia%253Dihub"
  id="redirectURL"
/>

I wrote up some findings a few years ago in this Crossref blog post about URL redirection. Things were complicated then, but the landscape has only got more hostile to non-human web users.

Don’t assume that the URL, or even domain name, found in the Handle record is where you will end up. And don’t assume that your bot can follow a DOI all the way.

10. “Each resource has only one DOI”

It’s possible to register multiple DOIs to the same resource URL. This could happen within a registration agency, or two registration agencies could each register a DOI for the same URL. There aren’t any checks for uniqueness at the DOI level.

It’s also possible to alias DOIs to each other. This is a feature of the underlying Handle system, and it uses the HS_ALIAS value. A look at the Handle API:

https://hdl.handle.net/api/handles/10.5479/si.0081024X.43

{
  "responseCode": 1,
  "handle": "10.5479/si.0081024X.43",
  "values": [
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "http://si-pddr.si.edu/dspace/handle/10088/7018"
      },
      "ttl": 86400,
      "timestamp": "2011-03-16T16:33:59Z"
    },
    // ...
    {
      "index": 1970,
      "type": "HS_ALIAS",
      "data": { "format": "string", "value": "10.5479/si.0081024X.43.1" },
      "ttl": 86400,
      "timestamp": "2013-11-05T19:01:39Z"
    }
  ]
}

This is aliased to 10.5479/si.0081024X.43.1, and when you visit that original DOI you will be redirected to the resource URL of the aliased DOI:

https://hdl.handle.net/api/handles/10.5479/si.0081024X.43.1

{
  "responseCode": 1,
  "handle": "10.5479/si.0081024X.43.1",
  "values": [
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "https://repository.si.edu/handle/10088/7018"
      },
      "ttl": 86400,
      "timestamp": "2014-07-15T19:03:21Z"
    }
    // ...
  ]
}
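
If you already have Handle API records like the two above, following the alias can be sketched as below. This works on records that have been fetched and parsed into dicts already; the `records` mapping, the hop limit and the function names are all assumptions made for the example, and the lookup is case-sensitive for simplicity.

```python
from typing import Optional


def alias_target(record: dict) -> Optional[str]:
    """Return the HS_ALIAS target of a Handle API record, or None."""
    for value in record.get("values", []):
        if value.get("type") == "HS_ALIAS":
            return value["data"]["value"]
    return None


def follow_aliases(doi: str, records: dict, max_hops: int = 5) -> str:
    """Follow HS_ALIAS links through already-fetched records.

    `records` maps a DOI to its parsed Handle API response. The hop limit
    guards against alias loops; 5 is an arbitrary choice.
    """
    for _ in range(max_hops):
        target = alias_target(records.get(doi, {}))
        if target is None:
            return doi
        doi = target
    raise ValueError("too many alias hops")
```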

11. “Each DOI has one resource”

Crossref’s Multiple Resolution service allows multiple publishers to maintain landing pages for the same DOI.

An example is https://doi.org/10.1049/cp.2018.1305 which has resources both for the IET and IEEE.

If you look in the Handle data you will see how this data is represented:

https://hdl.handle.net/api/handles/10.1049/cp.2018.1305

{
  "responseCode": 1,
  "handle": "10.1049/cp.2018.1305",
  "values": [
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "https://digital-library.theiet.org/content/conferences/10.1049/cp.2018.1305"
      },
      "ttl": 86400,
      "timestamp": "2019-02-01T20:14:34Z"
    },
    // ...
    {
      "index": 1000,
      "type": "10320/loc",
      "data": {
        "format": "string",
        "value": "<locations chooseby=\"locatt,country,weighted\"><location id=\"1\" cr_type=\"MR-LIST\" href=\"http://mr.crossref.org/iPage?doi=10.1049%2Fcp.2018.1305\" weight=\"1\" /><location id=\"2\" cr_src=\"ieee_mr\" label=\"Xplore\" cr_type=\"MR-LIST\" href=\"https://ieeexplore.ieee.org/document/8651208\" weight=\"0\" /></locations>"
      },
      "ttl": 86400,
      "timestamp": "2019-02-26T04:08:22Z"
    }
  ]
}
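
The 10320/loc value in that record is a small XML document, so the extra locations can be pulled out with an XML parser. A rough sketch, assuming the record has already been parsed from JSON into a dict (the function name is made up):

```python
import xml.etree.ElementTree as ET


def location_hrefs(record: dict) -> list[str]:
    """List the href of every <location> in a record's 10320/loc values."""
    hrefs = []
    for value in record.get("values", []):
        if value.get("type") != "10320/loc":
            continue
        root = ET.fromstring(value["data"]["value"])
        # Some <location> elements carry an href_template instead of an href;
        # those are skipped here.
        hrefs.extend(
            loc.get("href") for loc in root.iter("location") if loc.get("href")
        )
    return hrefs
```

This only reads what is in the record. It makes no attempt to reproduce how the resolver chooses between locations.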

Don’t bake any assumptions into your code that a given resource URL has only one DOI, or that a given DOI has only one resource URL.

12. “The prefix is made of n digits”

The first hit from Google says that the number after the “10.” in a DOI prefix has “four or more digits”. But there are prefixes as short as 10.11, and there are extant prefixes up to 10.80000.

Adding these limits introduces fragility to code. It’s best not to.

13. “DOIs conform to X regex”

This blog post from Crossref nearly 10 years ago suggests a regular expression that matches nearly all Crossref DOIs, at the time.

But even if you did match a DOI-looking-string, there’s no guarantee that it really is a DOI. You have to check for existence against the DOI.org resolver to be sure.

My advice is, it’s better to be more liberal in finding them, then check.
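
As an illustration of the liberal end of that, here is a deliberately loose candidate finder. The pattern is my own sketch, not the Crossref one, and it places no limit on the number of prefix digits. Trimming trailing `.,;` is a heuristic that could in principle damage a DOI that really ends with one of those characters, and shortDOIs such as 10/gwg8ww are not matched at all.

```python
import re

# Deliberately loose: "10.", a run of digits and dots, "/", then anything
# up to the next whitespace. An assumption-laden sketch, not a validator.
_CANDIDATE = re.compile(r"\b10\.[0-9.]+/\S+")


def find_doi_candidates(text: str) -> list[str]:
    """Return DOI-looking strings found in free text.

    Each result is only a candidate: it still needs to be checked against
    a resolver before being treated as a DOI.
    """
    # Heuristic: drop sentence punctuation that \S+ swallowed at the end.
    return [m.group(0).rstrip(".,;") for m in _CANDIDATE.finditer(text)]
```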

14. “All DOIs support content negotiation”

Content negotiation lets you request metadata in a given format such as RDF, Citeproc-JSON, etc. The CrossCite page lists a growing number of RAs that offer the service.

This query returns the RDF XML for a Crossref DOI:

$ curl -LH "Accept: application/rdf+xml" https://doi.org/10.1126/science.169.3946.635

It is configured in the Handle system, which tells the resolver which API can service the request.

For example, here’s the 10.SERV/Crossref record:

{
  "responseCode": 1,
  "handle": "10.SERV/crossref",
  "values": [
    // ...
    {
      "index": 4,
      "type": "10320/loc",
      "data": {
        "format": "string",
        "value": "<locations http_sc=\"302\">\n<location weight=\"0\" http_role=\"conneg\" href_template=\"https://api.crossref.org/v1/works/{hdl}/transform\" />\n</locations>\n"
      },
      "ttl": 86400,
      "timestamp": "2021-10-13T18:48:17Z"
    },
    // ...
  ]
}

And DataCite’s 10.SERV/DataCite:

{
  "responseCode": 1,
  "handle": "10.SERV/DataCite",
  "values": [
    // ...
    {
      "index": 500,
      "type": "10320/loc",
      "data": {
        "format": "string",
        "value": "<locations http_sc=\"302\">\n<location weight=\"0\" http_role=\"conneg\" href_template=\"https://data.crosscite.org/{hdl}\"/>\n</locations>"
      },
      "ttl": 86400,
      "timestamp": "2019-10-02T02:04:42Z"
    }
  ]
}

In the case of EIDR, this is done at the prefix level:

{
  "responseCode": 1,
  "handle": "0.NA/10.5240",
  "values": [
    {
      "index": 4,
      "type": "HS_NAMESPACE",
      "data": {
        "format": "string",
        "value": "<namespace><locs>10.5240/locations-template</locs></namespace>"
      },
      "ttl": 86400,
      "timestamp": "2017-07-24T17:51:33Z"
    }
  ]
}

Which in turn leads to a list of EIDR’s supported formats and redirects.

At the time of writing, HAND DOIs don’t support content negotiation though. If we query HAND for Tom Hanks, no dice:

$ curl -v -LH "Accept: application/rdf+xml" https://doi.org/10.23/F72B-0103-B361-071E-08F3

< content-type: text/html

...
<!DOCTYPE html>
<html lang="en">
<head>
...

Note that Content Negotiation is a standard HTTP feature. The special thing with DOIs is the negotiated set of supported scholarly metadata formats.
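
Since not every RA honours the Accept header, it’s worth checking what actually came back. A minimal sketch using only the Python standard library: one helper builds the request without sending it, the other compares a response Content-Type with what was asked for. Both function names are made up for illustration.

```python
from urllib.parse import quote
from urllib.request import Request


def conneg_request(doi: str, media_type: str) -> Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    # Percent-encode everything except '/' so unusual DOI characters survive.
    url = "https://doi.org/" + quote(doi, safe="/")
    return Request(url, headers={"Accept": media_type})


def was_negotiated(content_type: str, media_type: str) -> bool:
    """True if the response Content-Type is the media type that was requested."""
    # Ignore parameters such as "; charset=utf-8" when comparing.
    return content_type.split(";", 1)[0].strip().lower() == media_type.lower()
```

The request can be sent with `urllib.request.urlopen`, which follows redirects, and the Content-Type header of the response passed to `was_negotiated`. For the HAND example above, which came back as text/html, the check would return False.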

15. “DOIs are just Handles that start with 10.”

Each DOI is a Handle with a prefix starting 10.. Handle proxy servers work with DOIs, and support content negotiation where it is configured. They are effectively the same thing.

Technically though, DOIs are specified by the ISO 26324 standard to be technology-independent.

As they say: In theory theory is the same as practice. In practice, they are quite different.

16. “If it starts with 10. and looks like a DOI, it must be a DOI”

Not all strings that look like DOIs really are DOIs. There have been instances of platforms and publishers inadvertently not registering DOIs, or using DOI-like-strings which look for all the world like DOIs.

Geoffrey Bilder’s blog post from 2016 goes into some detail.

If you want to be sure it’s a DOI, you should resolve it.

17. “DOI is run / owned by Crossref / DataCite / etc”

The DOI system is governed by the DOI Foundation on behalf of DOI registration agencies. The board of directors are drawn from registration agencies.

The DONA Foundation is responsible for running the Global Handle Registry, which is the root service for Handle resolution. This in turn powers the DOI system. The DOI Foundation is directly affiliated with DONA, and does run some services that make up the Global Handle Registry.

You can read more about the relationship here and there’s some context in the DONA FAQ.

18. “DOIs will always resolve”

DOIs aren’t magic. As long as the DOI and Handle systems keep working as expected, the identifiers will always resolve to the resource URLs.

The resource URLs under the control of publishers and other DOI owners might not, though. This research by Martin Eve goes into more detail.

19. “Publishers always update the DOI when the content moves”

Publishers are able to update landing page URLs without updating the DOI resource URL.

Some publishers, such as PLOS and Elsevier, run their own internal link resolvers.

To build on point 9 above, here is the chain of redirects for an example PLOS DOI:

  1. https://doi.org/10.1371/journal.pone.0190046
  2. https://dx.plos.org/10.1371/journal.pone.0190046
  3. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046
  4. https://journals.plos.org/plosone/doi?id=10.1371/journal.pone.0190046

If PLOS were to update their article URLs, they could update the mapping at dx.plos.org rather than update the DOI Resource URL. So, it’s possible for a landing page to move without the DOI being updated.

The End

Who knew there was so much to misunderstand! Hopefully this helps avoid assumptions that might introduce software bugs. Some of these went into more detail than necessary for everyday use, but I think real-world examples are useful for explanation.

Do you need some help working with DOIs? Get in touch!