Pardalotus

How long is a DOI?

Joe Wass December 22, 2024

In 2024 DataCite released their first public data file. It’s easy to get a copy. Crossref have made a data dump available for the past few years.

Having both files available opens up some interesting possibilities in comparing and combining the two data sources.

The most obvious place to look is the DOIs themselves…

… and the simplest question you could ask is “How long is a DOI?”

Who’s asking?

In theory DOIs can come in all sizes and shapes. When we write software for processing scholarly metadata we must be able to deal with any DOI, which poses its own challenges. But having a good idea of the typical characteristics of the data helps us to make good engineering decisions.

The correct response to “how long is a DOI” is “why are you asking?”. The answer differs depending on whether you’re trying to build a database index, enforce a data constraint, decide how wide to make a text box, or simply work out how many floppy discs to budget for.

We’re going to look at averages, maximums and minimums. You can treat these as heuristics drawn from the actual data. But the data may change, and there will always be exceptions!

DOIs Mean Averages

The Pardalotus Snapshot Tool works with both Crossref and DataCite snapshots transparently, and it can generate statistics. So let’s run it against the DataCite snapshot:

pardalotus_snapshot_tool --input ~/data/datacite  --stats --verbose

...
Record count: 52863283
...
Total DOI chars: 1155271510
Mean DOI chars: 21.85395
Modal DOI chars: 18
...

And on the Crossref snapshot:

pardalotus_snapshot_tool --input ~/data/crossref  --stats --verbose
...
Total DOI chars: 4012822526
Mean DOI chars: 25.396942
Modal DOI chars: 25
...

...

The mean (average) DOI lengths are pretty close: about 22 characters for DataCite and 25 characters for Crossref.
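The means can be checked by hand from the totals the tool prints. Here is a minimal sketch in Python that uses only figures quoted in this post. The Crossref record count comes from the summary table further down, and the variable names are my own, not the tool's.

```python
# Total DOI characters and record counts, as quoted in this post.
totals = {
    "DataCite": (1_155_271_510, 52_863_283),
    "Crossref": (4_012_822_526, 158_004_152),
}

# Mean DOI length = total characters / number of records.
means = {source: chars / records for source, (chars, records) in totals.items()}

for source, mean in means.items():
    print(f"{source}: {mean:.2f} characters, rounds to {round(mean)}")
```

This gives roughly 21.85 for DataCite and 25.40 for Crossref, which round to 22 and 25.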

But there’s more to the story. Crossref has a lot more DOIs, and a lot more history, so there’s more chance for variability. The snapshot tool calculates the frequencies for DOI lengths. We can chart this:

[Figure: doi-lengths, frequency of DOI lengths for Crossref and DataCite]

Source    Count      Shortest  Longest  Mode  Mean
Crossref  158004152  9         325      25    25
DataCite  52863283   9         152      18    22
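The columns in this table can all be derived from a frequency count of DOI lengths. The following is a rough sketch of that idea in Python, not the snapshot tool's actual implementation. The function name `length_stats` and the sample DOIs are made up for illustration.

```python
from collections import Counter


def length_stats(dois):
    """Summarise DOI lengths, counted in Unicode code points."""
    # Frequency table: DOI length -> number of DOIs with that length.
    freq = Counter(len(doi) for doi in dois)
    count = sum(freq.values())
    total = sum(length * n for length, n in freq.items())
    return {
        "count": count,
        "shortest": min(freq),
        "longest": max(freq),
        "mode": freq.most_common(1)[0][0],
        "mean": total / count,
    }


# Made-up DOIs, for illustration only.
sample = ["10.5555/12345678", "10.5555/abc", "10.5555/abcdefgh", "10.5555/x"]
print(length_stats(sample))
```

Keeping only the frequency table, rather than every DOI, means the statistics can be computed in a single pass over a very large snapshot.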

Here’s the same chart truncated to remove the long tail. DataCite’s modal value of 18 really stands out. It is evidence of DataCite’s tool for generating standard-length suffixes.

[Figure: doi-lengths, the same chart truncated to remove the long tail]

As long as a Unicode string

DOIs are Unicode strings, and they are permitted to contain any printable character. DataCite and Crossref each place their own restrictions on them, but these restrictions have changed over the years.

Unicode is often encoded as UTF-8, in which each ASCII character takes one byte and higher code points take variable-length sequences of up to 4 bytes. The more non-ASCII characters a DOI uses, the more bytes it takes to store.
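To make the difference between characters and bytes concrete, here is a small Python sketch. In Python 3, `len()` on a string counts code points, and encoding to UTF-8 gives the byte count. The function name and the example DOIs are hypothetical.

```python
def char_and_byte_length(doi):
    """Return (code points, UTF-8 bytes) for a DOI string."""
    return len(doi), len(doi.encode("utf-8"))


# Hypothetical DOIs, for illustration only.
ascii_doi = "10.5555/abc-def"        # plain ASCII hyphen-minus
dashed_doi = "10.5555/abc\u2013def"  # U+2013 EN DASH instead of the hyphen

print(char_and_byte_length(ascii_doi))   # same number of characters and bytes
print(char_and_byte_length(dashed_doi))  # same characters, two extra bytes
```

Both strings are 15 characters long, but the second takes 17 bytes because the en dash is a 3-byte sequence.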

Using Unicode for identifiers enriches our lives with a great many possibilities, and even more challenges:

  • The usability people get to worry about homoglyphs and duplicates. DataCite have given some thought to this, and their automatic suffix generator avoids i, l, and o for this reason.
  • The schema people can ponder questions like “is the maximum length of a DOI measured in characters or bytes?”.
  • The database people get to think about character encoding (don’t be fooled into using utf8 in MySQL! Use utf8mb4).

If we produce the above statistics for UTF-8 bytes we get:

Source    Total Chars  Total UTF-8 Bytes  Difference  Max Unicode  As UTF-8
Crossref  4012822526   4012822863         337         “–” = 8211   0xe2, 0x80, 0x93
DataCite  1155271510   1155271752         242         “—” = 8212   0xe2, 0x80, 0x94

Unicode makes such a small difference that it doesn’t change any of the statistics. The mean and modal values are the same.

Amusingly, the highest Unicode code points in the two datasets are the en dash and the em dash, one apart. Both are 3-byte sequences in UTF-8, but don’t assume you won’t get higher ones one day. See the next post on Unicode in DOIs. It’s more detailed but less useful.
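Finding the highest code point in a set of DOIs takes only a few lines. This Python sketch shows one way to do it, under the assumption that the DOIs are already loaded as strings. The function name and sample data are invented for the example, and it is not how the snapshot tool does it.

```python
def highest_code_point(dois):
    """Return (character, code point, UTF-8 bytes) for the highest code point seen."""
    top = max((ch for doi in dois for ch in doi), key=ord)
    return top, ord(top), top.encode("utf-8")


# Hypothetical DOIs, for illustration only. The second contains U+2013 EN DASH.
sample = ["10.5555/abc-def", "10.5555/abc\u2013def"]
ch, code_point, encoded = highest_code_point(sample)
print(code_point, [hex(b) for b in encoded])
```

For this sample it reports 8211 and the bytes 0xe2, 0x80, 0x93, matching the Crossref row in the table above.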

The presence of these two kinds of dashes does suggest that usability concerns about homoglyphs aren’t entirely misplaced. They are easy to get mixed up.
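If you want to spot these in your own data, one possible approach is to flag any character in the Unicode "dash punctuation" category (Pd) that isn't the plain ASCII hyphen. This is only a sketch: the function name is made up, and the Pd category does not cover every lookalike (the minus sign U+2212, for example, is classed as a math symbol).

```python
import unicodedata


def unexpected_dashes(doi):
    """List dash-like characters other than the ASCII hyphen-minus."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in doi
        if unicodedata.category(ch) == "Pd" and ch != "-"
    ]


# Hypothetical DOI containing U+2013 EN DASH, U+2014 EM DASH and a plain hyphen.
suspect = "10.5555/a\u2013b\u2014c-d"
for ch, name in unexpected_dashes(suspect):
    print(f"U+{ord(ch):04X} {name}")
```

Whether to reject, normalise or simply log such DOIs depends on why you're asking.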

The conclusion is that Unicode doesn’t skew the data enough to notice. Just remember to handle it properly.

Answer

So how long is a DOI? Probably about 18 - 25 characters long. But don’t count on it.

Your turn

There’s only so long I can write about DOI lengths, and this is probably it. But maybe you have some ideas.

The Pardalotus Snapshot Tool has other features, with more coming.

Feedback welcome!