Pardalotus

Unicode in DOIs

Joe Wass December 22, 2024

The previous post looked at how long DOIs are. One of the questions was:

Do UTF-8 encodings from Unicode characters make any difference on the statistics around DOI length?
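To make the question concrete: a character outside ASCII takes more than one byte in UTF-8, so a DOI's length in characters and its length in bytes can differ. Here is a minimal Python sketch. The helper name `doi_lengths` and both DOIs are made up for illustration and are not taken from the dataset.

```python
def doi_lengths(doi: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for a DOI."""
    return len(doi), len(doi.encode("utf-8"))


# Hypothetical DOIs, invented for this example.
ascii_doi = "10.5555/abc123"
accented_doi = "10.5555/caf\u00e9"  # ends in é (U+00E9), two bytes in UTF-8

print(doi_lengths(ascii_doi))     # (14, 14)
print(doi_lengths(accented_doi))  # (12, 13)
```

Only DOIs containing non-ASCII characters show a gap between the two numbers.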

Betteridge to the rescue! The answer is:

No.

Use of Unicode is so sparse that it doesn’t change any statistics. But it does raise an interesting question.

What is the distribution of characters, Unicode or otherwise, in DOIs? And does it differ between DataCite and Crossref?

I can’t think of any practical reason you’d want to know. But that’s no reason not to find out.

Conveniently, a --doi-unicode-distribution option was added to the Pardalotus Snapshot Tool shortly after it occurred to me.
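The general idea behind such a count can be sketched in a few lines of Python. This is not the Snapshot Tool's implementation, only an illustration under assumptions: the function name `char_distribution` is hypothetical, the sample DOIs are invented, and no case folding or other normalisation is applied.

```python
from collections import Counter
from typing import Iterable


def char_distribution(dois: Iterable[str]) -> Counter:
    """Count how often each character (code point) occurs across all DOIs."""
    counts: Counter = Counter()
    for doi in dois:
        counts.update(doi)  # a str is iterated one code point at a time
    return counts


# Hypothetical sample, invented for this example.
sample = ["10.5555/abc123", "10.5555/caf\u00e9"]
dist = char_distribution(sample)

# Print the most frequent characters with their code points.
for char, count in dist.most_common(3):
    print(f"U+{ord(char):04X} {char!r} {count}")
```

Counting code points rather than bytes keeps a character like é as a single entry instead of splitting it into its two UTF-8 bytes.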

DataCite Unicode Distribution

Graph of character code distribution in DataCite

CSV

  • 1 is the most popular character in DataCite DOIs.

Crossref Unicode Distribution

Graph of character code distribution in Crossref

CSV

  • 0 is the most popular character in Crossref DOIs.
  • We see ( and ) in Crossref, a tell-tale sign of SICIs (Serial Item and Contribution Identifiers).

Commonalities

  • Numbers are more popular than letters. The fact that every DOI must have a numerical prefix skews this.
  • Non-ASCII characters are very uncommon.
  • As every DOI contains 10. and /, it’s not surprising to see 1 and 0 in the top places. But DataCite has many more slashes. This may be to do with the way that hierarchical DOIs for datasets are constructed.
  • The higher, less-used characters in both DataCite and Crossref mostly look like letters with diacritics from European languages.
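Observations like these can be checked by grouping a character count into broad classes. The sketch below is an assumption-laden illustration, not part of the Snapshot Tool: the class names, the helpers `classify` and `summarise`, and the sample DOI are all hypothetical.

```python
import unicodedata
from collections import Counter


def classify(char: str) -> str:
    """Place a character in one of four broad, made-up classes."""
    if char.isascii():
        if char.isdigit():
            return "digit"
        if char.isalpha():
            return "letter"
        return "other ASCII"
    return "non-ASCII"


def summarise(counts: Counter) -> Counter:
    """Sum per-character counts into per-class totals."""
    totals: Counter = Counter()
    for char, n in counts.items():
        totals[classify(char)] += n
    return totals


# Hypothetical DOI, invented for this example.
counts = Counter("10.5555/caf\u00e9")
print(summarise(counts))

# The Unicode name shows what kind of character a high code point is.
print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
```

Looking up the Unicode name of each non-ASCII character is a quick way to confirm whether it is an accented Latin letter or something else.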

Conclusion

In practice, DOIs use a constrained character set. Despite the differences in history and size, the character distributions of DataCite and Crossref are quite similar.