The previous post looked at how long DOIs are. One of the questions was:
Do UTF-8 encodings of Unicode characters make any difference to the statistics around DOI length?
Betteridge to the rescue! The answer is:
No.
Use of Unicode is so sparse that it doesn’t change any statistics. But it does raise an interesting question.
What is the distribution of characters, Unicode or otherwise, in DOIs? And does it differ between DataCite and Crossref?
I can’t think of any practical reason you’d want to know. But that’s no reason not to find out.
Conveniently, a --doi-unicode-distribution option was added to the Pardalotus Snapshot Tool shortly after it occurred to me.
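As a rough illustration of what a character distribution means here, the following is a minimal Python sketch that counts code points across a list of DOIs. It is not the Snapshot Tool's implementation; the function name and the sample DOIs are hypothetical and exist only for this example.

```python
from collections import Counter


def char_distribution(dois):
    """Count how often each Unicode code point appears across an iterable of DOIs."""
    counts = Counter()
    for doi in dois:
        # Iterating a str yields one code point at a time.
        counts.update(doi)
    return counts


# Hypothetical sample DOIs, for illustration only.
sample = [
    "10.5555/12345678",
    "10.5555/abc(2001)1",
    "10.5555/donn\u00e9es",
]

counts = char_distribution(sample)

# Print the five most common characters with their code points.
for ch, n in counts.most_common(5):
    print(f"U+{ord(ch):04X} {ch!r} {n}")
```

Counting code points rather than bytes keeps a character such as é as a single entry, however many bytes it takes in UTF-8.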
DataCite Unicode Distribution
1 is the most popular character in DataCite DOIs.
Crossref Unicode Distribution
- 0 is the most popular character in Crossref DOIs.
- We see ( and ) in Crossref, a tell-tale sign of SICIs.
Commonalities
- Numbers are more popular than letters. The fact that every DOI must have a numerical prefix skews this.
- Non-ASCII characters are very uncommon.
- As every DOI contains 10. and / it’s not surprising to see 1 and 0 in top place. But DataCite has many more slashes. This may be to do with the way that hierarchical DOIs for datasets are constructed.
- The higher, less-used characters in DataCite and Crossref mostly look like diacritics from European languages.
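To make the point about non-ASCII characters concrete, here is a small Python sketch of how you might compare a DOI's character count with its UTF-8 byte count, and measure what share of characters fall outside ASCII. The function names are hypothetical and this is only one way to do it.

```python
def length_stats(doi):
    """Return (character count, UTF-8 byte count) for one DOI."""
    return len(doi), len(doi.encode("utf-8"))


def non_ascii_share(dois):
    """Fraction of all characters across the DOIs that fall outside ASCII."""
    total = 0
    non_ascii = 0
    for doi in dois:
        total += len(doi)
        # Code points above 127 need more than one byte in UTF-8.
        non_ascii += sum(1 for ch in doi if ord(ch) > 127)
    return non_ascii / total if total else 0.0


# Hypothetical DOI containing one non-ASCII character (é).
chars, utf8_bytes = length_stats("10.5555/donn\u00e9es")
print(chars, utf8_bytes)
```

The two lengths only diverge for DOIs that contain non-ASCII characters, which is why sparse Unicode use leaves the length statistics unchanged.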
Conclusion
In practice, DOIs use a constrained character set. Despite the differences in history and size, the character distributions of DataCite and Crossref are quite similar.