Pardalotus

Snapshot Tool

This is a command-line tool for working with DataCite and Crossref data dumps. It can convert between formats, combine snapshots and produce statistics.

GitHub: https://github.com/Pardalotus/pardalotus_snapshot_tool

Cargo: https://crates.io/crates/pardalotus_snapshot_tool

If you haven’t already, install Rust.

Install the snapshot tool:

cargo install pardalotus_snapshot_tool

Then you can run:

pardalotus_snapshot_tool --help

Input

Input is specified with the --input flag. This can be a specific snapshot file, or a directory containing snapshot files in any format. Directories are visited recursively, and you can check which files are used with the --list-input-files option.

Depending on the format and your hardware, the tool can read up to 20,000 records per second, but reading hundreds of millions of records can still take a while. Pass the --verbose flag to see progress, which is reported every 10,000 lines read.

Speed fluctuates over the course of a file because metadata records vary in complexity.
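As a rough guide to how long a run might take, the arithmetic can be sketched in Python. The function name and the example record count of 200 million are illustrative assumptions. Because 20,000 records per second is the upper bound quoted above, the result is a best-case estimate and real runs may take longer.

```python
def estimated_read_seconds(record_count: int, records_per_second: float = 20_000) -> float:
    """Best-case read time in seconds at a constant throughput.

    The default of 20,000 records per second is the upper bound quoted
    in this document; actual speed depends on format and hardware.
    """
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return record_count / records_per_second


if __name__ == "__main__":
    # Hypothetical snapshot of 200 million records.
    seconds = estimated_read_seconds(200_000_000)
    print(f"{seconds:.0f} seconds, about {seconds / 3600:.1f} hours")
```

At the quoted maximum rate, 200 million records would take 10,000 seconds, a little under three hours.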

Output

Currently only one output format is supported, .jsonl.gz. This is a single file of newline-delimited JSON records, gzipped. It’s simple to parse.

You can convert a snapshot file by specifying an input file and an output file.

pardalotus_snapshot_tool --input /path/to/snapshot-dir/datacite.tar.gz --output-file output-file.jsonl.gz --verbose

You can combine snapshot files by specifying a directory of input files and an output file.

pardalotus_snapshot_tool --input /path/containing/snapshots --output-file output-file.jsonl.gz --verbose
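Because the output is newline-delimited JSON inside a gzip file, it can be read with the Python standard library alone. The following is a minimal sketch, not part of the tool: the function name read_records is hypothetical, and it assumes each non-empty line holds one complete JSON value.

```python
import gzip
import json
from typing import Any, Iterator


def read_records(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a .jsonl.gz file."""
    # "rt" opens the gzip stream in text mode, so it can be read line by line
    # without decompressing the whole file into memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                yield json.loads(line)
```

For example, `sum(1 for _ in read_records("output-file.jsonl.gz"))` counts the records in the file while streaming it.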

List DOIs

Use the --print-dois option to export the list of DOIs for all records. For example, to save all DOIs to a file:

pardalotus_snapshot_tool --input /path/to/snapshot-dir/datacite.tar.gz --print-dois > dois.txt
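A saved DOI list is easy to post-process. The sketch below counts DOIs per prefix (the part before the first slash). It assumes the file written by the command above contains one DOI per line, which this document does not state explicitly, and the function name count_doi_prefixes is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_doi_prefixes(lines: Iterable[str]) -> Counter:
    """Count DOIs per prefix, assuming one DOI per line."""
    counts: Counter = Counter()
    for line in lines:
        doi = line.strip()
        if not doi:
            continue  # ignore blank lines
        # A DOI has the form prefix/suffix; split only on the first slash
        # because the suffix may itself contain slashes.
        prefix = doi.split("/", 1)[0]
        counts[prefix] += 1
    return counts
```

Passing an open text file to the function, as in `count_doi_prefixes(open(path, encoding="utf-8"))`, streams the list line by line, and `.most_common(10)` on the result gives the ten most frequent prefixes.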

Stats

Use the --stats option to generate:

  • total number of DOIs
  • total length of all DOIs, in characters
  • total length of all DOIs, in UTF-8 bytes
  • total size of JSON for all DOIs
  • averages
  • frequencies of JSON size
  • frequencies of DOI size
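To illustrate what these figures mean, in particular the difference between a DOI's length in characters and in UTF-8 bytes, here is a small Python sketch that computes comparable numbers from (DOI, JSON text) pairs. It is not the tool's implementation. The function name, the dictionary keys, and the choice to measure JSON size in UTF-8 bytes are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize(records: Iterable[Tuple[str, str]]) -> dict:
    """Compute simple size statistics over (doi, json_text) pairs."""
    doi_count = doi_chars = doi_bytes = json_bytes = 0
    doi_length_freq: Counter = Counter()
    json_size_freq: Counter = Counter()
    for doi, json_text in records:
        doi_count += 1
        doi_chars += len(doi)                   # Unicode code points
        doi_bytes += len(doi.encode("utf-8"))   # bytes once encoded
        size = len(json_text.encode("utf-8"))   # assumption: JSON size in bytes
        json_bytes += size
        doi_length_freq[len(doi)] += 1
        json_size_freq[size] += 1
    return {
        "doi_count": doi_count,
        "doi_chars": doi_chars,
        "doi_utf8_bytes": doi_bytes,
        "json_utf8_bytes": json_bytes,
        # Guard against division by zero on empty input.
        "mean_doi_chars": doi_chars / doi_count if doi_count else 0.0,
        "mean_json_bytes": json_bytes / doi_count if doi_count else 0.0,
        "doi_length_frequencies": doi_length_freq,
        "json_size_frequencies": json_size_freq,
    }
```

The character and byte totals differ only when a DOI contains non-ASCII characters, since each of those takes more than one byte in UTF-8.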