This is a command-line tool for working with DataCite and Crossref data dumps. It can convert between formats, combine snapshots and produce statistics.
GitHub: https://github.com/Pardalotus/pardalotus_snapshot_tool
Cargo: https://crates.io/crates/pardalotus_snapshot_tool
If you haven’t already, install Rust.
Install the snapshot tool:
cargo install pardalotus_snapshot_tool
Then you can run
pardalotus_snapshot_tool --help
Input
Input is specified with the --input flag. This can be a specific snapshot file, or a directory containing snapshot files in any format. Directories are visited recursively, and you can check which files are used with the --list-input-files option.
Depending on the format and your hardware, the tool can read up to 20,000 records per second, but reading hundreds of millions of records can still take a while. Pass the --verbose flag to see progress, reported every 10,000 lines read.
Speed fluctuates through the file due to variability in metadata record complexity.
Output
Currently only one output format is supported: .jsonl.gz. This is a single file of newline-delimited JSON records, gzipped. It’s simple to parse.
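As a rough illustration of how such a file might be consumed, here is a minimal Python sketch that streams a gzipped, newline-delimited JSON file one record at a time. It is not part of the tool. The function name read_records, the demo file, and the "id" field with its values are all made up for illustration; the actual fields depend on the source snapshot.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Iterator


def read_records(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a gzipped JSONL file."""
    # "rt" decompresses and decodes to text on the fly, so the whole file
    # is never held in memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo: write a tiny stand-in file, then read it back.
# The "id" field and its values are hypothetical, not real snapshot data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    out.write('{"id": "10.5555/example-1"}\n')
    out.write('{"id": "10.5555/example-2"}\n')

record_count = sum(1 for _ in read_records(demo_path))
print(record_count)
```

To use it on real output, pass the path of the file written with --output-file to read_records instead of the demo path.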
You can convert a snapshot file by specifying an input file and an output file.
pardalotus_snapshot_tool --input /path/to/snapshot-dir/datacite.tar.gz --output-file output-file.jsonl.gz --verbose
You can combine snapshot files by specifying a directory of input files and an output file.
pardalotus_snapshot_tool --input /path/containing/snapshots --output-file output-file.jsonl.gz --verbose
List DOIs
Use the --print-dois option to export the list of DOIs for all records. For example, to save all DOIs to a file:
pardalotus_snapshot_tool --input /path/to/snapshot-dir/datacite.tar.gz --print-dois > dois.txt
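Once the DOIs are saved, they can be loaded for further processing. The Python sketch below assumes the saved file holds one DOI per line, which is an assumption about the output layout rather than something stated here. The helper names load_dois and count_unique and the demo values are hypothetical.

```python
import os
import tempfile
from typing import List


def load_dois(path: str) -> List[str]:
    """Read a text file and return its non-empty lines, whitespace-trimmed.

    Assumes one DOI per line (an assumption, not confirmed by the tool's docs).
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def count_unique(dois: List[str]) -> int:
    """Count distinct DOIs, comparing in lower case since DOIs are case-insensitive."""
    return len({doi.lower() for doi in dois})


# Self-contained demo with made-up DOIs; the second and third differ only by case.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-dois.txt")
with open(demo_path, "w", encoding="utf-8") as out:
    out.write("10.5555/example-1\n10.5555/EXAMPLE-2\n10.5555/example-2\n")

dois = load_dois(demo_path)
unique_count = count_unique(dois)
print(len(dois), unique_count)
```

For real data, pass the path of the file written by the redirect above to load_dois.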
Stats
Use the --stats option to generate:
- total number of DOIs
- total length of all DOIs, in characters
- total length of all DOIs, in UTF-8 bytes
- total size of JSON for all DOIs
- averages
- frequencies of JSON size
- frequencies of DOI size
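To make the difference between character counts and UTF-8 byte counts concrete, here is a small Python sketch that computes comparable figures for a list of DOI strings. It is an illustration under stated assumptions, not the tool's implementation. The function name doi_stats, the result keys, and the sample DOIs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def doi_stats(dois: Iterable[str]) -> Dict[str, object]:
    """Compute count, total lengths, mean length and a length histogram."""
    count = 0
    total_chars = 0
    total_utf8_bytes = 0
    length_frequencies: Counter = Counter()
    for doi in dois:
        count += 1
        total_chars += len(doi)  # Unicode code points
        total_utf8_bytes += len(doi.encode("utf-8"))  # bytes once encoded
        length_frequencies[len(doi)] += 1
    return {
        "count": count,
        "total_chars": total_chars,
        "total_utf8_bytes": total_utf8_bytes,
        "mean_chars": total_chars / count if count else 0.0,
        "length_frequencies": dict(length_frequencies),
    }


# Made-up sample: the second DOI contains U+00E9, which is one character
# but two bytes in UTF-8, so the two totals differ by one.
sample = ["10.5555/abc", "10.5555/\u00e9"]
stats = doi_stats(sample)
print(stats)
```

The two totals only diverge when DOIs contain non-ASCII characters, which is why they are worth reporting separately.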