citations Archives - Software Heritage

Episciences links article code through Software Heritage

Nicole Martinelli — Thu, 21 Aug 2025 14:35:00 +0000

Software Heritage, the universal source code archive, preserves and provides access to source code as vital digital heritage. Researchers can directly link their published scholarly articles to the software that powers them. This new capability enhances research reproducibility and transparency by connecting findings to specific software versions.

Software Heritage partnered with the Center for Direct Scientific Communication (CCSD) to make this happen. The collaboration first enabled software deposits on the open archive HAL in 2018, laying the groundwork for this new capability. Episciences, an overlay journal platform that hosts articles from open repositories such as arXiv, Zenodo, and bioRxiv, now builds on that work. Authors and journals using Episciences can link their articles with supplementary software via Software Heritage, using a SoftWare Hash IDentifier (SWHID) or a HAL-ID.

There are three basic steps:

Submit software to HAL

Depositing software via HAL ensures its sustainable archiving in Software Heritage. The complete deposit procedure is detailed in the HAL documentation: Deposit software source code.

Or, you can deposit software directly into Software Heritage by making a “Save Code Now” request with a GitHub URL. For more details, see this post: Save and reference research software
Link the software to the Episciences publication by adding a SWHID or a HAL-ID to your publication. For more, check out the Episciences documentation or this YouTube walkthrough.
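The direct “Save Code Now” route can also be scripted. The sketch below only prepares the HTTP request and does not send it. The endpoint layout reflects our understanding of the public Software Heritage Web API and should be checked against the current API documentation; the repository URL is a hypothetical placeholder.

```python
from urllib.request import Request

# Assumed base URL and endpoint layout of the Software Heritage Web API;
# verify both against the current API documentation before relying on them.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_code_now_request(origin_url: str, visit_type: str = "git") -> Request:
    """Prepare (but do not send) a Save Code Now request for a repository."""
    endpoint = f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"
    return Request(endpoint, method="POST")


# Hypothetical repository URL, used only for illustration.
request = save_code_now_request("https://github.com/example/project")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it. Building the request separately makes it easy to inspect the target URL first.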

Building on this ability to link articles and software, Episciences actively works to meet the evolving needs of researchers. Episciences is emerging as a new model in academic publishing, improving the visibility and accessibility of research articles that have already been peer-reviewed and published in conference proceedings. Instead of building a new library (a traditional journal), an overlay journal acts like a knowledgeable curator who browses existing open shelves (repositories), selects the best books, writes introductions for them, and compiles a guide (the journal) that points readers to those freely available books. This approach allows researchers to submit their conference papers to Episciences for additional scrutiny and broader dissemination, potentially increasing the impact and reach of their work.

“One fundamental aspect of the openness of science is the close link between scientific publications and associated research data. This link is essential for the transparency, reproducibility, and the overall progress of science. Episciences responds to this dynamic by inviting authors to supplement the submission of their document with a link to the dataset and/or software used in their work,” said Agnès Magron, CCSD.

Beyond enabling these connections, Episciences actively contributes to the wider open science movement. The API and connector for Episciences were developed as part of the European Union-funded FAIRCORE4EOSC project. Episciences is also a member of the SCOSS Family. This commitment underscores why enabling this link via Episciences and the Software Heritage integration with HAL is paramount for research reproducibility, transparency, and accountability.

The next step is leveraging the COAR Notify protocol (developed by the Confederation of Open Access Repositories, COAR) to share links between different research object types.

The established partnership with CCSD, the successful HAL integration, and its utilization by Episciences provide a practical way for researchers to ensure their essential software is archived by Software Heritage and discoverable with their publications. Researchers and journals using platforms integrated with open repositories like HAL are encouraged to leverage this capability to link their software to their scholarly articles using Software Heritage. The partnership is about building a clearer, more open scientific story, where the findings and the code that powers them are part of the same picture.

The post Episciences links article code through Software Heritage appeared first on Software Heritage.

Software Heritage Citation Feature: Addressing researcher needs

Morane Gruenpeter — Wed, 07 May 2025 14:35:05 +0000

Software is essential in academic research—whether as a tool for data analysis, a research output, or even the very object of study. A truly Open Science ecosystem requires giving software the same attention and recognition as publications and datasets. In academia, citations are the currency of credit.

When we began working on Software Heritage in 2014, interest in software as a research outcome was growing steadily, driven in part by increasing awareness of software’s critical role in tackling the reproducibility crisis. One area that quickly drew attention was software citation, which culminated in the 2016 publication of the FORCE11 “Software citation principles.”

Nearly a decade later, software citation continues to evolve, thanks to the ongoing efforts of initiatives like the FORCE11 Software Citation Working Group (WG) and the joint RDA/FORCE11/ReSA Software Identification Working Group (WG).

Following GitHub’s lead, Software Heritage now has built-in citation support. This is a big step towards finally recognizing software as a real research output.

This blog post gives a quick rundown of the important news and how Software Heritage is shaking things up for citing software. We’re tackling a problem that’s been around for ages and offering a fix.

Citation made easy with Software Heritage

With this new feature, researchers, users, and readers can now easily generate and copy a BibTeX citation directly from Software Heritage into their .bib files.

Citing a specific software version—or even a precise code fragment—is just as simple: select the version or highlight the lines of code, and you’re ready to go.

Before diving deeper into the challenge of credit, it’s helpful to understand what’s already achievable using Software Heritage’s existing tools for software reference and preservation.

The universal archive of Software Heritage allows researchers and developers to:

Prepare their repositories with key metadata files (AUTHORS, README, LICENSE, codemeta.json, etc.)
Trigger archival using the Save Code Now interface, the browser extension, or a webhook
Obtain a Software Hash Identifier (SWHID), a persistent, verifiable reference to a specific version or component

With the new citation feature, you can generate a ready-to-use BibTeX citation directly from the archived object, offering a more robust and reliable alternative to citing a forge URL.

For details on how to cite, check the Software Heritage documentation. It describes how the system uses the software’s internal data to generate a citation you can export.

Software Heritage can automatically generate a BibTeX citation using the intrinsic metadata archived from a software repository. This metadata is typically sourced from either a codemeta.json or a citation.cff file found in the repository.
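To make the idea concrete, here is a minimal sketch of turning CodeMeta-like metadata into a BibTeX entry. It is not Software Heritage’s actual implementation. The function name, the sample project, the `swhid` key in the input dictionary, and the all-zero placeholder identifier are assumptions made for illustration.

```python
def to_bibtex(key: str, meta: dict) -> str:
    """Render a @softwareversion entry from CodeMeta-like metadata."""
    authors = " and ".join(
        f"{person['familyName']}, {person['givenName']}"
        for person in meta["author"]
    )
    fields = {
        "author": authors,
        "title": meta["name"],
        "version": meta["version"],
        "year": meta["datePublished"][:4],  # keep only the year of an ISO date
        "swhid": meta["swhid"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@softwareversion{{{key},\n{body}\n}}"


# Hypothetical metadata; the SWHID is an all-zero placeholder, not a real object.
sample = {
    "name": "example-solver",
    "version": "1.2.0",
    "datePublished": "2025-03-14",
    "author": [
        {"givenName": "Jane", "familyName": "Doe"},
        {"givenName": "Richard", "familyName": "Roe"},
    ],
    "swhid": "swh:1:rel:" + "0" * 40,
}

entry = to_bibtex("example2025", sample)
print(entry)
```

The resulting text can be pasted into a .bib file; the `@softwareversion` type and the `swhid` field are meant for use with the biblatex-software package.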

To make citing code easier, users can embed a specific version or fragment of code directly into webpages using iframes. A simple Web UI endpoint is provided for this purpose.
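As a rough illustration of such an embed, the sketch below builds the iframe markup for a given SWHID. The `/browse/embed/` path is an assumption about the Web UI endpoint and should be confirmed in the Software Heritage documentation. The helper name, the styling, and the sample identifier are illustrative only.

```python
from html import escape

# Assumed embed endpoint of the archive's Web UI; confirm it against the
# Software Heritage documentation before use.
EMBED_ROOT = "https://archive.softwareheritage.org/browse/embed"


def embed_iframe(swhid: str, height_px: int = 500) -> str:
    """Return iframe markup that displays the object named by a SWHID."""
    src = f"{EMBED_ROOT}/{swhid}/"
    return (
        f'<iframe src="{escape(src, quote=True)}" '
        f'style="width: 100%; height: {height_px}px;"></iframe>'
    )


# Illustrative identifier: a content SWHID with a line-range qualifier.
snippet = embed_iframe(
    "swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad;lines=1-1"
)
print(snippet)
```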

A long, intertwined journey

The “Software Citation Principles” FORCE11 paper highlights four key motivations for citing software:

Credit
Understanding research fields
Discovering software
Reproducibility

These goals are crucial, but not exactly straightforward. To make them happen, we need to look at the practical parts of “software citation.” We’re drawing on the guidance from the European Open Science Cloud (EOSC) report on Scholarly Infrastructures for Research Software (December 2020). The 94-page report outlines practical recommendations to improve the current landscape by building on and connecting existing infrastructures.

Breaking down software citation components

It might seem like “software citation” is easy – just one more thing to list in your bibliography. But actually, the concept brings up at least four distinct and equally important points:

Archival: Ensuring long-term preservation and accessibility to the source code
Reference: Accurately identifying the exact version used, to ensure reproducibility
Description: Capturing structured, well-curated metadata about the software
Credit: Acknowledging all contributors involved in the software project

Among these, the most complex and often debated issue is credit: how should credit be given, and what “thing” do we want to cite?

Artifact granularity – SWHIDs in citations

Software is often difficult to cite precisely due to its layered and modular structure. Even small projects usually have several parts, so citing a specific version—or even a specific commit—might be needed for reproducibility or clarity.
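The levels of granularity show up directly in the identifier syntax. The sketch below parses a SWHID into its object type, hash, and optional qualifiers, following the published `swh:1:<type>:<hash>` layout. The helper names are ours, and the validation is deliberately minimal.

```python
import re
from typing import NamedTuple

# The five object types that a version 1 SWHID can name.
OBJECT_TYPES = {
    "cnt": "content (a single file)",
    "dir": "directory",
    "rev": "revision (a commit)",
    "rel": "release",
    "snp": "snapshot",
}

CORE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


class Swhid(NamedTuple):
    object_type: str
    object_id: str
    qualifiers: dict


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core parts and its ';key=value' qualifiers."""
    core, *raw_qualifiers = text.split(";")
    match = CORE.match(core)
    if match is None:
        raise ValueError(f"not a valid SWHID core: {core!r}")
    qualifiers = dict(item.split("=", 1) for item in raw_qualifiers)
    return Swhid(match.group(1), match.group(2), qualifiers)


# Illustrative identifier pointing at a line range inside one file.
parsed = parse_swhid(
    "swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad;lines=9-15"
)
print(OBJECT_TYPES[parsed.object_type], parsed.qualifiers)
```

A `rel` or `rev` identifier pins a whole version, while a `cnt` identifier with a `lines` qualifier narrows the reference to a code fragment.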

While DOIs are widely used to cite publications and datasets, software presents different challenges. As Software Heritage co-founder Roberto Di Cosmo and co-authors note:

“[…] we need identifiers that are not only unique and persistent but also intrinsically support integrity.” (Di Cosmo et al., 2018)

This means relying not on a centralized registry, but on cryptographic techniques. SWHIDs meet this need by being computed directly from the content of the digital object, using cryptographically secure hashing algorithms. Anyone with a copy of the object can independently verify the identifier, making SWHIDs uniquely resilient and trustworthy.
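A small sketch shows what “computed directly from the content” means for a single file. To our understanding, version 1 content identifiers use the same SHA-1-based hashing as Git blobs, so the identifier can be recomputed with the standard library alone. The function name is ours.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content, hashed like a Git blob."""
    # Git-style header: object type, content length in bytes, then a NUL byte.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


print(content_swhid(b"hello world\n"))
# swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Anyone holding the same bytes obtains the same identifier, which is what makes verification possible without consulting a registry.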

Research Data Alliance/FORCE11 Software Source Code Identification WG, 2020, https://doi.org/10.15497/RDA00053

Why is software attribution so hard?

Sometimes, the list of individuals involved in a software project is very long, so long that it’s not feasible to include everyone directly in a citation. As a result, a common practice has emerged: attributing authorship to the project team or collective entity. For instance, the software record might list “The Givaro group” as the author.

How should credit be given?

Just like with articles where the writers are the authors, we usually assume the developers of code are software authors. However, like everything else about software, it’s more complicated than it looks. Software projects involve many different roles, so the simple term “author,” even with “contributor” added, definitely doesn’t cover everyone involved. One paper, “Attributing and Referencing (Research) Software: Best Practices and Outlook From Inria,” identifies nine key roles, similar to what the CRediT system does for research articles, based on extensive real-world experience. Other communities have also identified the need for distinguishing different software roles. (See the SORTÆD example for a Software Role Taxonomy and Authorship Definition.)

But is this approach sufficient?

The answer largely depends on the infrastructure behind the citation, specifically, whether it can maintain the relationships between the software and the people who contributed to it. A named group is only meaningful if it can be linked to individual members, their roles, and the context of their contributions.

Here, Software Heritage offers a solid base. While citations may remain concise, the underlying metadata, especially when enriched with structured information like codemeta.json or citation.cff, can hold detailed attribution records. These can include individual contributors, their specific roles, and even ORCID identifiers to ensure long-term traceability.

This layered approach balances concise citation with rich, machine-readable credit, helping infrastructure bridge the gap between recognition and practical constraints.
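As an illustration of such a record, the sketch below writes out a small codemeta.json with a collective author and individually identified contributors. All names and ORCID iDs are hypothetical placeholders. Contributor roles are left out here, since how they are expressed depends on the CodeMeta version in use.

```python
import json

# Hypothetical project metadata; names and ORCID iDs are placeholders.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-solver",
    "version": "1.2.0",
    "author": [
        {"@type": "Organization", "name": "The Example Solver group"},
    ],
    "contributor": [
        {
            "@type": "Person",
            "@id": "https://orcid.org/0000-0000-0000-0000",
            "givenName": "Jane",
            "familyName": "Doe",
        },
        {
            "@type": "Person",
            "@id": "https://orcid.org/0000-0000-0000-0001",
            "givenName": "Richard",
            "familyName": "Roe",
        },
    ],
}

# Serialize in the form that would be committed as codemeta.json.
document = json.dumps(codemeta, indent=2)
print(document)
```

The citation can then name the group, while the file in the repository keeps the link to each person.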

Ultimately, giving credit in software requires more than just a name on a list—it requires context, clarity, and systems that respect the complexity of collaborative work. And that’s exactly what we aim to support through Software Heritage’s evolving citation tools and metadata practices.

Finally, the `@software` type in BibTeX gets long-awaited improvements

The @software entry type in BibTeX has been around for a long time, but it was just a placeholder, treated like the @misc entry. In 2020 it was finally enriched with long-awaited metadata for software — a major milestone for software citation in academic publishing.

Thanks to the biblatex-software package available on CTAN, you can now cite software with much greater precision using four dedicated entry types that reflect different levels of granularity:

@software — for general references to computer software
@softwaremodule — for citing a specific module within a larger software project
@softwareversion — for referencing a particular version of a software
@codefragment — for pinpointing a specific code fragment, such as an algorithm or a key function within a program or library

Here’s an example of how this can be implemented using the biblatex-software package. The sample below (entries 2, 6, and 7) demonstrates how tags distinguish between software projects and specific releases:

[Rp] Reproducing and replicating the OCamlP3l experiment

ReScience C, 6 (1), 2020 https://doi.org/10.5281/zenodo.4041602

The road ahead… and a word of caution

In short, we’ve achieved real progress on several fundamental parts of software citation. Now, you can archive, properly reference, and even directly cite any piece of source code from the universal archive. A standard for describing software is taking shape, and there’s growing experience and expertise in curating this metadata and navigating the nuanced challenge of assigning appropriate credit.

We hope these emerging best practices will be adopted by stakeholders not only in Europe, where the EOSC SIRS report and the FAIRCORE4EOSC project have laid crucial groundwork, but also globally.

Nevertheless, it’s important to acknowledge that the question of “how credit is assigned” remains a highly sensitive and intricate challenge, an issue well documented by the DORA declaration and echoed in earlier reflections from the computer science community.

We must avoid repeating these errors with software. The risks could actually be even greater. Simply relying on citation counts for credit could unintentionally harm how we see the value of different software contributions.

As the EOSC SIRS report wisely cautions:

Metrics should not be reduced to simple numeric indicators, to avoid reproducing in the research software world the negative effect that bibliographic indicators have had in the research publishing world. It is necessary to bring together a broad spectrum of expertise, and include in the conversation representatives of the research community that will be directly impacted by the creation of these metrics.

By working together, we can build a culture that truly values and respects software as a key part of research, ensuring that credit is fair, thoughtful, and reflective of the complexity of collaboration.


Software Heritage and Zenodo integrate to safeguard research

Nicole Martinelli — Wed, 13 Nov 2024 13:59:00 +0000

Forget the image of dusty libraries and yellowing parchments. In today’s digital landscape, the effort to preserve knowledge takes place in server farms and code repositories. Now two digital archives, Zenodo and Software Heritage, are working together with an integration aimed at safeguarding our shared scientific software legacy.

Funded by the EU’s FAIRCORE4EOSC project, these organizations have joined forces to create a seamless pipeline for researchers.

Here’s how it works: Code deposited in Zenodo is automatically archived in Software Heritage, the world’s largest software source code archive. Researchers get a Digital Object Identifier (DOI) for easy citation, while Software Heritage computes a Software Hash Identifier (SWHID), a true code fingerprint that pins down the exact version used or mentioned and so supports reproducibility. All of this takes place behind the scenes: researchers simply deposit code in Zenodo, and the rest is handled automatically.

Zenodo software record archived in Software Heritage, see bottom right corner.

Beyond the basics

Zenodo’s upload form now offers software-specific fields, making it easier to categorize code. Additionally, support for CodeMeta and Citation File Format export formats streamlines citation workflows. Upcoming improvements focus on interoperability, allowing other repositories to join the software preservation movement.

The integration between Zenodo and Software Heritage builds on the 2020 recommendations of the EOSC Scholarly Infrastructures for Research Software report that set out to establish research software as a valuable scholarly output by tackling issues like archiving, referencing, describing, and crediting software artifacts.

The corresponding software record in the Software Heritage archive.

“Software Heritage is taking over the heavy lifting of proactively harvesting and archiving all software source code with its full development history…It’s important that all scholarly repositories, which may be of varying sizes and addressing different institutional or disciplinary needs, properly interface with Software Heritage and offer researchers the additional functionalities they expect, and that research articles reference the archived version of the software.”

Looking ahead

Though the core integration is up and running, further backend improvements are planned over the next six months to ensure seamless interoperability. The integration will also be built into InvenioRDM, making it easier for other repositories to join the Software Heritage network of partners.

This is more than just code archiving. It’s a commitment to the future of research. By ensuring the long-term survival of software, Zenodo and Software Heritage hope to equip researchers to build upon the shoulders of giants – in code form.


More about the partners

It’s easy to say this integration was bound to happen: The name Zenodo comes from Zenodotus, the first librarian of the Library of Alexandria and considered the father of metadata; Software Heritage is often referred to as the “Library of Alexandria” of software. Interconnecting these major platforms is a crucial step forward for open science in general and source code preservation in particular.

Founded in 2016, Software Heritage is a non-profit organization on a mission to safeguard the very foundation of our digital age – source code.

Software Heritage currently houses the world’s largest collection of publicly accessible source code, amassing nearly 18 billion unique source files from over 282 million software projects as of January 2024.

Created to support European Commission-funded research, Zenodo has evolved into a global platform for sharing and preserving research data, software, and other artifacts. Developed by researchers for researchers, Zenodo aims to democratize open science by providing a barrier-free space for all.
