Preserving legacy code with Software Heritage: A tutorial
https://www.softwareheritage.org/2025/08/13/preserving-legacy-code-software-heritage-tutorial/
Wed, 13 Aug 2025 12:08:00 +0000

This tutorial shows how to use a structured approach to prepare your legacy software for preservation in the Software Heritage archive.

This post will walk you through the Software Heritage Acquisition Process (SWHAP), a step-by-step method for properly archiving your legacy source code into the Software Heritage Archive. You can also follow along with the 32-minute YouTube video or use the guide on GitHub prepared by team member Mathilde Fichen. If you’re looking for more help, check out the SWHAP guide or join our mailing list to share information with other rescue and curation teams.

Setting up your local working environment

Let’s get your local workspace set up. First, you’ll use a GitHub template to create a new repository, then clone it to your computer. This creates a local copy, making it easy to manage your files.

Start by creating your own GitHub repository using the provided template. Name it after your software, adding “workbench” to the end (e.g., “my software workbench”) and indicate that it’s a private, temporary workspace. After you create it, you can update the README with details about your software.

Now, let’s create a local copy of this environment. Click the “code” button, copy the SSH link, and then use the git clone command in your Linux terminal to clone the repository to your computer.
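For example, a minimal sketch of those commands (the user name is a placeholder; the repository name matches the example used later in this tutorial):

git clone git@github.com:<your-username>/MySoftware_Workbench.git
cd MySoftware_Workbench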

Uploading raw material

Once your local workbench is set up, the next crucial step is to upload all your initial materials into the raw materials folder. This includes your original source code material, such as scanned paper listings, compressed files, or any initial digital versions. It’s also vital to upload any relevant documentation that explains the source of the code, such as emails from the historical author, provided the author consents.

“Zenith Z-19 Terminal” by ajmexico is licensed under CC BY 2.0

Next, you’ll upload the machine-readable version of your source code into the source code folder. If your code is in a non-digital format (like on paper), you’ll need to transcribe it first.

For better organization, especially if your software has multiple files, it’s a good idea to create subfolders. Just be sure to use the correct file extensions for your programming language (e.g., .for for Fortran or .py for Python).

To wrap things up, you’ll need to fill out the metadata folder. This folder contains several important elements that you should complete as thoroughly as possible:

  • Catalog: This file references the initial elements you uploaded into the raw materials folder. You should include details like the item’s name (e.g., “listing from 1971”), its origin (e.g., “author’s personal archives”), where the original is stored, the author’s name, approximate dates, and who collected it, along with any relevant descriptions or notes.
  • License: If you know the software’s license, fill it in. For private code that you own, you can specify any license you wish. If there’s no license, but you have explicit permission to archive and use the code (for academic or educational purposes, for example), be sure to state that.
  • version_history.csv: This CSV file is designed to register data for each version of your software. It’s useful for automating the reconstruction of your software’s development history if you have multiple versions. Remember to fill in details such as the directory where each version is stored, author names and emails, creation dates, release tags (official version numbers if available), and a commit message for each version.
  • codemeta.json: This file, in JSON format, is not meant for human reading but is crucial for search engines to easily find and identify your code and its linked data once archived. While you can update your codemeta.json file manually, we recommend using the CodeMeta generator website, which lets you enter your software data in a user-friendly interface and then generates the JSON snippet to paste into your codemeta.json file.
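As an illustration, here is a minimal sketch of what a codemeta.json file can contain; the field values below are hypothetical, and the CodeMeta generator will produce a more complete version for you:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "MySoftware",
  "author": [{"@type": "Person", "givenName": "Ada", "familyName": "Lovelace"}],
  "dateCreated": "1972-05-01",
  "programmingLanguage": "Fortran"
}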

Syncing with GitHub

Once you’ve added all your materials and metadata locally, the next step is to synchronize these changes with your online GitHub repository. Navigate to your workbench directory in your command line; from there, you’ll use the git add, git commit, and git push commands for the raw materials, source code, and metadata folders, so that all your local work is backed up on GitHub.

You’ll do this in three main parts:

Raw materials:

  • Add your raw materials: git add raw_materials
  • Commit these changes: git commit -m "Your commit message here"
  • Push the changes to GitHub: git push

Source code:

  • Add your source code: git add source_code
  • Commit these changes: git commit -m "Your commit message here"
  • Push the changes to GitHub: git push

Metadata:

  • Add your metadata: git add metadata
  • Commit these changes: git commit -m "Your commit message here"
  • Push the changes to GitHub: git push

Finally, check your GitHub repository to confirm that all your documents, like your raw materials, are visible. With that, you’ve now completed the first major step of getting your initial materials uploaded and synced to your workbench.

Reconstructing development history

This is a crucial phase, especially if your software has multiple versions. Your goal is to rebuild the development timeline of your source code on a new, dedicated GitHub branch.

1. Create an orphan branch: From your workbench, you first create a new orphan branch called SourceCode. This branch is completely detached and doesn’t carry any previous commit history from your master branch.

2. Clean the branch: After creating the SourceCode branch, you’ll clear out any existing files within it by running git rm -r . and then committing the change. This prepares the branch for you to add each version of your source code one by one.
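In command form, steps 1 and 2 look roughly like this (a sketch, using the SourceCode branch name from this guide):

git checkout --orphan SourceCode   # new branch with no commit history
git rm -r .                        # clear the files inherited from master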

3. Copy and commit versions: Next, copy and paste the first version of your software’s source code into this new branch.

Copy the source contents into our branch:

git checkout master -- source_code/v1/*   # fetch version 1 from the master branch
mv source_code/v1/* .                     # move its files to the branch root
rm -rf source_code                        # remove the now-empty source_code folder

Then use the following template to manually create an individual commit/release:

export GIT_COMMITTER_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_COMMITTER_NAME="Committer Name"
export GIT_COMMITTER_EMAIL="email@address"
export GIT_AUTHOR_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_AUTHOR_NAME="Author Name"
export GIT_AUTHOR_EMAIL="email@address"
git add -A
git commit -m "Commit Message Here"

Mind the metadata

When you import source code and commit it, Git will, by default, use your current user information and the present date. This means you would appear as both the committer and the author of the code, and the timestamp would be today’s date—not the historical date from when the code was originally created.

That’s not what we want. To get the commit history right—so it shows the code’s real origin—you have to change the commit’s metadata manually. The template in the guide allows you to explicitly set the author, committer, and dates for the commit, preserving the historical information of the source code. Finally, add a Git tag (for example, v1) to mark this as an official version.

export GIT_COMMITTER_DATE="2024-05-01 00:00:00"
export GIT_COMMITTER_NAME="Math Fichen"
export GIT_COMMITTER_EMAIL="mathfichen@monadresse.com"
export GIT_AUTHOR_DATE="1972-05-01 00:00:00"
export GIT_AUTHOR_NAME="Colmerauer et al."
export GIT_AUTHOR_EMAIL="<>"
git add -A
git commit -m "V1 of MySoftware"
git tag v1   # mark this commit as an official version

4. Repeat for subsequent versions: If you have multiple versions, repeat the process. You’ll clean the repository again, copy the next version of the source code, and commit it with its respective historical metadata and a new tag (e.g., “v2”).

5. Push the branch: Finally, you’ll push this new SourceCode branch (with its reconstructed history) to your GitHub repository.

Pro-tip: automate the process

If you have many software versions, you can automate the process of updating the commit metadata with a small tool called DT2SG. That way, you can use the data you entered in the version_history.csv file to apply the correct historical metadata automatically.

Run the following command:

dotnet ./DT2SG/DT2SG_app.dll -r mathfichen/MySoftware_Workbench/source_code/ -m mathfichen/MySoftware_Workbench/metadata/version_history.csv

Creating the final public repo

Once the development history is reconstructed in your workbench, you’re ready to create the final public repository on GitHub. This is the repository that will be shared and ultimately archived by Software Heritage.

Go to GitHub and create a new repository. Name your repository after your software and make it public so Software Heritage can harvest it.

Copy the URL of this new, public repository.

Using specific Git commands in your Linux command line, you will transfer all the work you’ve done in your private “workbench” repository into this new public repository. This essentially pushes all branches and their content (master branch with raw materials and metadata, and the source code branch with its development history) to the public repository.
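Concretely, that transfer might look like the following sketch; the remote name and the public repository URL are illustrative:

git remote add public git@github.com:<your-username>/MySoftware.git
git push public master       # raw materials and metadata
git push public SourceCode   # the reconstructed development history
git push public --tags       # version tags (v1, v2, ...)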

As a final touch, it’s a good idea to add topics to your GitHub repository, such as software-heritage, legacy-code, archive, and swhap. This makes the repository easier to find when people search.

Triggering Software Heritage archival

The last step is to trigger the Software Heritage acquisition process itself.

  • Navigate to the Software Heritage “Save Code Now” page.
  • Enter the URL of your final, public GitHub repository into the designated section.
  • Submit the URL. Software Heritage will then process and archive your code. After a few minutes, you should be able to search for your software on the Software Heritage archive and find it archived. (Prefer the command line? See the sketch after this list.)
  • As a final touch, you can generate “badges” for your archived software. This generates a code snippet (typically Markdown) that you can copy into your public GitHub repository’s README, displaying a badge confirming your software’s successful archival in Software Heritage.
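For command-line users, the archive also exposes a Save Code Now API; the sketch below assumes the documented endpoint on archive.softwareheritage.org and uses a hypothetical repository URL:

curl -X POST "https://archive.softwareheritage.org/api/1/origin/save/git/url/https://github.com/<your-username>/MySoftware/"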

And just like that, your legacy software is preserved in the Software Heritage archive.

How to preserve legacy code with Software Heritage
https://www.softwareheritage.org/2025/08/06/how-to-preserve-legacy-code-software-heritage/
Wed, 06 Aug 2025 14:44:00 +0000

Code is history. Discover SWHAP, the process designed by Software Heritage to preserve legacy software.

The Software Heritage Acquisition Process, or SWHAP, is a method developed by the Software Heritage team and its partners for saving and archiving older source code. This post and the companion 10-minute YouTube video offer an overview of what SWHAP is all about.

Understanding Software Heritage

First, a quick refresher on Software Heritage. It’s a non-profit dedicated to building a universal, open archive of source code. Usually, Software Heritage works by automatically collecting and saving public code already hosted on forges – online platforms like GitHub or GitLab where devs keep their projects. This automated system is massive, having already archived 400 million projects and over 25 billion unique source code files to date.

The challenge with legacy code

But what if your code isn’t sitting on one of these public platforms? That’s the core issue with “legacy” source code. This is code that isn’t easily accessible online – maybe it’s printed on paper, stuck on an old floppy disk, or just living on your hard drive. Getting this kind of code properly archived for the future is where things get complicated.

Why preserving source code matters

You might wonder why we bother preserving old, seemingly outdated code. Beyond its immediate function, source code is an invaluable record of technological history and human ingenuity. It safeguards intellectual heritage, allowing future generations to learn from past solutions and understand the evolution of software that underpins our world. Preserving these digital artifacts provides crucial context for researchers, historians, and developers to trace ideas and comprehend the thought processes behind their creation.

Margaret Hamilton standing beside the Apollo Guidance Computer (AGC) source code, now archived at Software Heritage.

“Programs must be written for people to read, and only incidentally for machines to execute.”
― Harold Abelson, Structure and Interpretation of Computer Programs

This perspective highlights code’s role as a human document, not just machine instructions. If you have valuable source code you want to preserve but aren’t sure how, the SWHAP process is designed to help.

How to SWHAP: The basics

The SWHAP process involves two primary steps:

  1. Get your legacy source code onto a forge. In most cases, GitHub is the preferred platform, simply because it’s so widely used.
  2. Once it’s on GitHub, we can then trigger Software Heritage’s automated system. This ensures your code is securely pulled into the Software Heritage Archive.

What you’ll need

SWHAP requires a few specific tools and some prep work. First, if your code isn’t already in a digital format – say, if it’s a printout – you’ll need to transcribe it into an electronic file.
After that, you’ll need:

  • A GitHub account
  • A Linux command-line interface
  • Git installed on your computer
  • A secure SSH key configured for your GitHub account
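To check that the last two requirements are in place, you can run something like this in your terminal:

git --version           # confirms Git is installed
ssh -T git@github.com   # confirms GitHub accepts your SSH key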

If these technical requirements seem daunting, our detailed SWHAP guide provides comprehensive setup assistance. You can also join our mailing list to share information with other rescue and curation teams.

The pitfalls of GitHub uploads

You might be tempted to skip these steps and just manually upload your code directly to GitHub. But that approach can cause significant problems. Here’s a real example of what SWHAP aims to prevent: a public GitHub repository for C-Prolog. While this is historically important code, an early Prolog interpreter from 1982, a glance at the repository page reveals that a GitHub user uploaded it in 2017.

A casual visitor might assume the code is much newer than it is, and that the GitHub user, not the actual creator, wrote it. Worse, if you try to verify the code’s accuracy or origin, the only information is that it was “found somewhere on the net.” That offers no way to confirm its true source or authenticity. This is why SWHAP matters: it makes sure your code lands on GitHub with the correct history and vital information, preventing misunderstandings for anyone looking at it in the future.

Setting up your GitHub repository for SWHAP

Before diving into the precise steps, let’s go over what your GitHub repository should look like for SWHAP.

In this example, the code is called “MySoftware,” and the repository bears that name. It has two main sections, or branches:

  1. The master branch: This holds all the initial information for preservation, including metadata and the code’s origin. Typically, there are three key folders:
    • raw materials: For any original documents related to the code you’re preserving (e.g., a scanned paper listing).
    • source code: This is where the machine-readable version of your code goes.
    • metadata: As the name suggests, this folder holds all the descriptive information about your software.
  2. The source code branch: This becomes crucial if your software has multiple versions. For instance, if you have 10 different iterations, a future user might not want to sift through each one. However, seeing the code’s development over time is still very valuable. In this branch, we’ll recreate the software’s development timeline, adding each version sequentially using Git’s commit feature. This provides a practical way for anyone viewing the repository in the future to track how the source code evolved.
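Putting it together, the repository layout looks roughly like this (a sketch; the folder and branch names follow the SWHAP template used in the step-by-step tutorial):

MySoftware (public repository)
├── master branch
│   ├── raw_materials/
│   ├── source_code/
│   └── metadata/
└── SourceCode branch (one commit and tag per historical version)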

That’s it for the overview. Check out part two, which has a more detailed, step-by-step explanation of the SWHAP process.

CodemetaR Author streamlines software metadata updates
https://www.softwareheritage.org/2025/06/18/codemetar-author-streamlines-software-metadata-updates/
Wed, 18 Jun 2025 14:48:00 +0000

This tutorial covers using the codemetar R package to generate universal Codemeta JSON files.

By Frédéric Santos, Software Heritage Ambassador

Think of software metadata as the essential ID card for any software package. It’s that structured information that tells you all about the program: its name, version, who made it, what license it uses, what other software it needs to run, and other important details. For R packages specifically, this means information about what the package actually does, which other R packages it depends on, the minimum R version it requires, and even its intended purpose. 

To cut through the tangle of diverse, language-specific metadata formats, the Codemeta project stepped in. Its mission is to standardize and improve the sharing of software metadata, offering a universal schema (built on a JSON-like syntax) to describe software. This common approach helps turn metadata into a format that can easily “cross-work” between many different programming languages, making software easier to discover, cite, and reproduce.

This post offers an overview of the 14-minute video tutorial, highlighting how the codemetar R package simplifies generating and updating this crucial Codemeta metadata for your R packages.

Why bother with organized software metadata?

Keeping your software metadata organized and fresh isn’t just a nice-to-have; it’s critical for several reasons. If you’re a researcher, for instance, you know how vital it is for your analyses to be reproducible by others. Accurate software metadata is key here, since it specifies the exact software versions and dependencies that people need to replicate your work precisely. It also tells authors how to properly cite the software they’ve used in their articles, which is very important. But beyond just reproducibility and citation, well-structured metadata acts like a beacon for discoverability. Search engines can find your software more easily, linking it to relevant keywords. Even popular software repositories like GitHub would be a nightmare to navigate without good metadata. So, truly, any software developer – and R developers are definitely included – should pay close attention to how they specify metadata in their packages.

The R-specific way and its bottleneck

When you build an R package, the standard, built-in way to define all this necessary metadata is through the DESCRIPTION file. This file is where you’ll find essential fields like Title, Description, Author, License, and Imports. These details are crucial for users to grasp the package’s purpose and requirements, and they help external tools or repositories reference your package better. You can even generate a DESCRIPTION file template using R’s built-in functions or the devtools package. Sometimes, you might also find richer, though less structured, info in a README file.

But here’s the catch: the DESCRIPTION file has a very specific R-centric format or syntax. This becomes a real headache because other programming languages have their own unique ways of handling package metadata. Take a Julia package, for example; it might use a Project.toml file. Or an Emacs Lisp package, where metadata lives right in the header of its main file. The syntax and structure of these files are vastly different from R’s DESCRIPTION file. This language-specific approach creates a big problem for automatically collecting software metadata across different ecosystems, as each language demands a specific process for data extraction. This challenge directly impacts search engines, software archives, and repositories that aim to gather and organize information about a wide variety of software projects.

Enter Codemeta: The Universal Translator for Software Metadata

To tackle this tangled mess of different metadata formats, the Codemeta project stepped in. Its main goal? To offer a single, universal format for describing software that isn’t tied to any specific programming language. Codemeta achieves this by using a common schema based on a JSON-like syntax. By creating a codemeta.json file, developers can transform their software metadata into a format that can easily “cross-work” across many different programming languages. This standardized approach is a game-changer for improving software discoverability, reproducibility, and citation.

You’d typically place a codemeta.json file at the very root of your R package directory, right alongside your DESCRIPTION file. Since it’s not part of the standard R package build process and doesn’t inherently belong to the R universe, you can just add it to your .Rbuildignore list. This ensures it gets skipped when your package is built, preventing any issues. While the JSON format used by codemeta.json files is powerful, it can be quite heavy, challenging for humans to read, and even harder to type manually.
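If you add the entry by hand, note that .Rbuildignore lines are regular expressions; a one-line sketch:

echo '^codemeta\.json$' >> .Rbuildignore   # tell R to skip this file when building the package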

The hassle of manual codemeta.json management

Given how complex JSON can be, trying to manually create or update codemeta.json files is, frankly, impractical. Sure, there are some automatic generation options, like websites provided by the Codemeta project team, where you fill in fields to get your JSON file. There’s even a newer version (a fork) of the Codemeta generator that can auto-fill fields if you just give it a GitHub or GitLab URL. These are nice, but still require you to manually update the codemeta.json file every time you release a new version of your R package. This opens the door to forgetting to update the file, leaving you with outdated codemeta.json information. Clearly, there’s a need for a more integrated, automated solution, especially for developers who frequently update their packages.

Enter codemetar: Your R Package metadata sidekick

This is precisely where the codemetar package steps in. The codemetar package is designed to help R developers effortlessly generate, parse, modify, and update codemeta.json files for their R packages. It’s readily available on CRAN, so you can install it with the usual install.packages("codemetar") command.

Metadata from the DESCRIPTION file automatically populates this codemeta.json file.

The real magic of codemetar lies in its ability to automatically pull relevant information from your R package’s DESCRIPTION file and seamlessly convert it into the Codemeta JSON format. A simple function call like write_codemeta() is all it takes to extract all that useful metadata from your DESCRIPTION file and pop it into a new codemeta.json file. What’s more, codemetar can even dig into your README file to extract extra metadata, like badges showing continuous integration services or minimum R version requirements. And here’s a neat little trick: codemetar automatically adds the generated codemeta.json file to your .Rbuildignore file, ensuring it never messes with your R package build process.
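In practice, generating the file can be as simple as the following sketch, run from the root of your package directory (it assumes codemetar is already installed):

Rscript -e 'codemetar::write_codemeta()'   # reads DESCRIPTION and writes codemeta.json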

Painless updates and CRAN Integration

A big benefit of using codemetar is how it simplifies keeping your package’s metadata fresh throughout its entire development lifecycle. This is important when you’re pushing out new versions, bringing in new contributors, or even changing where your development is hosted (like moving from GitHub to GitLab). And for those of you already using devtools::release() to push your packages to CRAN, get ready for some truly seamless integration. When this function runs to submit a new package version, it automatically checks if your codemeta.json file is up-to-date and matches the metadata in your DESCRIPTION file. If there’s a mismatch, devtools::release() will give you a warning. This effectively acts as a safety net, ensuring you’ll never forget to update your codemeta.json file when you’re submitting to CRAN.

In a nutshell, the codemetar R package streamlines how R developers manage their package metadata. By offering an easy, automated way to create and maintain codemeta.json files, codemetar not only boosts the discoverability and citation of R packages but also helps ensure their reproducibility by standardizing metadata in a universally understood format.

About Frédéric Santos

Frédéric Santos is a data analyst at CNRS, the French National Centre for Scientific Research. His programming expertise spans R, Julia, Bash, and Emacs Lisp. As an ambassador, his fields of expertise include machine learning, notebooks, and reproducible research. You can explore his projects on GitLab and GitHub as well as his website. He’s been a Software Heritage Ambassador since 2023.

Using the SoftWare Hash Identifier (SWHID): A tutorial
https://www.softwareheritage.org/2025/06/13/software-hash-identifier-swhid-tutorial/
Fri, 13 Jun 2025 09:25:00 +0000

For research support staff: Take this quick tour to learn what to explain to end-users about software identification. We'll cover the unique challenges of software identification and why DOIs aren't always accurate.

Software identification is crucial for ensuring the long-term traceability of scholarly outputs. However, identifying software can be complex, resembling an investigation requiring tailored solutions. The Software Hash Identifier (SWHID) is an intrinsic identifier designed for software, acting like a unique fingerprint or DNA sequence intrinsically bound to the software’s content. It complements extrinsic identifiers like DOIs, which typically identify metadata records or broader projects. The SWHID provides actionable solutions for researchers, repository managers, and others involved in the scholarly ecosystem.

This tutorial provides a guide for research support staff, designed to answer the question: “What does an end-user from my institution need to understand about software identification?” 

We’ll explain why common identifiers like DOIs aren’t always sufficient for software, highlighting the specific concerns of unique software identification. Most importantly, we’ll introduce a straightforward, “plug-and-play” solution that your community can use, emphasizing the crucial role you’ll play in helping them implement it. This post derives from a two-hour live session by the Software Heritage Open Science team, Morane Gruenpeter and Sabrina Granger, as part of the FAIR implementation workshops. The slides are also available.

Understand what SWHID Identifies

SWHIDs are used to identify specific software artifacts at different levels of granularity. They identify the source code content itself, rather than the project or its metadata. The different types of objects identifiable by a SWHID include:

  • CNT (Content): Identifies the content of a single file.
  • DIR (Directory): Identifies a directory, including its contents and the names of the files within it. This SWHID type is recommended for academic use – it’s self-contained and doesn’t depend on external services like Software Heritage to work.
  • REV (Revision): Identifies a commit in a development history sequence.
  • REL (Release): Identifies a tagged release, similar to a revision but specifically marked as a release.
  • SNP (Snapshot): Identifies a point in time, recording all entry points (like branches and releases) found in a software origin and where they pointed at that time.

These intrinsic identifiers correspond to granularity levels from the bottom of the software identification pyramid (Level 10: Code Fragment, Level 9: File, Level 8: Directory, Level 7: Commit, Level 6: Release, Level 5: Snapshot), where the number of items increases as you go down the pyramid.

How to generate a SWHID

A key feature of SWHID is that any end-user can generate one. You do not need an account on Software Heritage or need to be the software author. SWHIDs are free. For digital resources that are frequently created or modified, especially in large volumes, charging a per-identifier fee just doesn’t work.

  • You can find the SWHID for software artifacts already archived in Software Heritage in the permalinks box on the artifact’s page.
  • You can also compute a SWHID locally on your own machine using a command-line tool. For the same content, the SWHID computed locally will be the same as the one computed by Software Heritage, as long as the computational method (schema version) is the same.
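For the local route, one option is the swh identify command from the swh.model Python package; a minimal sketch (the path is hypothetical):

pip install 'swh.model[cli]'
swh identify path/to/my-software/   # prints the SWHID of the directory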

Deconstruct the SWHID structure

A SWHID is a structured identifier with several parts:

  • Prefix: Always starts with swh.
  • Schema Version: Indicates the hash computation method used (currently 1 for SHA-1). This can evolve if needed, with older hashes remaining valid.
  • Object Type: Indicates the type of software artifact being identified (cnt, dir, rev, rel, or snp).
  • Hash: The hash value computed for the specific content or object.
  • Context Parameters (Optional): Provide additional information about where or when the artifact was found or its position within a larger structure. These parameters can include:
    • Origin: The URL from which the software originated (e.g., a GitHub or GitLab repository). This parameter differentiates SWHIDs for identical content found in different locations.
    • Visit: For artifacts lower in the graph (Content, Directory, Revision, Release), this refers to the snapshot in which the artifact was seen.
    • Anchor: For artifacts lower than a snapshot, this is a Revision item from the graph that provides a specific point of reference.
    • Path: The path to the artifact within a directory or revision.
    • Lines: For content fragments, specifies the lines of code being identified.

Context parameters explain variations in seemingly identical SWHIDs: the core content hash is the same, but the context (e.g., path, origin) differs.
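Put together, a directory SWHID, with and without an origin context parameter, looks like this (the hash value and repository URL shown are hypothetical):

swh:1:dir:f1b9f0a8c2d34e56a7b8c9d0e1f2a3b4c5d6e7f8
swh:1:dir:f1b9f0a8c2d34e56a7b8c9d0e1f2a3b4c5d6e7f8;origin=https://github.com/<user>/MySoftware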

How to use SWHIDs

SWHIDs have several important use cases, primarily related to referencing, reproducibility, and citation of software source code:

  • Referencing specific code: SWHIDs allow you to point directly to specific versions or parts of software code (files, directories, revisions, etc.). This is different from DOIs, which often point to a metadata record about the software.
  • Ensuring reproducibility: Because SWHIDs are based on the intrinsic content, they enable reproducibility. If you have the SWHID, you can potentially regenerate or verify the exact content it refers to, even if the original infrastructure where it was found is no longer available.
  • Citing software: SWHIDs are designed to be used in software citations. The recommended way to facilitate this is to include metadata files like codemeta.json or citation.cff alongside your code. Software Heritage can use these files to generate a citation that includes the SWHID of the corresponding artifact (e.g., the directory SWHID is often recommended for academia).
  • IMPORTANT CITATION RULE: Never include the SWHID itself within the source code files. Adding the SWHID changes the file contents, resulting in a new SWHID for the changed file, which breaks the link to the original content. Instead, include metadata files that allow platforms to generate citations, including the SWHID.
  • Resolving SWHIDs: SWHIDs can be resolved to access the corresponding software artifact, for example, on the Software Heritage archive (softwareheritage.org) or its operational mirror networks.

What the SWHID is not for

  • Data Sets: SWHIDs are designed specifically for software source code. While data might be stored alongside code in repositories and thus archived by Software Heritage, SWHIDs are not the recommended identifier for data sets. Other identifier types are more appropriate for data.
  • AI-Generated Code: Currently, SWHIDs cannot distinguish code generated by AI tools from human-generated code, nor do they provide functionality to specifically track the origin of AI-generated code.

By understanding these steps, you can leverage SWHIDs for robust and reproducible identification, referencing, and citation of software artifacts.

A toolbox

For further info:
https://www.softwareheritage.org/faq/#3_Referencing_and_identification
https://www.softwareheritage.org/how-to-archive-reference-code
https://www.softwareheritage.org/software-hash-identifier-swhid
https://www.swhid.org
