research Archives - Software Heritage
https://www.softwareheritage.org/tag/research/
Wed, 20 Aug 2025 10:36:35 +0000

Episciences links article code through Software Heritage
https://www.softwareheritage.org/2025/08/21/episciences-links-article-code-software-heritage/
Thu, 21 Aug 2025 14:35:00 +0000

Episciences enables linking publications to source code archived in Software Heritage, enhancing research reproducibility.

The post Episciences links article code through Software Heritage appeared first on Software Heritage.

Software Heritage, the universal source code archive, preserves and provides access to source code as vital digital heritage. Researchers can directly link their published scholarly articles to the software that powers them. This new capability enhances research reproducibility and transparency by connecting findings to specific software versions.

Software Heritage partnered with the Center for Direct Scientific Communication (CCSD) to make this happen. Their collaboration first enabled software deposits on the open archive HAL in 2018, laying the groundwork for this new capability. Episciences, an overlay journal platform that hosts articles from open repositories like arXiv, Zenodo, and bioRxiv, now builds on that work: authors and journals using Episciences can link their articles to supplementary software archived in Software Heritage, using a SoftWare Hash IDentifier (SWHID) or a HAL-ID.
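For readers unfamiliar with SWHIDs, here is a minimal Python sketch of how such an identifier can be validated before it is attached to an article. It assumes the core syntax `swh:1:<type>:<40 hex digits>`, optionally followed by `;key=value` qualifiers. The function name `parse_swhid` and the sample identifier are illustrative only; this is not Episciences' or Software Heritage's own code.

```python
import re
from typing import Dict, NamedTuple

# Core SWHID syntax: scheme, version 1, object type, 40 hex digits.
# Object types: snapshot, release, revision, directory, content.
_CORE_RE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")


class Swhid(NamedTuple):
    object_type: str
    object_id: str
    qualifiers: Dict[str, str]


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core parts and its optional qualifiers.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    core, *raw_qualifiers = text.strip().split(";")
    match = _CORE_RE.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")

    qualifiers: Dict[str, str] = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value

    return Swhid(match.group(1), match.group(2), qualifiers)


if __name__ == "__main__":
    # Placeholder hash, not a real archived object.
    sample = "swh:1:dir:" + "ab" * 20 + ";origin=https://example.org/repo"
    print(parse_swhid(sample))
```

Checking only the syntax keeps the sketch self-contained. Confirming that the object actually exists in the archive would require a separate lookup, which is left out here.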

There are three basic steps:

  • Submit software to HAL

Depositing software via HAL ensures its sustainable archiving in Software Heritage. The complete deposit procedure is detailed in the HAL documentation: Deposit software source code.

Building on this ability to link articles and software, Episciences actively works to meet the evolving needs of researchers. Episciences is emerging as a new model in academic publishing, improving the visibility and accessibility of research articles that have already been peer-reviewed and published in conference proceedings. Instead of building a new library (a traditional journal), an overlay journal acts as a knowledgeable curator who goes through existing open shelves (repositories), selects the best books, writes introductions for them, and creates a guide (the journal) pointing readers to those excellent, freely available books. This approach allows researchers to submit their conference papers to Episciences for additional scrutiny and broader dissemination, potentially increasing the impact and reach of their work.

“One fundamental aspect of the openness of science is the close link between scientific publications and associated research data. This link is essential for the transparency, reproducibility, and the overall progress of science. Episciences responds to this dynamic by inviting authors to supplement the submission of their document with a link to the dataset and/or software used in their work,” says Agnès Magron of CCSD.

Beyond enabling these connections, Episciences actively contributes to the wider open science movement. The API and connector for Episciences were developed as part of the European Union-funded FAIRCORE4EOSC project. Episciences is also a member of the SCOSS Family. This commitment underscores why enabling this link via Episciences and the Software Heritage integration with HAL is paramount for research reproducibility, transparency, and accountability. 

The next step is leveraging the COAR Notify protocol (developed by the Confederation of Open Access Repositories, COAR) to share links between different research object types. 

The established partnership with CCSD, the successful HAL integration, and its utilization by Episciences provide a practical way for researchers to ensure their essential software is archived by Software Heritage and discoverable with their publications. Researchers and journals using platforms integrated with open repositories like HAL are encouraged to leverage this capability to link their software to their scholarly articles using Software Heritage. The partnership is about building a clearer, more open scientific story, where the findings and the code that powers them are part of the same picture.

What’s next in research for Software Heritage
https://www.softwareheritage.org/2025/04/09/research-software-heritage-strategy-2025/
Wed, 09 Apr 2025 08:02:00 +0000

Stefano Zacchiroli discusses security, reproducibility, and data analysis in his new Chief Scientific Officer role.

The post What’s next in research for Software Heritage appeared first on Software Heritage.

After nearly a decade of Software Heritage, changes are afoot. Co-founder Stefano Zacchiroli is shifting focus to become the Chief Scientific Officer (CSO), while Thomas Aynaud joins the team as the new Chief Technical Officer, taking over Zacchiroli’s previous responsibilities.

The move to CSO allows Zacchiroli to focus on his research interests: digital commons, open-source software engineering, computer security, and the software supply chain. He’s a full professor of computer science at Télécom Paris, Polytechnic Institute of Paris. A Debian developer since 2001, he served as Debian project leader from 2010 to 2013.

In this interview, he talks about getting back to research full-time, how Software Heritage helps make open-source more secure for everyone, and why keeping the hobbyist ethos alive is important.

Stefano Zacchiroli. © Inria / Photo B. Fourrier

You’re a co-founder of Software Heritage. What was the pivot point for changing roles?

In the beginning, we were a fairly small team, so we had to distribute the roles among ourselves. I was technically inclined, since I’d been doing free software technical work for decades, so I picked up the CTO role. I was happy to lay the technical foundations of Software Heritage, which are still fundamental for the archive today. But my real life is as a researcher – I’m a computer science professor, and I’d been doing research for most of my career. Over the past 10 years, all my research has been built upon what Software Heritage enables: very large-scale empirical analyses of the software commons. Eventually, my vocation pulled me in that direction more and more, and I was delighted to find someone to step into the CTO role so I can go back to full-time research work.

What’s a Chief Scientific Officer (CSO)?

Doing research and enabling research has been part of the mission of Software Heritage from the very beginning. We have a lot of research work conducted by team members themselves (like me), members of the technical team helping researchers conduct their work, as well as outside researchers from universities and research labs mining the Software Heritage archive. The CSO role is about coordinating all of that. It doesn’t mean running every possible research project: we want people to be able to do research on Software Heritage on their own. But it does mean keeping an eye on how we interact with researchers, to make their work easier.


What are you most excited about in your new role? 

There’s a lot of research going on around Software Heritage, from all kinds of angles. I’m the principal investigator for many of these projects, especially around security use cases. On that front, the question is: how can we effectively leverage all the knowledge collected by Software Heritage, and use it so that free software developers can create more secure software for all of us? The same general principle of efficiently leveraging Software Heritage knowledge to enable the public good can be applied to other fields. I’m actively pursuing two of them: first, how Software Heritage can enable reproducible research when it comes to software, and second, how it can enable build reproducibility of open-source software. Build reproducibility lets people using open-source software check where it comes from and have a solid way to trust the link between the code and what’s running on their devices.

Finally, another angle I’ve been working on extensively is the human aspect of software engineering, specifically the global collaboration involved in creating open-source software. We’ve previously studied diversity in terms of gender and geographic origin. Software Heritage offers a unique perspective on how these trends evolve worldwide.

Software Heritage recently held a partner kickoff for CodeCommons. What’s your involvement in that project?

So as you can imagine, a lot of people have been interested in trying to use the Software Heritage archive for large language models (LLMs) for code. We’re not providers of models, but we want to understand our role in helping create ethical datasets, where ethical means, first and foremost, keeping track of where the code comes from when it is used to train LLMs. We want people to be able to know if their code has been used in LLM training. We want to be able to produce open datasets for that, and we want to provide all the relevant information that enables LLM producers to respect the licenses of the code used in their models and to know the provenance of that code.

What are you most looking forward to leaving behind?

The technical work to maintain and develop the Software Heritage archive has really changed in scale over time. We have a larger code database, of course, because that’s ever-growing, and we have a larger team. We have about a dozen engineers now. We have changed technologies, so we’re using more large-scale technologies than we were using in the very beginning. I’m very glad to hand that off to the capable hands of Thomas Aynaud, who’s far more skilled than I am on that front and is taking over the leadership of that team, too.

How will this transition impact the scientific direction of SWH, and what changes can we expect short and long term?

Software Heritage as an initiative is primarily an enabler of many use cases, and large-scale research on the software commons is one of them. I don’t expect the focus of Software Heritage to change as part of my role change. What will change is that Software Heritage will support more independent research teams doing this kind of large-scale research. I expect more collaborations to materialize and more researchers to use Software Heritage data, leading to more research results, both directly and indirectly.

Do you see your contributions changing much, or is it a question of quantity or focus? 

No, it’s more about peace of mind. I’ve been primarily doing research for years. This is simply aligning my title with my actual work, not changing what I’m doing. Research used to be a side activity of Software Heritage, not part of our main strategy. Now, it’s a key focus and we’re making it visible.

Follow-up question from the recent Symposium: You mentioned that SWHSec empowers anyone, not just large corporations, to contribute to software security. 

SWHSec is a large and diverse research project, with many teams looking at how Software Heritage can help secure open source for everyone. My presentation was about one specific R&D task: figuring out if a specific software version in the archive is vulnerable. This isn’t a new problem, and there are industry solutions for it. But Software Heritage has something unique: deep source code visibility. We can track vulnerabilities down to individual commits and see if even a random fork on a little-known development forge is affected by a specific vulnerability or not. This is a kind of hidden problem that industry usually ignores. For example, if you forked a project two years ago when it was vulnerable, and never merged back a fix, we can tell you it’s still vulnerable. Sure, there’s not a lot of money in this, but we can give developers data and tools to fix these issues. We’ve found real vulnerabilities that maintainers didn’t know about, and helped fix them. Since we can do this, it’s part of our social responsibility to do so.
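The fork scenario described above can be sketched as a reachability check on the commit graph. The following Python example is a deliberately simplified illustration, not the SWHSec method. It assumes the vulnerability is tied to one known introducing commit and one known fixing commit, and that a fix only counts if the fixing commit itself appears in the history (so cherry-picked or rewritten fixes are missed). The commit names and the `parents` mapping are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], head: str) -> Set[str]:
    """Return every commit reachable from `head`, including `head` itself."""
    seen = {head}
    queue = deque([head])
    while queue:
        commit = queue.popleft()
        for parent in parents.get(commit, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_vulnerable(
    parents: Dict[str, List[str]],
    head: str,
    introduced_by: str,
    fixed_by: str,
) -> bool:
    """A head is flagged if its history has the bug but not the fix."""
    history = ancestors(parents, head)
    return introduced_by in history and fixed_by not in history


if __name__ == "__main__":
    # Toy history: root <- bug <- c1 <- fix <- main_head
    #                              \
    #                               fork_head (forked before the fix)
    parents = {
        "bug": ["root"],
        "c1": ["bug"],
        "fix": ["c1"],
        "main_head": ["fix"],
        "fork_head": ["c1"],
    }
    print(is_vulnerable(parents, "main_head", "bug", "fix"))
    print(is_vulnerable(parents, "fork_head", "bug", "fix"))
```

In this toy history the fork branched off after the bug was introduced and never merged the fix, so it is flagged while the main branch is not.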

How do you see Software Heritage evolving in terms of its impact on open-source sustainability?

Open-source sustainability is a huge topic, and people are trying all sorts of things to improve the status quo. A common approach is to convince maintainers to become independent businesses, so they get paid for their work. Another is to look at how civil servants build software for governments. Software Heritage helps by showing all the ways people contribute to keeping open source going, and how funding works. For instance, we’ve recently used Software Heritage to show that researchers are really important for maintaining data and machine learning open-source software, which is super important these days. The role of civil servants and researchers is often overlooked in sustainability discussions, but you can see it in our data – by showing who’s really contributing to important software, we can help shape how funding is used. We can also spot software that’s important but isn’t getting enough attention – it needs more maintenance, more contributions, or more funding – so it’s at risk. We’re the only ones who can do this on such a large scale, and see all the different kinds of contributions you can’t see just by looking at GitHub.

You’ve been a Debian developer for a long time. How has your perspective on open-source development changed over the years? 

Well, back in the early days, it was all volunteers, right? People were doing this just for fun or to “scratch an itch,” as the saying goes. Free software made collaboration possible, and then the industry took notice. Now there’s a lot of intermingling between paid and volunteer contributions. There are very relevant projects like the Linux kernel that are almost entirely maintained by paid contributions. But I think it’s important to keep the door open for hobbyists, both volunteer maintainers and drive-by contributors, to participate. We’re risking a strict separation between commercial open source, paid for by large companies, and hobbyist open source, which gives the ability for everyone to contribute and understand the code they run.
