A Guide to Open Source for ML Enthusiasts

Oct 23, 2023

The intersection of open source and machine learning has become a hub of innovation and democratization. Open source is a cornerstone for ML enthusiasts, offering opportunities to collaborate, explore, and contribute. Whether you’re a seasoned practitioner or just starting your ML journey, this guide shows how you can contribute to open source as an ML engineer.

· Why Open Source for ML?
∘ Popular and influential open-source ML projects
· Getting Started with Git and GitHub
∘ What are Git and GitHub?
∘ Git Basics
∘ Cloning only the Last Commit
∘ Forking and Pull Requests on GitHub
· Ways to Contribute to Open Source as a DS/ML Folk
· Best Practices for Open Source Contributions in ML
· Resources
∘ No-code Contributions
∘ Documentation (beginner-friendly)
∘ Intermediate
∘ Advanced
· Conclusion

Why Open Source for ML?

Open source empowers collaboration, innovation, and accessibility in ML. Sharing code, knowledge, and tools democratizes the field and fosters a collective effort that is more inclusive and drives progress faster than any single entity could achieve alone.

Open-source software offers a number of benefits for machine learning practitioners, including:

1. Collaboration: Open source fosters a collaborative environment, allowing ML practitioners worldwide to work together on shared projects, leading to faster and more diverse advancements.

2. Access to Cutting-Edge Technology: ML professionals can access state-of-the-art algorithms, frameworks, and libraries without cost, leveling the playing field for those with limited resources.

3. Community Support: The open-source ML community provides a wealth of knowledge, peer support, and forums for problem-solving, aiding in professional development.

4. Customization: ML engineers can modify open-source tools to meet specific project requirements. For example, an open-source LLM can be fine-tuned for your use case instead of being trained from scratch, which reduces compute costs.

5. Knowledge Sharing: Open source promotes the dissemination of ML expertise and best practices, benefitting both newcomers and experienced practitioners.

6. Global Impact: Open source ML has the potential to address pressing global challenges, from healthcare to environmental issues, by leveraging collective expertise and resources.

Popular and influential open-source ML projects

Some of the most popular and influential open-source ML projects include:

One of the biggest trends in the ML open-source community is the increasing popularity of pre-trained ML models. Pre-trained models allow users to build and deploy ML models without having to train them from scratch. This can save a significant amount of time and resources.

Another trend in the ML open-source community is the growing emphasis on responsible ML. Responsible ML is about developing and deploying ML models in a way that is fair, ethical, and transparent. There are a number of open-source projects that are working to make responsible ML more accessible and easier to implement.

One of the challenges facing the ML open-source community is the need for more diversity. The ML community is predominantly white and male. This lack of diversity can lead to biases in ML models and can make it difficult for everyone to benefit from ML. There are a number of initiatives underway to increase diversity in the ML community, but more work needs to be done.

Another challenge facing the ML open-source community is the maintenance of existing projects. Many open-source ML projects are maintained by a small number of volunteers. This can make it difficult to keep up with the latest developments and fix bugs. There are a number of ways to support the maintenance of open-source ML projects, such as donating money or contributing code.

Getting Started with Git and GitHub

Git and GitHub are essential tools for version control and collaborative software development. In this section, we walk through the basics of Git, focusing on the commands `git init`, `git clone`, `git status`, `git add`, `git commit`, and `git push`. We’ll also explore the `--depth` parameter of `git clone` and `git pull`, which lets you fetch only the last commit. Finally, we’ll cover how to fork a repository and create pull requests on GitHub.

What are Git and GitHub?

Git is a distributed version control system that allows developers to track changes in their code and collaborate efficiently. It helps you manage your project’s history, work on different features simultaneously, and collaborate with others without overwriting each other’s work.

GitHub, on the other hand, is a web-based platform that utilizes Git for hosting and collaborating on software projects. It provides a user-friendly interface for managing repositories, issues, and more, making it a popular choice for open-source and private projects.

Git Basics

1. git init:

Before you can start using Git in a project, the project needs a repository. The `git init` command creates a new Git repository in a local directory, whether that directory is empty or already holds an existing project. To initialize a repository, navigate to the project’s root folder and run: `git init`.

This command creates a hidden `.git` directory that stores all the information about your project’s history and configuration.

2. git clone:

The `git clone` command creates a copy of a remote Git repository on your local machine. To clone a repository, run: `git clone <repository url>`.

This will create a local copy of the entire repository, including all its commits and branches. But what if you want to clone only the latest commit to save time and space?

You can do this by adding the `--depth` parameter and specifying the number of commits to include. For example, to clone only the latest commit, run: `git clone --depth 1 <repository url>`.

3. git status:

The `git status` command shows the current state of your working directory and staging area: which files are modified, untracked, or staged for commit. Run it often to stay informed about the state of your project: `git status`.

4. git add:

Before committing changes, you need to add your modified files to the staging area with the `git add` command. To stage a single file, run: `git add <filename>`.

Or stage all changes in the current directory with: `git add .` (note the space before the dot).

5. git commit:

After staging your changes, commit them to the repository with a message describing what you’ve done: `git commit -m "your commit message here"`.

This will create a snapshot of the staged changes and record them in the project’s history.

6. git push:

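To share your local commits with a remote repository, use the `git push` command: `git push`. The first push of a new branch usually needs an upstream to be set, for example with `git push -u origin <branch>`.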

This sends your commits to the remote repository, allowing others to see your work and collaborate with you.
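To see the six commands working together, here is a minimal sketch that runs entirely on your machine. It assumes Git is installed. The directory names (`demo-project`, `demo-clone`), the `Demo User` identity, and the local bare repository `remote.git` that stands in for a hosted remote such as GitHub are all hypothetical choices for illustration.

```shell
set -e

# Work in a throwaway directory so nothing on your machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Assumption: a local bare repository plays the role of the remote.
git init --bare --quiet remote.git

# git init: create a new repository (this adds the hidden .git directory).
mkdir demo-project
cd demo-project
git init --quiet
git config user.name "Demo User"          # hypothetical identity, local to this repo
git config user.email "demo@example.com"

# git status: a newly created file is reported as untracked.
echo "# Demo project" > README.md
git status --short                         # prints "?? README.md"

# git add: move the file into the staging area.
git add README.md

# git commit: record the staged snapshot in the project history.
git commit --quiet -m "Add README"

# git push: publish the commit; -u sets the upstream for later pushes.
git remote add origin "$workdir/remote.git"
git push --quiet -u origin HEAD

# git clone: anyone with access to the remote can now copy the project.
cd "$workdir"
git clone --quiet "$workdir/remote.git" demo-clone
git -C demo-clone log --oneline
```

With a real project you would replace the local `remote.git` path with the repository URL shown on its GitHub page.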

Cloning only the Last Commit

Using the `--depth` parameter with `git clone` and `git pull` can save time and disk space by fetching only the latest commit. As mentioned earlier, you can clone a repository with just the last commit using: `git clone --depth 1 <repository url>`.

You can also limit a pull to the latest commit from a remote repository using: `git pull --depth 1`.

This is useful when you want to quickly check out a project, contribute a quick fix, or save bandwidth and storage.
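The effect of `--depth 1` is easy to check for yourself. The sketch below is one way to do it, assuming Git is installed: it builds a small local repository with three commits (the names `source`, `shallow`, `full`, and `notes.txt` are hypothetical), then compares a shallow clone with a full one. A `file://` URL is used because Git ignores `--depth` when cloning from a plain local path.

```shell
set -e

workdir="$(mktemp -d)"
cd "$workdir"

# Build a source repository with three commits to clone from.
git init --quiet source
cd source
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"
for i in 1 2 3; do
  echo "change $i" >> notes.txt
  git add notes.txt
  git commit --quiet -m "Commit $i"
done
cd "$workdir"

# Shallow clone: only the most recent commit is downloaded.
git clone --quiet --depth 1 "file://$workdir/source" shallow
git -C shallow rev-list --count HEAD       # prints 1

# Full clone for comparison: the whole history comes along.
git clone --quiet "file://$workdir/source" full
git -C full rev-list --count HEAD          # prints 3

# If you later need the full history, a shallow clone can be completed.
git -C shallow fetch --quiet --unshallow
git -C shallow rev-list --count HEAD       # prints 3
```

On a large ML repository with years of history, the difference between the two clones can be substantial, which is where the shallow variant pays off.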

Forking and Pull Requests on GitHub

Forking is the process of creating a personal copy of someone else’s repository on GitHub. To fork a repository:

1. Visit the repository you want to fork on GitHub.
2. Click the Fork button in the upper right corner of the repository page.
3. This action creates a copy of the repository in your GitHub account.

Now, you can make changes to your forked repository. If you wish to contribute those changes back to the original repository, you create a pull request:

1. In your forked repository, click the New Pull Request button.
2. GitHub will compare the changes between your fork and the original repository.
3. Provide a title and description for your pull request, explaining the changes you’ve made.
4. Click Create Pull Request to submit your request.

The project’s maintainers can review your changes and merge them into the original repository if they find them suitable.
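Between forking and opening the pull request there is a command-line part: cloning your fork, making the change on a branch, and pushing that branch. The sketch below simulates it locally, assuming Git is installed. The bare repositories `upstream.git` and `fork.git` stand in for the original repository and your fork on GitHub, and the names `seed`, `my-fork`, and `update-readme` are hypothetical.

```shell
set -e

workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the original project (on GitHub this already exists).
git init --quiet seed
cd seed
git config user.name "Maintainer"         # hypothetical identity
git config user.email "maintainer@example.com"
echo "# Original project" > README.md
git add README.md
git commit --quiet -m "Initial commit"
cd "$workdir"
git clone --quiet --bare seed upstream.git

# Stand-in for clicking the Fork button on GitHub.
git clone --quiet --bare upstream.git fork.git

# Clone your fork; "origin" now points at the fork.
git clone --quiet fork.git my-fork
cd my-fork
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Keep a second remote pointing at the original repository.
git remote add upstream "$workdir/upstream.git"

# Make the change on its own branch rather than on the default branch.
git checkout --quiet -b update-readme
echo "See CONTRIBUTING for setup steps." >> README.md
git add README.md
git commit --quiet -m "Mention setup steps in README"

# Push the branch to your fork; the pull request is then opened on GitHub.
git push --quiet -u origin update-readme
git remote -v
```

Working on a dedicated branch keeps each pull request focused on one change and leaves your fork’s default branch free to track the original repository.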

Ways to Contribute to Open Source as a DS/ML Folk

Contributing to open source as a Machine Learning (ML) developer offers many opportunities beyond code. Here are some of the ways DS/ML folks can contribute:

- Documentation: Improve or write documentation for DS/ML libraries, and create tutorials, guides, or Jupyter notebooks that explain how to use tools or algorithms.
- Bug Reporting and Fixing: Report bugs you encounter by opening an issue under the repository’s “Issues” tab, and contribute fixes where you can.
- Data Cleaning and Preprocessing: Contribute to data preprocessing pipelines, and help with data cleaning, augmentation, or feature engineering.
- Provide Datasets: If you have access to unique datasets (and proper permission to share them), consider sharing them to aid the project, or contribute to open data repositories and platforms.
- Code Review: Review pull requests, especially if you have expertise in the specific algorithm or technique. You can offer insights on algorithm correctness, efficiency, or scalability.
- Algorithm Development: Propose new algorithms or improve existing ones. Optimize algorithms for speed, memory usage, hardware, or accuracy.

Best Practices for Open Source Contributions in ML

Contributing to open source Machine Learning (ML) projects has its own unique set of challenges and best practices due to the nature of ML development, which involves data, models, algorithms, and reproducibility concerns.

Here are some best practices tailored for contributing to open-source ML projects:

- Understand the Project: Familiarize yourself with the project’s objectives, architecture, and algorithms. Go through the documentation and tutorials, if they exist.
- Start Small: Your first contribution does not have to be massive. It can be anything from fixing a small bug to adding tests. Look for “good first issue” or “beginner-friendly” tags to begin.
- Start with Reproducibility: Before making changes, ensure you can reproduce existing results. Use tools like Conda environments to keep environments consistent.
- Documentation: Clearly document any changes made to data preprocessing, model architecture, or training parameters.
- Testing: ML projects benefit immensely from a robust testing framework. Write unit tests for individual components and integration tests for end-to-end workflows.
- Communicate Effectively: When proposing a change, explain the reason behind it and how it benefits the project. Be patient and respectful when contributing.
- Follow the Project’s Guidelines: Adhere to the style and standards of the project. Write meaningful commit messages. If the project has templates for issues or pull requests, use them.
- Provide Performance Metrics: Always include relevant performance metrics (accuracy, F1 score, ROC AUC, etc.) when proposing changes to models. Report both training and validation metrics so that reviewers can spot overfitting.
- Regularly Sync with the Main Branch: ML projects are fast-evolving. Regularly sync your fork with the main branch to incorporate the latest changes and avoid merge conflicts.
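The last practice, syncing your fork, can be sketched on the command line as follows. This is a local simulation under stated assumptions: Git is installed, the bare repositories `upstream.git` and `fork.git` stand in for the original repository and your fork on GitHub, and the names `seed`, `my-fork`, and `model_card.md` are hypothetical. The default branch name is read from the repository rather than assumed.

```shell
set -e

workdir="$(mktemp -d)"
cd "$workdir"

# Stand-ins for the original repository and your fork.
git init --quiet seed
cd seed
git config user.name "Maintainer"         # hypothetical identity
git config user.email "maintainer@example.com"
echo "v1" > model_card.md
git add model_card.md
git commit --quiet -m "Initial commit"
branch="$(git symbolic-ref --short HEAD)"  # default branch name (main or master)
cd "$workdir"
git clone --quiet --bare seed upstream.git
git clone --quiet --bare upstream.git fork.git
git clone --quiet fork.git my-fork

# The original project moves ahead after you forked it.
cd "$workdir/seed"
echo "v2" >> model_card.md
git commit --quiet -am "Update model card"
git push --quiet "$workdir/upstream.git" HEAD

# Sync: fetch from the original, fast-forward locally, update your fork.
cd "$workdir/my-fork"
git remote add upstream "$workdir/upstream.git"
git fetch --quiet upstream
git checkout --quiet "$branch"
git merge --quiet --ff-only "upstream/$branch"
git push --quiet origin "$branch"
git log --oneline
```

Using `--ff-only` is a deliberate choice here: the merge succeeds only if your default branch has no commits of its own, which is a good sign that feature work is living on separate branches.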

Resources

For easy access, we have separated these repos into four categories: No-code, Beginner, Intermediate, and Advanced.

Below is a curated list of open-source ML projects seeking contributions. They cover a wide range of ML domains and difficulty levels, from no-code contributions and beginner-friendly documentation to computer vision and natural language processing.

No-code Contributions

DagsHub — Dataset Contribution

DagsHub — Model Contribution

DagsHub — Documentation

Documentation (beginner-friendly)

DagsHub — Dataset Contribution(3D models)

DagsHub — Dataset Contribution(audio)

MLSA — Image Classification

Intermediate

DagsHub — Data Pipeline(Data Engineering)

Featureform: Project-Based Contributions

Richard Dushime — Facial Recognition and Biometric Attendance

MLSA — Speech Translation

MLSA — Nigerian Student’s Year One Performance Prediction Project

GDSC — Data Science Contributions

Advanced

DagsHub — Classification and Localization of Thoracic Diseases

DagsHub — Q/A bot(LLM)

- Hacktoberfest for ML Engineers: Find out how to participate in Hacktoberfest, an annual open-source event encouraging contributions in October. Discover how you can make a meaningful impact and earn rewards.

- Other ML-Related Open Source Events: Explore additional open source events and initiatives that cater to ML engineers. Stay informed about opportunities to contribute, learn, and grow in the open-source community.

Conclusion

Having reached the end of this guide, you should be well prepared to dive headfirst into the world of open source for ML. Remember, the open-source community is not only about code; it’s about collaboration, learning, and making a real difference. So, pick your favorite project, contribute, and let your journey begin!

Note: Watch out for Hacktoberfest and explore the repositories listed in the Resources section. It’s an opportunity for ML engineers like you to contribute and grow within the open-source ecosystem. Happy coding! 🚀

This article was co-authored by Community Leads at the GDSC ML Community, UNILAG: Fikayo Adeleke, Samuel Bamgbola, Ifihan Oluseye, and Olorundara Akojede.
