July 23, 2018

Top 20 Python AI and Machine Learning Open Source Projects

Getting into Machine Learning and AI is not an easy task, but it is a critical part of data science programs. Many aspiring professionals and enthusiasts find it hard to establish a proper path into the field, given the enormous amount of resources available today. The field is evolving constantly, and it is crucial to keep up with the pace of this rapid development. A good way to stay current on advances in ML is to engage with the community by contributing to the many open-source projects and tools that advanced professionals use daily.

Here we update the information and examine the trends since our previous post, Top 20 Python Machine Learning Open Source Projects (Nov 2016).

TensorFlow has moved to first place with triple-digit growth in contributors. Scikit-learn dropped to 2nd place but still has a very large base of contributors.

Compared to 2016, the projects with the fastest growth in number of contributors were:

  1. TensorFlow, 16
  2. Deap, 8
  3. Chainer, 8
  4. Gensim, 8
  5. Neon, 6
  6. Nilearn, 5
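The growth figures in this list are percentage changes in contributor counts between the two snapshots. As a minimal sketch of that arithmetic (the function name `percent_growth` and the example counts are hypothetical and not taken from the article):

```python
def percent_growth(previous: int, current: int) -> float:
    """Return the percentage change from `previous` to `current`."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) / previous * 100.0


# Hypothetical counts for illustration only; they are not from the article.
earlier_count = 200
later_count = 250
print(f"{percent_growth(earlier_count, later_count):.0f}% growth")
```

A project that is new to the list has no earlier count, which is why Keras and PyTorch are reported as "new" rather than with a percentage.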

Also new in 2018:

  1. Keras, 629 contributors
  2. PyTorch, 399 contributors


Fig. 1: Top 20 Python AI and Machine Learning projects on Github.

Size is proportional to the number of contributors, and color represents the change in the number of contributors: red is higher, blue is lower. Snowflake shapes mark Deep Learning projects; round shapes mark other projects.

We see that Deep Learning projects like TensorFlow, Theano, and Caffe are among the most popular.

The list below gives projects in descending order based on the number of contributors on Github. The change in number of contributors is relative to the 2016 KDnuggets post, Top 20 Python Machine Learning Open Source Projects.

We hope you enjoy going through the documentation pages of each of these to start collaborating and learning the ways of Machine Learning using Python.

  1. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization. The system is designed to facilitate research in machine learning, and to make it quick and easy to transition from research prototype to production system. Contributors: 1324 (16
  2. Scikit-learn provides simple and efficient tools for data mining and data analysis that are accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib, and is open source and commercially usable under the BSD license. Contributors: 1019 (3
  3. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. Contributors: 629 (new), Commits: 4371, Github URL: Keras
  4. PyTorch provides tensors and dynamic neural networks in Python with strong GPU acceleration. Contributors: 399 (new), Commits: 6458, Github URL: pytorch
  5. Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Contributors: 327 (2
  6. Gensim is a free Python library with features such as scalable statistical semantics, analysis of plain-text documents for semantic structure, and retrieval of semantically similar documents. Contributors: 262 (8
  7. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. Contributors: 260 (2
  8. Chainer is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high-performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational auto-encoders. Contributors: 154 (8
  9. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator. Contributors: 144 (3
  10. Shogun is a machine learning toolbox that provides a wide range of unified and efficient Machine Learning (ML) methods. The toolbox makes it easy to combine multiple data representations, algorithm classes, and general-purpose tools. Contributors: 139 (3
  11. Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc.) using mathematical expressions, and Theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU). Contributors: 119 (3.
  12. NuPIC is an open source project based on a theory of the neocortex called Hierarchical Temporal Memory (HTM). Parts of HTM theory have been implemented, tested, and used in applications, and other parts of HTM theory are still being developed. Contributors: 85 (1
  13. Neon is Nervana's Python-based deep learning library. It provides ease of use while delivering the highest performance. Note: Intel is no longer supporting Neon, but you may still be able to work with it via what remains on Github. Contributors: 78 (6
  14. Nilearn is a Python module for fast and easy statistical learning on NeuroImaging data. It leverages the scikit-learn Python toolbox for multivariate statistics with applications such as predictive modelling, classification, decoding, or connectivity analysis. Contributors: 69 (5
  15. Orange3 is open source machine learning and data visualization software for novices and experts, offering interactive data analysis workflows with a large toolbox. Contributors: 53 (3
  16. Pymc is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Contributors: 39 (5.
  17. Deap is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelisation mechanisms such as multiprocessing and SCOOP. Contributors: 39 (8
  18. Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory so that many processes may share the same data. Contributors: 35 (4
  19. PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. Contributors: 32 (
  20. Fuel is a data pipeline framework which provides your machine learning models with the data they need. It is planned to be used by both the Blocks and Pylearn2 neural network libraries. Contributors: 32 (1

The contributor and commit numbers were recorded in February 2018.
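The ordering used in the list can be reproduced by sorting projects on their contributor counts. The following is only a sketch: it uses the counts reported above for the first five projects, and the variable names `contributors` and `ranked` are illustrative assumptions rather than part of any tool used for the original post.

```python
# Contributor counts as reported in the list above (first five projects),
# deliberately entered out of order to show the sort.
contributors = {
    "Keras": 629,
    "TensorFlow": 1324,
    "PyTorch": 399,
    "Scikit-learn": 1019,
    "Theano": 327,
}

# Sort by contributor count, largest first, to obtain the ranking.
ranked = sorted(contributors.items(), key=lambda item: item[1], reverse=True)

for rank, (name, count) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {count} contributors")
```

Python's `sorted` is stable, so projects with equal counts (such as Pymc and Deap at 39) would keep their input order unless a secondary key is added.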

Editor's note: This was originally posted on KDnuggets, and has been reposted with permission. Author Ilan Reinstein is a physicist and data scientist.

About the author

Ilan Reinstein

Data Scientist at NYU Langone Medical Center. MS Applied Physics, NYU Tandon School of Engineering; BS Physics, Universidad de Los Andes (Colombia).
