Joining forces with Algolia for an even better npm search

Utilized by jsDelivr, Yarn, CodeSandbox, and several other open-source projects, Algolia’s npm search has been an essential part of the user experience for developers searching for npm packages for almost seven years now. Today, we are excited to announce that we are becoming co-maintainers of the project!

The project was started by Algolia in 2016 with the goal of providing faster and more relevant search results than npm’s official API. Built on top of Algolia’s search platform, it offers real-time full-text search with typo tolerance and advanced ranking based on package popularity.

We use the search on our website and in many of our integrations, and have already contributed several features in the past. Taking a more active role in the project’s development will allow us to deliver new features and improvements faster than before. In fact, we have already shipped an extensive set of changes aimed at improving the project’s reliability, and you can read more about those below 🚀

Diving into the original indexing process

To provide a lightning-fast search, Algolia must have all the data in its indices. But how does the data get there? That is the main task of the npm search project – collecting data from the original sources and inserting it into the search index. The data sources are:

The npm replication API, which provides both the list of all packages and the metadata for individual packages (which roughly matches the data in `package.json`).
The npm downloads API, which provides download statistics for packages used for ranking the results.
The jsDelivr API, which provides jsDelivr download statistics also used for ranking the results.
GitHub, GitLab, and Bitbucket, which are used to retrieve the package changelog in case it is not included in the published files.

The indexing process is further split into two phases:

The bootstrap phase covers creating the initial index from an empty state. It processes several packages in parallel to finish faster.
The watch phase starts after the bootstrap, continuously listens for changes in the npm registry, and applies those changes to the Algolia index. To guarantee consistency, this process is sequential, and changes are processed one by one.
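
The difference between the two phases can be sketched as follows. This is a minimal illustration, not the project's actual code: `Processor`, `processInParallel`, and `processSequentially` are hypothetical names, and the real indexer does considerably more work per package.

```typescript
// Hypothetical stand-in for handling one package; a real indexer would
// query the data sources and write to the search index here.
type Processor = (name: string) => Promise<void>;

// Bootstrap-style processing: up to `concurrency` packages in flight at once.
async function processInParallel(
  names: string[],
  concurrency: number,
  processPackage: Processor
): Promise<void> {
  let next = 0;
  // Each worker repeatedly claims the next unprocessed position in the list.
  const worker = async (): Promise<void> => {
    while (next < names.length) {
      const name = names[next++];
      await processPackage(name);
    }
  };
  const workerCount = Math.min(concurrency, names.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
}

// Watch-style processing: one change at a time, in the order received.
async function processSequentially(
  changes: string[],
  processPackage: Processor
): Promise<void> {
  for (const name of changes) {
    await processPackage(name);
  }
}
```

The sequential variant trades throughput for a simple ordering guarantee: a later change to a package can never be applied before an earlier one.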

In both phases, “processing a package” means retrieving data from all the relevant data sources, building the final record, and inserting it into the search index.

While the process as described here seems fairly straightforward, one of the issues the project has come to face over the years is caused by the sheer amount of data – with about 2.5 million packages in the public npm registry, even making a single HTTP request per package to get its metadata means making 2.5 million requests in total. Add in the requests needed to periodically update download statistics and detect changelogs, and the number climbs even higher.

This problem has been made worse by the fact that every service providing the data has its own rate limits, meaning that even if we were able to process the data faster, we could not retrieve it fast enough. Furthermore, since data for a single package are considered an atomic unit, the rate of processing is effectively reduced to that of the slowest external service, and there is a very limited time window for retries in case of temporary failures without blocking the whole process.
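
A rough back-of-the-envelope calculation shows why this matters. The package count comes from the text above; the per-service rates are purely hypothetical placeholders chosen for illustration and do not reflect the real limits of any service.

```typescript
// Lower bound on the time needed to issue all requests against one service.
function minimumHours(totalRequests: number, requestsPerSecond: number): number {
  return totalRequests / requestsPerSecond / 3600;
}

// When a package is an atomic unit, throughput is capped by the slowest service.
function effectiveRate(ratesPerSecond: number[]): number {
  return Math.min(...ratesPerSecond);
}

const packages = 2_500_000; // figure from the text

// Hypothetical per-service limits, in packages per second (not real limits).
const rates = { registry: 100, downloads: 20, changelogs: 10 };

const registryOnly = minimumHours(packages, rates.registry);
const allSources = minimumHours(
  packages,
  effectiveRate([rates.registry, rates.downloads, rates.changelogs])
);

console.log(registryOnly.toFixed(1)); // prints 6.9
console.log(allSources.toFixed(1)); // prints 69.4
```

Under these assumed numbers, coupling all sources together makes a full run ten times slower than the registry alone would allow, even though most of the essential data comes from the registry.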

The indexing process reimagined

To address the existing issues, we have redesigned the process so that data are fetched in multiple stages, and the performance or downtime of a single data source does not impact the whole process. We have also made changes to avoid repeatedly requesting data we already have.

The key ideas are inspired by traditional message queue systems, but to avoid new dependencies, the queues are implemented on top of additional Algolia indices.
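
To make the idea more concrete, here is a minimal in-memory sketch of a queue stored as records keyed by `objectID`, the way records in an Algolia index are. The class and method names are hypothetical, a `Map` stands in for the index, and the `retries` counter mirrors the one the bootstrap queue uses; the real implementation talks to Algolia instead.

```typescript
interface QueueRecord {
  objectID: string; // package name
  retries: number; // how many times processing has failed so far
}

class IndexBackedQueue {
  // Stand-in for an index: records keyed by objectID.
  private records = new Map<string, QueueRecord>();

  // Add packages to the queue; already queued packages are left untouched.
  enqueue(names: string[]): void {
    for (const name of names) {
      if (!this.records.has(name)) {
        this.records.set(name, { objectID: name, retries: 0 });
      }
    }
  }

  // Return up to `limit` records, those with fewer retries first.
  pick(limit: number): QueueRecord[] {
    return Array.from(this.records.values())
      .sort((a, b) => a.retries - b.retries)
      .slice(0, limit);
  }

  // Processing succeeded: remove the record from the queue.
  ack(objectID: string): void {
    this.records.delete(objectID);
  }

  // Processing failed: keep the record and bump its retries counter.
  fail(objectID: string): void {
    const record = this.records.get(objectID);
    if (record) record.retries += 1;
  }

  get size(): number {
    return this.records.size;
  }
}
```

Because a failed package stays in the queue with a higher counter, it is retried later without blocking the packages behind it.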

The bootstrap phase

The bootstrap now runs in four independent queues. This allows for better retry handling, more specific rate-limiting control, and better performance for indexing the most essential data. The new .periodic-data and .one-time-data indices significantly reduce the number of requests we need to make in the case of a repeated full bootstrap.

The four bootstrap queues are used for:

Listing the packages: instead of being queued in memory and processed right away, discovered packages are written to the bootstrap queue and processed later. The listing process continues as soon as the packages are safely stored in the queue.
Indexing data from the registry: packages are picked from the bootstrap queue, prioritizing those with a lower retries counter. The indexer retrieves the full document from npm, formats it, and stores it in the main index. This is similar to before, except we do not query the additional data sources at this point. The external data are added later by additional indexers.
Indexing npm downloads: packages have an internal field indicating the time of the last update, and the third queue processes all packages whose last update is more than 30 days old. This process runs concurrently with the main indexer and relies on Algolia’s partialUpdateObject and IncrementFrom operations to perform atomic updates. If the background indexer attempts to update a record that has been changed by the main indexer in the meantime, the update is discarded. The downloads data are also stored in a separate .periodic-data index to reduce calls to the npm API and speed up the indexing in case of repeated bootstraps or package updates. Whenever we process a package, the .periodic-data index is checked first. The npm downloads API is only queried if we do not have the data yet or if it is older than 30 days. This index is shared between the bootstrap and watch phases.
Indexing changelogs: packages have an internal field indicating if we have already attempted to find the changelog, and the fourth queue processes all packages where this has a non-zero value. Similar to the downloads indexer, this process runs concurrently with the main indexer and uses the .one-time-data index as a cache to reduce calls to the external services on repeated bootstraps.
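
The cache-first lookup used for download statistics can be sketched roughly like this. It is an illustration under stated assumptions: a `Map` stands in for the .periodic-data index, and `getDownloads`, `isFresh`, and the `fetchFromNpm` callback are hypothetical names rather than the project's real functions.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

interface CachedDownloads {
  downloads: number;
  updatedAt: number; // timestamp in milliseconds
}

// An entry is usable if it exists and is not older than 30 days.
function isFresh(
  entry: CachedDownloads | undefined,
  now: number
): entry is CachedDownloads {
  return entry !== undefined && now - entry.updatedAt <= THIRTY_DAYS_MS;
}

// Check the cache first; only call the remote API when the cached data
// are missing or stale, and store the fresh result for next time.
async function getDownloads(
  name: string,
  cache: Map<string, CachedDownloads>,
  fetchFromNpm: (name: string) => Promise<number>,
  now: number = Date.now()
): Promise<number> {
  const cached = cache.get(name);
  if (isFresh(cached, now)) {
    return cached.downloads;
  }
  const downloads = await fetchFromNpm(name);
  cache.set(name, { downloads, updatedAt: now });
  return downloads;
}
```

On a repeated bootstrap, most packages would hit the cache, so the rate-limited API is only contacted for new packages and for entries that have aged out.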

The watch phase

The watch phase received a similar set of changes, along with the ability to process multiple updates in parallel while keeping the consistency guarantees it had before. This has been made possible by combining the advanced partialUpdateObject, IncrementFrom, and deleteBy Algolia features.
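
The consistency guarantee rests on conditional updates: an IncrementFrom operation applies an update only when the supplied value matches the value currently stored, and otherwise ignores the whole update. The sketch below simulates that compare-and-increment behaviour in memory. It does not call the Algolia client, and `FakeIndex`, `updateIfRevisionMatches`, and the `_revision` field are hypothetical names used for illustration.

```typescript
interface PackageRecord {
  objectID: string;
  version: string;
  _revision: number; // hypothetical counter guarding concurrent updates
}

class FakeIndex {
  private records = new Map<string, PackageRecord>();

  // Apply the update only if `expectedRevision` equals the stored revision
  // (a missing record counts as revision 0); otherwise ignore it entirely.
  updateIfRevisionMatches(
    objectID: string,
    expectedRevision: number,
    changes: { version: string }
  ): boolean {
    const current = this.records.get(objectID);
    const currentRevision = current ? current._revision : 0;
    if (currentRevision !== expectedRevision) {
      return false; // stale writer: the update is discarded
    }
    this.records.set(objectID, {
      objectID,
      version: changes.version,
      _revision: currentRevision + 1,
    });
    return true;
  }

  get(objectID: string): PackageRecord | undefined {
    return this.records.get(objectID);
  }
}

const index = new FakeIndex();
index.updateIfRevisionMatches('left-pad', 0, { version: '1.0.0' }); // applied
// Two writers both read revision 1, then race to update the record.
const firstApplied = index.updateIfRevisionMatches('left-pad', 1, { version: '1.1.0' });
const secondApplied = index.updateIfRevisionMatches('left-pad', 1, { version: '1.0.1' });
console.log(firstApplied, secondApplied); // prints true false
```

Whichever writer arrives second sees a revision mismatch and its change is dropped, so parallel workers cannot overwrite newer data with older data.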

Additional improvements

On top of the described performance and reliability changes, we have also made several smaller improvements:

Detection of unpublished packages: they should now be correctly removed from the index in all cases.
Using jsDelivr downloads for awarding the “popular” badge: the top 1k jsDelivr packages are now marked as popular, along with the packages that were previously marked as popular based on npm downloads.
Switched to a new source of DefinitelyTyped data as the previous one was no longer available.
Reduced HTTP requests to npm downloads API by using batched requests where possible.
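
As an illustration of the batching idea, the sketch below groups package names into bulk request URLs for the npm downloads API. It rests on two assumptions that should be checked against the API documentation: that a bulk request accepts up to 128 comma-separated packages, and that scoped packages have to be requested one at a time. `buildDownloadUrls` is a hypothetical helper, not a function from the project.

```typescript
// Assumed maximum number of packages per bulk request.
const MAX_BULK_PACKAGES = 128;
const DOWNLOADS_BASE = 'https://api.npmjs.org/downloads/point';

function buildDownloadUrls(packages: string[], period = 'last-month'): string[] {
  // Assumption: scoped packages ("@scope/name") are not accepted in bulk queries.
  const scoped = packages.filter((name) => name.charAt(0) === '@');
  const unscoped = packages.filter((name) => name.charAt(0) !== '@');

  const urls: string[] = [];
  // One URL per batch of up to MAX_BULK_PACKAGES unscoped packages.
  for (let i = 0; i < unscoped.length; i += MAX_BULK_PACKAGES) {
    const batch = unscoped.slice(i, i + MAX_BULK_PACKAGES);
    urls.push(`${DOWNLOADS_BASE}/${period}/${batch.join(',')}`);
  }
  // One URL per scoped package.
  for (const name of scoped) {
    urls.push(`${DOWNLOADS_BASE}/${period}/${name}`);
  }
  return urls;
}
```

With these assumptions, 1,000 unscoped packages need 8 requests instead of 1,000.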

If you are interested in even more technical details, check out the full changes and discussion at https://github.com/algolia/npm-search/pull/1140, and be sure to follow us on Twitter for future posts!