Document stream-based search

Signed-off-by: Eric Peterson <ericpeterson@spotify.com>
2022-02-26 21:05:24 +01:00
parent d0993bc6f1
commit 022507c860
9 changed files with 380 additions and 15 deletions
@@ -0,0 +1,14 @@
+---
+'@backstage/plugin-search-backend-node': minor
+'@backstage/search-common': minor
+---
+
+The Backstage Search Platform's indexing process has been rewritten as a stream
+pipeline in order to improve efficiency and performance on large document sets.
+
+The concepts of `Collator` and `Decorator` have been replaced with readable and
+transform object streams (respectively), as well as factory classes to
+instantiate them.
+
+Accordingly, the `SearchEngine.index()` method has also been replaced with a
+`getIndexer()` factory method that resolves to a writable object stream.
@@ -0,0 +1,6 @@
+---
+'@backstage/plugin-search-backend-module-pg': minor
+---
+
+The `PgSearchEngine` implements the new stream-based indexing process expected
+by the latest `@backstage/plugin-search-backend-node`.
@@ -0,0 +1,13 @@
+---
+'@backstage/plugin-techdocs-backend': patch
+---
+
+A `DefaultTechDocsCollatorFactory`, which works with the new stream-based
+search indexing subsystem, is now available. The `DefaultTechDocsCollator` will
+continue to be available for those unable to upgrade to the stream-based
+`@backstage/plugin-search-backend-node` (and related packages); however, it is
+now marked as deprecated and will be removed in a future version.
+
+To upgrade this plugin and the search indexing subsystem in one go, check
+[this changelog](https://github.com/backstage/backstage/blob/master/packages/create-app/CHANGELOG.md)
+for necessary changes to your search backend plugin configuration.
@@ -0,0 +1,43 @@
+---
+'@backstage/create-app': patch
+---
+
+The Backstage Search Platform's indexing process has been rewritten as a stream
+pipeline in order to improve efficiency and performance on large document sets.
+
+To take advantage of this, upgrade to the latest version of
+`@backstage/plugin-search-backend-node`, as well as any backend plugins whose
+collators you are using. Then, make the following changes to your
+`/packages/backend/src/plugins/search.ts` file:
+
+```diff
+-import { DefaultCatalogCollator } from '@backstage/plugin-catalog-backend';
+-import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
++import { DefaultCatalogCollatorFactory } from '@backstage/plugin-catalog-backend';
++import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
+
+// ...
+
+  const indexBuilder = new IndexBuilder({ logger, searchEngine });
+
+  indexBuilder.addCollator({
+    defaultRefreshIntervalSeconds: 600,
+-    collator: DefaultCatalogCollator.fromConfig(config, { discovery }),
++    factory: DefaultCatalogCollatorFactory.fromConfig(config, { discovery }),
+  });
+
+  indexBuilder.addCollator({
+    defaultRefreshIntervalSeconds: 600,
+-    collator: DefaultTechDocsCollator.fromConfig(config, {
++    factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
+      discovery,
+      logger,
+    }),
+  });
+```
+
+If you've written custom collators, decorators, or search engines in your
+Backstage backend instance, you will need to re-implement them as readable,
+transform, and writable streams respectively (including factory classes for
+instantiating them). [A how-to guide for refactoring](https://backstage.io/docs/features/search/how-to-guides#rewriting-alpha-style-collators-for-beta)
+existing implementations is available.
@@ -0,0 +1,6 @@
+---
+'@backstage/plugin-search-backend-module-elasticsearch': minor
+---
+
+The `ElasticSearchSearchEngine` implements the new stream-based indexing
+process expected by the latest `@backstage/plugin-search-backend-node`.
@@ -0,0 +1,13 @@
+---
+'@backstage/plugin-catalog-backend': patch
+---
+
+A `DefaultCatalogCollatorFactory`, which works with the new stream-based
+search indexing subsystem, is now available. The `DefaultCatalogCollator` will
+continue to be available for those unable to upgrade to the stream-based
+`@backstage/plugin-search-backend-node` (and related packages); however, it is
+now marked as deprecated and will be removed in a future version.
+
+To upgrade this plugin and the search indexing subsystem in one go, check
+[this changelog](https://github.com/backstage/backstage/blob/master/packages/create-app/CHANGELOG.md)
+for necessary changes to your search backend plugin configuration.
@@ -213,6 +213,7 @@ parallelization
 Patrik
 Peloton
 performant
+Performant
 plantuml
 Platformize
 Podman
@@ -54,13 +54,14 @@ An index is a collection of such documents of a given type.
 ### Collators

 You need to be able to search something! Collators are the way to define what
-can be searched. Specifically, they're classes which return documents conforming
-to a minimum set of fields (including a document title, location, and text), but
-which can contain any other fields as defined by the collator itself. One
-collator is responsible for defining and collecting documents of a type.
+can be searched. Specifically, they're readable object streams of documents that
+conform to a minimum set of fields (including a document title, location, and
+text), but which can contain any other fields as defined by the collator itself.
+One collator is responsible for defining and collecting documents of a type.

-Some plugins, like the Catalog Backend, provide so-called "default" collators
-which you can use out-of-the-box to start searching across Backstage quickly.
+Some plugins, like the Catalog Backend, provide so-called "default" collator
+factories which you can use out-of-the-box to start searching across Backstage
+quickly.

 ### Decorators

@@ -68,9 +69,15 @@ Sometimes you want to add extra information to a set of documents in your search
 index that the collator may not be aware of. For example, the Software Catalog
 knows about software entities, but it may not know about their usage or quality.

-Decorators are classes which can add extra fields to pre-collated documents.
-This extra metadata could then be used to bias search results or otherwise
-improve the search experience in your Backstage instance.
+Decorators are transform streams which sit between a collator (read stream) and
+an indexer (write stream) during the indexing process. They can be used to add
+extra fields to documents as they are being collated and indexed. This extra
+metadata could then be used to bias search results or otherwise improve the
+search experience in your Backstage instance.
+
+In addition to adding extra metadata, decorators (like any transform stream) can
+also be used to remove metadata, filter out documents, or even add extra
+documents at index time.
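The collator, decorator, and indexer roles described above can be sketched with plain Node.js streams. This is a minimal illustration, not the Search Platform's implementation: the `WidgetDocument` shape, the `team-widgets` owner value, and all function names are hypothetical.

```typescript
import { Readable, Transform, Writable } from 'stream';
import { pipeline } from 'stream/promises';

// Hypothetical document shape, for illustration only.
interface WidgetDocument {
  title: string;
  location: string;
  text: string;
  owner?: string;
}

// Collator stand-in: a readable object stream of documents.
function createCollator(): Readable {
  return Readable.from([
    { title: 'Widget A', location: '/widgets/a', text: 'First widget' },
    { title: 'Widget B', location: '/widgets/b', text: '' },
  ]);
}

// Decorator stand-in: drops documents with no text, adds a field to the rest.
function createDecorator(): Transform {
  return new Transform({
    objectMode: true,
    transform(doc: WidgetDocument, _encoding, callback) {
      if (!doc.text) {
        callback(); // Emit nothing: the document is filtered out.
        return;
      }
      callback(null, { ...doc, owner: 'team-widgets' });
    },
  });
}

// Indexer stand-in: collects whatever reaches the end of the pipeline.
function createIndexer(sink: WidgetDocument[]): Writable {
  return new Writable({
    objectMode: true,
    write(doc: WidgetDocument, _encoding, callback) {
      sink.push(doc);
      callback();
    },
  });
}

async function runIndexing(): Promise<WidgetDocument[]> {
  const indexed: WidgetDocument[] = [];
  await pipeline(createCollator(), createDecorator(), createIndexer(indexed));
  return indexed;
}
```

Because each stage handles one document at a time, memory use stays roughly constant regardless of how many documents the collator produces.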

 ### The Scheduler

@@ -48,10 +48,10 @@ const app = createApp({
 ## How to index TechDocs documents

 The TechDocs plugin has supported integrations to Search, meaning that it
-provides a default collator ready to be used.
+provides a default collator factory ready to be used.

 The purpose of this guide is to walk you through how to register the
-[DefaultTechDocsCollator](https://github.com/backstage/backstage/blob/master/plugins/techdocs-backend/src/search/DefaultTechDocsCollator.ts)
+[DefaultTechDocsCollatorFactory](https://github.com/backstage/backstage/blob/master/plugins/techdocs-backend/src/search/DefaultTechDocsCollatorFactory.ts)
 in your App, so that you can get TechDocs documents indexed.

 If you have been through the
@@ -60,18 +60,19 @@ you should have the `packages/backend/src/plugins/search.ts` file available. If
 so, you can go ahead and follow this guide - if not, start by going through the
 getting started guide.

-1. Import the DefaultTechDocsCollator from `@backstage/plugin-techdocs-backend`.
+1. Import the `DefaultTechDocsCollatorFactory` from
+   `@backstage/plugin-techdocs-backend`.

 ```typescript
-import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
+import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
 ```

-2. Register the DefaultTechDocsCollator with the IndexBuilder.
+2. Register the `DefaultTechDocsCollatorFactory` with the IndexBuilder.

 ```typescript
 indexBuilder.addCollator({
  defaultRefreshIntervalSeconds: 600,
-  collator: DefaultTechDocsCollator.fromConfig(config, {
+  factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
    discovery,
    logger,
    tokenManager,
@@ -131,3 +132,264 @@ indexBuilder.addCollator({

 As shown above, you can add a catalog entity filter to narrow down what catalog
 entities are indexed by the search engine.
+
+## How to migrate from Search Alpha to Beta
+
+For the purposes of this guide, Search Beta version is defined as:
+
+- **Search Plugin**: At least `v0.x.y`
+- **Search Backend Plugin**: At least `v0.x.y`
+- **Search Backend Node**: At least `v0.x.y`
+
+In the Beta version, the Search Platform's indexing process has been rewritten
+as a stream pipeline in order to improve efficiency and performance on large
+sets of documents.
+
+If you've not yet extended the Search Platform with custom code, and have
+instead taken advantage of default collators, decorators, and search engines
+provided by existing plugins, the migration process is fairly straightforward:
+
+1. Upgrade to at least version `0.x.y` of
+   `@backstage/plugin-search-backend-node`, as well as any backend plugins whose
+   collators you are using (e.g. at least version `0.x.y` of
+   `@backstage/plugin-catalog-backend` and/or version `0.x.y` of
+   `@backstage/plugin-techdocs-backend`).
+2. Then, make the following changes to your
+   `/packages/backend/src/plugins/search.ts` file:
+
+   ```diff
+   -import { DefaultCatalogCollator } from '@backstage/plugin-catalog-backend';
+   -import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
+   +import { DefaultCatalogCollatorFactory } from '@backstage/plugin-catalog-backend';
+   +import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
+   // ...
+     const indexBuilder = new IndexBuilder({ logger, searchEngine });
+     indexBuilder.addCollator({
+       defaultRefreshIntervalSeconds: 600,
+   -    collator: DefaultCatalogCollator.fromConfig(config, { discovery }),
+   +    factory: DefaultCatalogCollatorFactory.fromConfig(config, { discovery }),
+     });
+     indexBuilder.addCollator({
+       defaultRefreshIntervalSeconds: 600,
+   -    collator: DefaultTechDocsCollator.fromConfig(config, {
+   +    factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
+         discovery,
+         logger,
+       }),
+     });
+   ```
+
+Any custom collators, decorators, or search engine implementations will require
+minor refactoring. Continue on for details.
+
+### Rewriting alpha-style collators for beta
+
+In alpha versions of the Backstage Search Platform, collators were classes that
+implemented an `execute` method which resolved an `IndexableDocument` array.
+
+In beta versions, the logic encapsulated by the aforementioned `execute` method
+is contained within an [object-mode][obj-mode] `Readable` stream where each
+object pushed onto the stream is of type `IndexableDocument`. Instances of this
+stream are instantiated by a factory class conforming to the
+`DocumentCollatorFactory` interface.
+
+The optimal conversion strategy will vary depending on the collator's logic, but
+the simplest conversion can follow a process like this:
+
+1. Rename your collator class to something like `YourCollatorFactory` and update
+   it to implement `DocumentCollatorFactory` instead of `DocumentCollator`.
+2. Convert its `execute` method into an async generator that returns
+   `AsyncGenerator<YourIndexableDocument>` instead of resolving
+   `YourIndexableDocument[]`.
+3. Implement `DocumentCollatorFactory`'s `getCollator` method which resolves to
+   `Readable.from(this.execute())` (which is a utility for creating [readable
+   streams][read-stream] from [async generators][async-gen]).
+
+```ts
+import { DocumentCollatorFactory } from '@backstage/plugin-search-backend-node';
+import { Readable } from 'stream';
+export class YourCollatorFactory implements DocumentCollatorFactory {
+  public readonly type: string = 'your-type';
+  async *execute(): AsyncGenerator<YourIndexableDocument> {
+    const widgets = await this.client.getWidgets();
+    for (const widget of widgets) {
+      yield {
+        title: widget.name,
+        location: widget.url,
+        text: widget.description,
+      };
+    }
+  }
+  async getCollator() {
+    return Readable.from(this.execute());
+  }
+}
+```
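To check a converted collator locally, one option is to drain the stream it returns and inspect the documents. The sketch below assumes a hypothetical in-memory `fakeClient` in place of a real client, and uses standalone functions rather than a factory class so that it runs without any Backstage packages.

```typescript
import { Readable } from 'stream';

// Hypothetical document shape, for illustration only.
interface WidgetDocument {
  title: string;
  location: string;
  text: string;
}

// Hypothetical stand-in for the client used by a real collator.
const fakeClient = {
  async getWidgets() {
    return [
      { name: 'Widget A', url: '/widgets/a', description: 'First widget' },
      { name: 'Widget B', url: '/widgets/b', description: 'Second widget' },
    ];
  },
};

// Async generator that yields one document per widget.
async function* execute(): AsyncGenerator<WidgetDocument> {
  const widgets = await fakeClient.getWidgets();
  for (const widget of widgets) {
    yield {
      title: widget.name,
      location: widget.url,
      text: widget.description,
    };
  }
}

// Wraps the async generator in an object-mode Readable.
async function getCollator(): Promise<Readable> {
  return Readable.from(execute());
}

// Drains the stream; readable streams are async iterable.
async function collectTitles(): Promise<string[]> {
  const titles: string[] = [];
  const collator = await getCollator();
  for await (const doc of collator) {
    titles.push((doc as WidgetDocument).title);
  }
  return titles;
}
```

Calling `collectTitles()` resolves to the titles in the order the generator yielded them.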
+
+Note: it may be possible to simplify your collator dramatically! If your custom
+collator was previously using streams under the hood (for example, by reading
+newline delimited JSON from a local or remote file), you could just expose the
+stream directly via a simple factory class:
+
+```ts
+import { DocumentCollatorFactory } from '@backstage/plugin-search-backend-node';
+import { createReadStream } from 'fs';
+import { parse } from '@jsonlines/core';
+export class YourCollatorFactory implements DocumentCollatorFactory {
+  public readonly type: string = 'your-type';
+  async getCollator() {
+    const parseStream = parse();
+    return createReadStream('./documents.ndjson').pipe(parseStream);
+  }
+}
+```
+
+### Rewriting alpha-style decorators for beta
+
+In alpha versions of the Backstage Search Platform, decorators were classes that
+implemented an `execute` method which took an `IndexableDocument` array as an
+argument, and resolved a modified array of the same type.
+
+In beta versions, the logic encapsulated by the aforementioned `execute` method
+is contained within an object-mode `Transform` stream which reads objects of
+type `IndexableDocument`, and writes objects of a conforming type. Similar to
+collators, instances of this stream are instantiated by a factory class
+conforming to the `DocumentDecoratorFactory` interface.
+
+Although you can choose to implement a `Transform` stream from scratch, the
+`@backstage/plugin-search-backend-node` package provides a `DecoratorBase` class
+in order to simplify the developer experience. With this base class, all that's
+needed is to transfer your old decorator class logic into the base class' three
+methods (`initialize`, `decorate`, and `finalize`), and implement the factory
+class that instantiates the stream:
+
+```ts
+import { DecoratorBase } from '@backstage/plugin-search-backend-node';
+export class YourDecorator extends DecoratorBase {
+  async initialize() {
+    // Setup logic. Performed once before any documents are consumed.
+  }
+  async decorate(
+    document: YourIndexableDocument,
+  ): Promise<YourIndexableDocument | YourIndexableDocument[] | undefined> {
+    // Perform transformation logic here.
+    return document;
+  }
+  async finalize() {
+    // Teardown logic. Performed once after all documents have been consumed.
+  }
+}
+export class YourDecoratorFactory implements DocumentDecoratorFactory {
+  async getDecorator() {
+    return new YourDecorator();
+  }
+}
+```
+
+Note the return type of the `decorate` method: each of its three possible
+shapes can be used to different effect.
+
+- By resolving a single `YourIndexableDocument` object, your decorator can be
+  used to make simple transformations:
+
+  ```ts
+  class BooleanWidgetCoolnessDecorator extends DecoratorBase {
+    async decorate(widget) {
+      // Perform a simple, 1:1 transformation.
+      widget.isCool = widget.isCool === 'true';
+      return widget;
+    }
+  }
+  ```
+
+- By resolving `undefined`, your decorator can filter out documents which
+  shouldn't be in the index:
+
+  ```ts
+  class OnlyCoolWidgetsDecorator extends DecoratorBase {
+    async decorate(widget) {
+      // Perform a simple filter operation.
+      return widget.isCool ? widget : undefined;
+    }
+  }
+  ```
+
+- By resolving an array of `YourIndexableDocument` objects, you can generate
+  multiple documents based on the content of one:
+
+  ```ts
+  class WidgetByVariantDecorator extends DecoratorBase {
+    async decorate(widget) {
+      // Generate one widget doc per widget variant.
+      return widget.variants.map(variant => {
+        // Each widget doc is the given widget plus a "variant" property
+        // pulled from a widget.variants string array.
+        return {
+          ...widget,
+          variant,
+        };
+      });
+    }
+  }
+  ```
+
+In alpha versions, a decorator had access to every `IndexableDocument`
+simultaneously. This is no longer possible in beta versions (precisely to make
+the indexing process more efficient and performant). You will need to modify
+your decorator's logic so that it does not need access to every document at
+once.
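Logic that used to scan the whole array can often be rewritten to keep a small amount of running state instead. As a hedged sketch, the example below removes duplicate documents by remembering only the locations already seen. It uses a plain `Transform` rather than `DecoratorBase` so that it is self-contained; `WidgetDocument`, `createDedupeDecorator`, and `dedupe` are hypothetical names.

```typescript
import { Readable, Transform } from 'stream';

// Hypothetical document shape, for illustration only.
interface WidgetDocument {
  title: string;
  location: string;
  text: string;
}

// Drops documents whose location has already been seen. Only the set of
// locations is held in memory, never the full list of documents.
function createDedupeDecorator(): Transform {
  const seen = new Set<string>();
  return new Transform({
    objectMode: true,
    transform(doc: WidgetDocument, _encoding, callback) {
      if (seen.has(doc.location)) {
        callback(); // Duplicate: emit nothing.
        return;
      }
      seen.add(doc.location);
      callback(null, doc);
    },
  });
}

// Helper that runs an array of documents through the decorator.
async function dedupe(docs: WidgetDocument[]): Promise<WidgetDocument[]> {
  const result: WidgetDocument[] = [];
  const stream = Readable.from(docs).pipe(createDedupeDecorator());
  for await (const doc of stream) {
    result.push(doc as WidgetDocument);
  }
  return result;
}
```

The same pattern (a counter, a set, or a lookup table built in setup logic) covers many cases that previously relied on having every document available at once.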
+
+### Rewriting alpha-style search engines for beta
+
+Search engines are responsible for both indexing documents into, and querying
+them from, an underlying search engine technology. While the search engine
+query interface didn't change between alpha and beta versions, the indexing
+half of the interface _did_ change.
+
+In alpha versions of the Backstage Search Platform, a search engine implemented
+an `index` method which took a `type` and an `IndexableDocument` array and was
+responsible for writing these documents to the underlying search engine.
+
+In beta versions, the logic encapsulated by the aforementioned `index` method is
+contained within an object-mode `Writable` stream which expects objects of type
+`IndexableDocument`. On the search engine class itself, the `index` method is
+replaced with a `getIndexer` factory method which still takes the `type`, but
+resolves an instance of the aforementioned `Writable` stream.
+
+Although you can choose to implement a `Writable` stream from scratch, the
+`@backstage/plugin-search-backend-node` package provides a
+`BatchSearchEngineIndexer` class in order to simplify the developer experience.
+With this base class, which collects documents in batches of a configurable size
+on your behalf, all that's needed is to transfer your old `index` method logic
+into the base class' three methods (`initialize`, `index`, and `finalize`), and
+implement the factory method that instantiates the stream:
+
+```ts
+import { BatchSearchEngineIndexer } from '@backstage/plugin-search-backend-node';
+import { IndexableDocument, SearchEngine } from '@backstage/search-common';
+export class YourSearchEngineIndexer extends BatchSearchEngineIndexer {
+  // An imaginary search engine indexing client. It is stored under a name
+  // other than `index` so that it does not clash with the `index` method.
+  private readonly client: SomeSearchEngineIndex;
+  constructor({ type }: { type: string }) {
+    // Customize the number of documents passed to the index method per batch.
+    super({ batchSize: 500 });
+    this.client = new SomeSearchEngineIndex({ indexName: type });
+  }
+  async initialize() {
+    // Setup logic. Performed once before any documents are consumed.
+  }
+  async index(documents: IndexableDocument[]) {
+    // Write the current batch using the imaginary client.
+    await this.client.batchOf(documents);
+  }
+  async finalize() {
+    // Teardown logic. Performed once after all documents have been consumed.
+  }
+}
+export class YourSearchEngine implements SearchEngine {
+  async getIndexer(type: string) {
+    return new YourSearchEngineIndexer({ type });
+  }
+}
+```
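For those who prefer to implement a `Writable` stream from scratch, the batching behaviour described above can be approximated as follows. This is a sketch under stated assumptions, not the source of `BatchSearchEngineIndexer`: `createBatchIndexer`, its `indexBatch` callback, and `WidgetDocument` are hypothetical.

```typescript
import { Readable, Writable } from 'stream';
import { pipeline } from 'stream/promises';

// Hypothetical document shape, for illustration only.
interface WidgetDocument {
  title: string;
  location: string;
  text: string;
}

// Buffers incoming documents and hands them to `indexBatch` in groups of
// `batchSize`, flushing any remainder when the stream ends.
function createBatchIndexer(
  batchSize: number,
  indexBatch: (docs: WidgetDocument[]) => Promise<void>,
): Writable {
  let batch: WidgetDocument[] = [];
  const flush = async () => {
    if (batch.length === 0) return;
    const docs = batch;
    batch = [];
    await indexBatch(docs);
  };
  return new Writable({
    objectMode: true,
    write(doc: WidgetDocument, _encoding, callback) {
      batch.push(doc);
      if (batch.length < batchSize) {
        callback();
        return;
      }
      // Wait for the batch to be written before accepting more documents.
      flush().then(() => callback(), callback);
    },
    final(callback) {
      flush().then(() => callback(), callback);
    },
  });
}

// Pipes documents into the indexer and records the size of each batch.
async function indexAll(
  docs: WidgetDocument[],
  batchSize: number,
): Promise<number[]> {
  const batchSizes: number[] = [];
  await pipeline(
    Readable.from(docs),
    createBatchIndexer(batchSize, async batch => {
      batchSizes.push(batch.length);
    }),
  );
  return batchSizes;
}
```

Deferring the `write` callback until the batch has been flushed lets stream backpressure slow the upstream collator while the search engine is busy.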
+
+[obj-mode]: https://nodejs.org/docs/latest-v14.x/api/stream.html#stream_object_mode
+[read-stream]: https://nodejs.org/docs/latest-v14.x/api/stream.html#stream_readable_streams
+[async-gen]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of#iterating_over_async_generators