Document stream-based search

Signed-off-by: Eric Peterson <ericpeterson@spotify.com>
Eric Peterson
2022-02-26 21:05:24 +01:00
parent d0993bc6f1
commit 022507c860
9 changed files with 380 additions and 15 deletions
@@ -0,0 +1,14 @@
---
'@backstage/plugin-search-backend-node': minor
'@backstage/search-common': minor
---
The Backstage Search Platform's indexing process has been rewritten as a stream
pipeline in order to improve efficiency and performance on large document sets.
The concepts of `Collator` and `Decorator` have been replaced with readable and
transform object streams (respectively), as well as factory classes to
instantiate them.
Accordingly, the `SearchEngine.index()` method has also been replaced with a
`getIndexer()` factory method that resolves to a writable object stream.
@@ -0,0 +1,6 @@
---
'@backstage/plugin-search-backend-module-pg': minor
---
The `PgSearchEngine` implements the new stream-based indexing process expected
by the latest `@backstage/plugin-search-backend-node`.
@@ -0,0 +1,13 @@
---
'@backstage/plugin-techdocs-backend': patch
---
A `DefaultTechDocsCollatorFactory`, which works with the new stream-based
search indexing subsystem, is now available. The `DefaultTechDocsCollator` will
continue to be available for those unable to upgrade to the stream-based
`@backstage/plugin-search-backend-node` (and related packages); however, it is
now marked as deprecated and will be removed in a future version.
To upgrade this plugin and the search indexing subsystem in one go, check
[this changelog](https://github.com/backstage/backstage/blob/master/packages/create-app/CHANGELOG.md)
for necessary changes to your search backend plugin configuration.
@@ -0,0 +1,43 @@
---
'@backstage/create-app': patch
---
The Backstage Search Platform's indexing process has been rewritten as a stream
pipeline in order to improve efficiency and performance on large document sets.
To take advantage of this, upgrade to the latest version of
`@backstage/plugin-search-backend-node`, as well as any backend plugins whose
collators you are using. Then, make the following changes to your
`/packages/backend/src/plugins/search.ts` file:
```diff
-import { DefaultCatalogCollator } from '@backstage/plugin-catalog-backend';
-import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
+import { DefaultCatalogCollatorFactory } from '@backstage/plugin-catalog-backend';
+import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
// ...
const indexBuilder = new IndexBuilder({ logger, searchEngine });
indexBuilder.addCollator({
defaultRefreshIntervalSeconds: 600,
- collator: DefaultCatalogCollator.fromConfig(config, { discovery }),
+ factory: DefaultCatalogCollatorFactory.fromConfig(config, { discovery }),
});
indexBuilder.addCollator({
defaultRefreshIntervalSeconds: 600,
- collator: DefaultTechDocsCollator.fromConfig(config, {
+ factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
discovery,
logger,
}),
});
```
If you've written custom collators, decorators, or search engines in your
Backstage backend instance, you will need to re-implement them as readable,
transform, and writable streams respectively (including factory classes for
instantiating them). [A how-to guide for refactoring](https://backstage.io/docs/features/search/how-to-guides#rewriting-alpha-style-collators-for-beta)
existing implementations is available.
@@ -0,0 +1,6 @@
---
'@backstage/plugin-search-backend-module-elasticsearch': minor
---
The `ElasticSearchSearchEngine` implements the new stream-based indexing
process expected by the latest `@backstage/plugin-search-backend-node`.
@@ -0,0 +1,13 @@
---
'@backstage/plugin-catalog-backend': patch
---
A `DefaultCatalogCollatorFactory`, which works with the new stream-based
search indexing subsystem, is now available. The `DefaultCatalogCollator` will
continue to be available for those unable to upgrade to the stream-based
`@backstage/plugin-search-backend-node` (and related packages); however, it is
now marked as deprecated and will be removed in a future version.
To upgrade this plugin and the search indexing subsystem in one go, check
[this changelog](https://github.com/backstage/backstage/blob/master/packages/create-app/CHANGELOG.md)
for necessary changes to your search backend plugin configuration.
@@ -213,6 +213,7 @@ parallelization
Patrik
Peloton
performant
Performant
plantuml
Platformize
Podman
@@ -54,13 +54,14 @@ An index is a collection of such documents of a given type.
### Collators
You need to be able to search something! Collators are the way to define what
can be searched. Specifically, they're classes which return documents conforming
to a minimum set of fields (including a document title, location, and text), but
which can contain any other fields as defined by the collator itself. One
collator is responsible for defining and collecting documents of a type.
can be searched. Specifically, they're readable object streams of documents that
conform to a minimum set of fields (including a document title, location, and
text), but which can contain any other fields as defined by the collator itself.
One collator is responsible for defining and collecting documents of a type.
Some plugins, like the Catalog Backend, provide so-called "default" collators
which you can use out-of-the-box to start searching across Backstage quickly.
Some plugins, like the Catalog Backend, provide so-called "default" collator
factories which you can use out-of-the-box to start searching across Backstage
quickly.
### Decorators
@@ -68,9 +69,15 @@ Sometimes you want to add extra information to a set of documents in your search
index that the collator may not be aware of. For example, the Software Catalog
knows about software entities, but it may not know about their usage or quality.
Decorators are classes which can add extra fields to pre-collated documents.
This extra metadata could then be used to bias search results or otherwise
improve the search experience in your Backstage instance.
Decorators are transform streams which sit between a collator (read stream) and
an indexer (write stream) during the indexing process. They can be used to add
extra fields to documents as they are being collated and indexed. This extra
metadata could then be used to bias search results or otherwise improve the
search experience in your Backstage instance.
In addition to adding extra metadata, decorators (like any transform stream) can
also be used to remove metadata, filter out documents, or even add extra
documents at index time.
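
To make the relationship between these concepts concrete, here is a minimal
sketch of the read, transform, write shape using only Node.js core streams. It
does not use any Backstage classes: the `WidgetDocument` type, the sample data,
and the `createCollator`, `createDecorator`, `createIndexer`, and `runIndexing`
names are all assumptions made for illustration.

```typescript
import { Readable, Transform, Writable } from 'stream';
import { pipeline } from 'stream/promises';

// Hypothetical document shape with the minimum fields plus one extra.
interface WidgetDocument {
  title: string;
  location: string;
  text: string;
  owner?: string;
}

// Collator: a readable object stream that emits one document at a time.
function createCollator(): Readable {
  const docs: WidgetDocument[] = [
    { title: 'Widget A', location: '/widgets/a', text: 'First widget' },
    { title: 'Widget B', location: '/widgets/b', text: '' },
  ];
  return Readable.from(docs);
}

// Decorator: a transform stream that adds a field and drops empty documents.
function createDecorator(): Transform {
  return new Transform({
    objectMode: true,
    transform(doc: WidgetDocument, _encoding, done) {
      if (doc.text === '') {
        done(); // Emitting nothing filters the document out.
        return;
      }
      done(null, { ...doc, owner: 'team-widgets' });
    },
  });
}

// Indexer: a writable object stream. Here it only collects documents in
// memory; a real indexer would write them to a search engine.
function createIndexer(indexed: WidgetDocument[]): Writable {
  return new Writable({
    objectMode: true,
    write(doc: WidgetDocument, _encoding, done) {
      indexed.push(doc);
      done();
    },
  });
}

async function runIndexing(): Promise<WidgetDocument[]> {
  const indexed: WidgetDocument[] = [];
  await pipeline(createCollator(), createDecorator(), createIndexer(indexed));
  return indexed;
}
```

Because each stage handles one document at a time, no stage needs to hold the
full document set in memory.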
### The Scheduler
@@ -48,10 +48,10 @@ const app = createApp({
## How to index TechDocs documents
The TechDocs plugin has supported integrations to Search, meaning that it
provides a default collator ready to be used.
provides a default collator factory ready to be used.
The purpose of this guide is to walk you through how to register the
[DefaultTechDocsCollator](https://github.com/backstage/backstage/blob/master/plugins/techdocs-backend/src/search/DefaultTechDocsCollator.ts)
[DefaultTechDocsCollatorFactory](https://github.com/backstage/backstage/blob/master/plugins/techdocs-backend/src/search/DefaultTechDocsCollatorFactory.ts)
in your App, so that you can get TechDocs documents indexed.
If you have been through the
@@ -60,18 +60,19 @@ you should have the `packages/backend/src/plugins/search.ts` file available. If
so, you can go ahead and follow this guide - if not, start by going through the
getting started guide.
1. Import the DefaultTechDocsCollator from `@backstage/plugin-techdocs-backend`.
1. Import the `DefaultTechDocsCollatorFactory` from
`@backstage/plugin-techdocs-backend`.
```typescript
import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
```
2. Register the DefaultTechDocsCollator with the IndexBuilder.
2. Register the `DefaultTechDocsCollatorFactory` with the IndexBuilder.
```typescript
indexBuilder.addCollator({
defaultRefreshIntervalSeconds: 600,
collator: DefaultTechDocsCollator.fromConfig(config, {
factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
discovery,
logger,
tokenManager,
@@ -131,3 +132,264 @@ indexBuilder.addCollator({
As shown above, you can add a catalog entity filter to narrow down what catalog
entities are indexed by the search engine.
## How to migrate from Search Alpha to Beta
For the purposes of this guide, Search Beta version is defined as:
- **Search Plugin**: At least `v0.x.y`
- **Search Backend Plugin**: At least `v0.x.y`
- **Search Backend Node**: At least `v0.x.y`
In the Beta version, the Search Platform's indexing process has been rewritten
as a stream pipeline in order to improve efficiency and performance on large
sets of documents.
If you've not yet extended the Search Platform with custom code, and have
instead taken advantage of default collators, decorators, and search engines
provided by existing plugins, the migration process is fairly straightforward:
1. Upgrade to at least version `0.x.y` of
`@backstage/plugin-search-backend-node`, as well as any backend plugins whose
collators you are using (e.g. at least version `0.x.y` of
`@backstage/plugin-catalog-backend` and/or version `0.x.y` of
`@backstage/plugin-techdocs-backend`).
2. Then, make the following changes to your
`/packages/backend/src/plugins/search.ts` file:
```diff
-import { DefaultCatalogCollator } from '@backstage/plugin-catalog-backend';
-import { DefaultTechDocsCollator } from '@backstage/plugin-techdocs-backend';
+import { DefaultCatalogCollatorFactory } from '@backstage/plugin-catalog-backend';
+import { DefaultTechDocsCollatorFactory } from '@backstage/plugin-techdocs-backend';
// ...
const indexBuilder = new IndexBuilder({ logger, searchEngine });
indexBuilder.addCollator({
defaultRefreshIntervalSeconds: 600,
- collator: DefaultCatalogCollator.fromConfig(config, { discovery }),
+ factory: DefaultCatalogCollatorFactory.fromConfig(config, { discovery }),
});
indexBuilder.addCollator({
defaultRefreshIntervalSeconds: 600,
- collator: DefaultTechDocsCollator.fromConfig(config, {
+ factory: DefaultTechDocsCollatorFactory.fromConfig(config, {
discovery,
logger,
}),
});
```
Any custom collators, decorators, or search engine implementations will require
minor refactoring. Continue on for details.
### Rewriting alpha-style collators for beta
In alpha versions of the Backstage Search Platform, collators were classes that
implemented an `execute` method which resolved an `IndexableDocument` array.
In beta versions, the logic encapsulated by the aforementioned `execute` method
is contained within an [object-mode][obj-mode] `Readable` stream where each
object pushed onto the stream is of type `IndexableDocument`. Instances of this
stream are instantiated by a factory class conforming to the
`DocumentCollatorFactory` interface.
The optimal conversion strategy will vary depending on the collator's logic, but
the simplest conversion can follow a process like this:
1. Rename your collator class to something like `YourCollatorFactory` and update
it to implement `DocumentCollatorFactory` instead of `DocumentCollator`.
2. Update its `execute` method so that it is an async generator returning
`AsyncGenerator<YourIndexableDocument>` instead of resolving
`YourIndexableDocument[]`.
3. Implement `DocumentCollatorFactory`'s `getCollator` method which resolves to
`Readable.from(this.execute())` (which is a utility for creating [readable
streams][read-stream] from [async generators][async-gen]).
```ts
import { DocumentCollatorFactory } from '@backstage/plugin-search-backend-node';
import { Readable } from 'stream';

export class YourCollatorFactory implements DocumentCollatorFactory {
  public readonly type: string = 'your-type';

  // `this.client` stands in for whatever API client your existing collator
  // already uses; `YourIndexableDocument` is your own document type.
  async *execute(): AsyncGenerator<YourIndexableDocument> {
    const widgets = await this.client.getWidgets();
    for (const widget of widgets) {
      yield {
        title: widget.name,
        location: widget.url,
        text: widget.description,
      };
    }
  }

  // Resolves a readable object stream that emits one document at a time.
  async getCollator() {
    return Readable.from(this.execute());
  }
}
```
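
If the underlying API is paginated, the same async generator approach lets the
collator fetch one page at a time instead of loading every document up front.
The following is a sketch using only Node.js core modules; `fetchWidgetPage`,
the `Page` type, and the in-memory data are hypothetical stand-ins for a real
client.

```typescript
import { Readable } from 'stream';

interface WidgetDocument {
  title: string;
  location: string;
  text: string;
}

type Page = { items: WidgetDocument[]; nextCursor?: number };

// Hypothetical data source standing in for a remote, paginated API.
const PAGE_SIZE = 2;
const allWidgets: WidgetDocument[] = ['a', 'b', 'c'].map(id => ({
  title: `Widget ${id}`,
  location: `/widgets/${id}`,
  text: `Description of widget ${id}`,
}));

async function fetchWidgetPage(cursor: number): Promise<Page> {
  const next = cursor + PAGE_SIZE;
  return {
    items: allWidgets.slice(cursor, next),
    nextCursor: next < allWidgets.length ? next : undefined,
  };
}

// Yields documents page by page; the next page is only requested once the
// consumer has pulled every document from the current one.
async function* collateWidgets(): AsyncGenerator<WidgetDocument> {
  let cursor: number | undefined = 0;
  while (cursor !== undefined) {
    const page: Page = await fetchWidgetPage(cursor);
    yield* page.items;
    cursor = page.nextCursor;
  }
}

function getCollator(): Readable {
  return Readable.from(collateWidgets());
}
```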
Note: it may be possible to simplify your collator dramatically! If your custom
collator was previously using streams under the hood (for example, by reading
newline delimited JSON from a local or remote file), you could just expose the
stream directly via a simple factory class:
```ts
import { DocumentCollatorFactory } from '@backstage/plugin-search-backend-node';
import { createReadStream } from 'fs';
import { parse } from '@jsonlines/core';
export class YourCollatorFactory implements DocumentCollatorFactory {
public readonly type: string = 'your-type';
async getCollator() {
const parseStream = parse();
return createReadStream('./documents.ndjson').pipe(parseStream);
}
}
```
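
If you would rather not add a parsing dependency, newline-delimited JSON can
also be read with Node.js core modules alone. This is a sketch of an
alternative to the example above, not a drop-in replacement: the `parseNdjson`
and `getCollator` names are assumptions, and no validation of the parsed
objects is performed.

```typescript
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import { Readable } from 'stream';

// Yields one parsed object per non-empty line of the given input stream.
async function* parseNdjson(
  input: NodeJS.ReadableStream,
): AsyncGenerator<unknown> {
  const lines = createInterface({ input, crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim() !== '') {
      yield JSON.parse(line);
    }
  }
}

// Wraps the generator in a readable object stream, one document per line.
function getCollator(path: string): Readable {
  return Readable.from(parseNdjson(createReadStream(path)));
}
```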
### Rewriting alpha-style decorators for beta
In alpha versions of the Backstage Search Platform, decorators were classes that
implemented an `execute` method which took an `IndexableDocument` array as an
argument, and resolved a modified array of the same type.
In beta versions, the logic encapsulated by the aforementioned `execute` method
is contained within an object-mode `Transform` stream which reads objects of
type `IndexableDocument`, and writes objects of a conforming type. Similar to
collators, instances of this stream are instantiated by a factory class
conforming to the `DocumentDecoratorFactory` interface.
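
For reference, a from-scratch object-mode `Transform` might look roughly like
the sketch below, which uses only Node.js core modules. The `FunctionDecorator`
class, the `WidgetDocument` type, and the `exampleDecorate` rule are all
hypothetical; the sketch simply shows how a single document, an array of
documents, or `undefined` could each be mapped onto stream output.

```typescript
import { Readable, Transform, TransformCallback } from 'stream';

interface WidgetDocument {
  title: string;
  location: string;
  text: string;
  variants?: string[];
}

type DecorateResult = WidgetDocument | WidgetDocument[] | undefined;

class FunctionDecorator extends Transform {
  constructor(
    private readonly decorate: (doc: WidgetDocument) => Promise<DecorateResult>,
  ) {
    super({ objectMode: true });
  }

  _transform(
    doc: WidgetDocument,
    _encoding: BufferEncoding,
    done: TransformCallback,
  ): void {
    this.decorate(doc).then(result => {
      // undefined emits nothing, an array emits each item, and a single
      // document emits itself.
      const docs =
        result === undefined ? [] : Array.isArray(result) ? result : [result];
      docs.forEach(d => this.push(d));
      done();
    }, done);
  }
}

// Example rule: drop empty documents, emit one document per variant, and pass
// everything else through unchanged.
async function exampleDecorate(doc: WidgetDocument): Promise<DecorateResult> {
  if (doc.text === '') return undefined;
  if (doc.variants) {
    return doc.variants.map(v => ({ ...doc, title: `${doc.title} (${v})` }));
  }
  return doc;
}

// Small helper that drains a stream into an array.
async function collect(stream: Readable): Promise<WidgetDocument[]> {
  const out: WidgetDocument[] = [];
  for await (const doc of stream) out.push(doc as WidgetDocument);
  return out;
}
```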
Although you can choose to implement a `Transform` stream from scratch, the
`@backstage/plugin-search-backend-node` package provides a `DecoratorBase` class
in order to simplify the developer experience. With this base class, all that's
needed is to transfer your old decorator class logic into the base class' three
methods (`initialize`, `decorate`, and `finalize`), and implement the factory
class that instantiates the stream:
```ts
import { DecoratorBase } from '@backstage/plugin-search-backend-node';
import { DocumentDecoratorFactory } from '@backstage/search-common';

export class YourDecorator extends DecoratorBase {
  async initialize() {
    // Setup logic. Performed once before any documents are consumed.
  }

  async decorate(
    document: YourIndexableDocument,
  ): Promise<YourIndexableDocument | YourIndexableDocument[] | undefined> {
    // Perform transformation logic here.
    return document;
  }

  async finalize() {
    // Teardown logic. Performed once after all documents have been consumed.
  }
}

export class YourDecoratorFactory implements DocumentDecoratorFactory {
  async getDecorator() {
    return new YourDecorator();
  }
}
```
Note the return type of the `decorate` method: each of the three possible
return values can be used to different effect.
- By resolving a single `YourIndexableDocument` object, your decorator can be
used to make simple transformations:
```ts
class BooleanWidgetCoolnessDecorator extends DecoratorBase {
  async decorate(widget) {
    // Perform a simple, 1:1 transformation: convert the string 'true' to a
    // boolean, and anything else to false.
    widget.isCool = widget.isCool === 'true';
    return widget;
  }
}
```
- By resolving `undefined`, your decorator can filter out documents which
shouldn't be in the index:
```ts
class OnlyCoolWidgetsDecorator extends DecoratorBase {
  async decorate(widget) {
    // Perform a simple filter operation.
    return widget.isCool ? widget : undefined;
  }
}
```
- By resolving an array of `YourIndexableDocument` objects, you can generate
multiple documents based on the content of one:
```ts
class WidgetByVariantDecorator extends DecoratorBase {
  async decorate(widget) {
    // Generate one widget doc per widget variant.
    return widget.variants.map(variant => {
      // Each widget doc is the given widget plus a "variant" property
      // pulled from a widget.variants string array.
      return {
        ...widget,
        variant,
      };
    });
  }
}
```
In alpha versions, a decorator had access to every `IndexableDocument`
simultaneously. This is no longer possible in beta versions (precisely to make
the indexing process more efficient and performant). You will need to modify
your decorator's logic so that it does not need access to every document at
once.
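
One possible way to adapt such a decorator, assuming the cross-document data it
relied on can be fetched independently of the document stream, is to load that
data once and then decorate each document on its own. The sketch below uses
only Node.js core modules; `loadUsageCounts`, `UsageDecorator`, and the
`usageCount` field are hypothetical.

```typescript
import { Readable, Transform, TransformCallback } from 'stream';

interface WidgetDocument {
  title: string;
  location: string;
  text: string;
  usageCount?: number;
}

// Hypothetical lookup, standing in for a call to an analytics service.
async function loadUsageCounts(): Promise<Map<string, number>> {
  return new Map([['/widgets/a', 42]]);
}

class UsageDecorator extends Transform {
  private usage?: Promise<Map<string, number>>;

  constructor() {
    super({ objectMode: true });
  }

  _transform(
    doc: WidgetDocument,
    _encoding: BufferEncoding,
    done: TransformCallback,
  ): void {
    // Load the lookup table once, when the first document arrives.
    if (!this.usage) {
      this.usage = loadUsageCounts();
    }
    this.usage.then(counts => {
      // Each document is decorated independently of all the others.
      done(null, { ...doc, usageCount: counts.get(doc.location) ?? 0 });
    }, done);
  }
}

// Small helper that runs documents through the decorator and collects them.
async function decorateAll(docs: WidgetDocument[]): Promise<WidgetDocument[]> {
  const out: WidgetDocument[] = [];
  for await (const doc of Readable.from(docs).pipe(new UsageDecorator())) {
    out.push(doc as WidgetDocument);
  }
  return out;
}
```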
### Rewriting alpha-style search engines for beta
Search engines are responsible for both indexing documents into and querying
documents from an underlying search engine technology. While the search engine
query interface didn't change between alpha and beta versions, the indexing half
of the interface _did_ change.
In alpha versions of the Backstage Search Platform, a search engine implemented
an `index` method which took a `type` and an `IndexableDocument` array and was
responsible for writing these documents to the underlying search engine.
In beta versions, the logic encapsulated by the aforementioned `index` method is
contained within an object-mode `Writable` stream which expects objects of type
`IndexableDocument`. On the search engine class itself, the `index` method is
replaced with a `getIndexer` factory method which still takes the `type`, but
resolves an instance of the aforementioned `Writable` stream.
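
As a rough illustration of what such a `Writable` stream involves, the sketch
below batches documents by hand using only Node.js core modules. The
`BatchingIndexer` class, its `flush` callback, and the `indexAll` helper are
hypothetical; a real indexer would send each batch to the underlying search
engine instead of recording its size.

```typescript
import { Readable, Writable } from 'stream';
import { pipeline } from 'stream/promises';

interface WidgetDocument {
  title: string;
  location: string;
  text: string;
}

class BatchingIndexer extends Writable {
  private batch: WidgetDocument[] = [];

  constructor(
    private readonly flush: (docs: WidgetDocument[]) => Promise<void>,
    private readonly batchSize = 2,
  ) {
    super({ objectMode: true });
  }

  _write(
    doc: WidgetDocument,
    _encoding: BufferEncoding,
    done: (error?: Error | null) => void,
  ): void {
    this.batch.push(doc);
    if (this.batch.length < this.batchSize) {
      done();
      return;
    }
    // The batch is full: write it out before accepting more documents.
    this.flushBatch().then(() => done(), done);
  }

  // Called once after the last document; writes any partial batch.
  _final(done: (error?: Error | null) => void): void {
    this.flushBatch().then(() => done(), done);
  }

  private async flushBatch(): Promise<void> {
    if (this.batch.length === 0) return;
    const docs = this.batch;
    this.batch = [];
    await this.flush(docs);
  }
}

// Indexes the given documents and reports the size of each flushed batch.
async function indexAll(
  docs: WidgetDocument[],
  batchSize: number,
): Promise<number[]> {
  const batchSizes: number[] = [];
  const indexer = new BatchingIndexer(async batch => {
    batchSizes.push(batch.length);
  }, batchSize);
  await pipeline(Readable.from(docs), indexer);
  return batchSizes;
}
```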
Although you can choose to implement a `Writable` stream from scratch, the
`@backstage/plugin-search-backend-node` package provides a
`BatchSearchEngineIndexer` class in order to simplify the developer experience.
With this base class, which collects documents in batches of a configurable size
on your behalf, all that's needed is to transfer your old `index` method logic
into the base class' three methods (`initialize`, `index`, and `finalize`), and
implement the factory method that instantiates the stream:
```ts
import { BatchSearchEngineIndexer } from '@backstage/plugin-search-backend-node';
import { IndexableDocument, SearchEngine } from '@backstage/search-common';

export class YourSearchEngineIndexer extends BatchSearchEngineIndexer {
  // `SomeSearchEngineIndex` is an imaginary search engine indexing client.
  private readonly client: SomeSearchEngineIndex;

  constructor({ type }: { type: string }) {
    // Customize the number of documents passed to the index method per batch.
    super({ batchSize: 500 });
    // Stored as `client` so that it does not clash with the `index` method.
    this.client = new SomeSearchEngineIndex({ indexName: type });
  }

  async initialize() {
    // Setup logic. Performed once before any documents are consumed.
  }

  async index(documents: IndexableDocument[]) {
    // `batchOf` is an imaginary method on the imaginary client.
    await this.client.batchOf(documents);
  }

  async finalize() {
    // Teardown logic. Performed once after all documents have been consumed.
  }
}

export class YourSearchEngine implements SearchEngine {
  // Query-related methods are unchanged and omitted here.
  async getIndexer(type: string) {
    return new YourSearchEngineIndexer({ type });
  }
}
```
[obj-mode]: https://nodejs.org/docs/latest-v14.x/api/stream.html#stream_object_mode
[read-stream]: https://nodejs.org/docs/latest-v14.x/api/stream.html#stream_readable_streams
[async-gen]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of#iterating_over_async_generators