15 changes: 15 additions & 0 deletions THIRD-PARTY.txt
@@ -52,6 +52,9 @@ List of third-party dependencies grouped by their license type.
* Apache HBase Unsafe Wrapper (org.apache.hbase.thirdparty:hbase-unsafe:4.1.12 - https://hbase.apache.org/hbase-unsafe)
* Apache HttpAsyncClient (org.apache.httpcomponents:httpasyncclient:4.1.5 - http://hc.apache.org/httpcomponents-asyncclient)
* Apache HttpClient (org.apache.httpcomponents:httpclient:4.5.14 - http://hc.apache.org/httpcomponents-client-ga)
* Apache HttpClient (org.apache.httpcomponents.client5:httpclient5:5.6 - https://hc.apache.org/httpcomponents-client-5.5.x/5.6/httpclient5/)
* Apache HttpComponents Core HTTP/1.1 (org.apache.httpcomponents.core5:httpcore5:5.4.2 - https://hc.apache.org/httpcomponents-core-5.4.x/5.4.2/httpcore5/)
* Apache HttpComponents Core HTTP/2 (org.apache.httpcomponents.core5:httpcore5-h2:5.4.2 - https://hc.apache.org/httpcomponents-core-5.4.x/5.4.2/httpcore5-h2/)
* Apache HttpCore (org.apache.httpcomponents:httpcore:4.4.16 - http://hc.apache.org/httpcomponents-core-ga)
* Apache HttpCore NIO (org.apache.httpcomponents:httpcore-nio:4.4.16 - http://hc.apache.org/httpcomponents-core-ga)
* Apache James :: Mime4j :: Core (org.apache.james:apache-mime4j-core:0.8.13 - http://james.apache.org/mime4j/apache-mime4j-core)
@@ -216,6 +219,7 @@ List of third-party dependencies grouped by their license type.
* opensearch-compress (org.opensearch:opensearch-compress:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-core (org.opensearch:opensearch-core:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-geo (org.opensearch:opensearch-geo:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* OpenSearch Java Client (org.opensearch.client:opensearch-java:3.8.0 - https://github.com/opensearch-project/opensearch-java/)
* opensearch-secure-sm (org.opensearch:opensearch-secure-sm:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-task-commons (org.opensearch:opensearch-task-commons:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-telemetry (org.opensearch:opensearch-telemetry:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
@@ -348,6 +352,10 @@ List of third-party dependencies grouped by their license type.
* JAXB Runtime (org.glassfish.jaxb:jaxb-runtime:4.0.7 - https://eclipse-ee4j.github.io/jaxb-ri/)
* TXW2 Runtime (org.glassfish.jaxb:txw2:4.0.7 - https://eclipse-ee4j.github.io/jaxb-ri/)

Eclipse Distribution License v. 1.0, Eclipse Public License v. 2.0

* org.eclipse.yasson (org.eclipse:yasson:2.0.2 - https://projects.eclipse.org/projects/ee4j.yasson)

Eclipse Public License, Version 2.0, GPL-2.0-with-classpath-exception

* Jakarta RESTful WS API (jakarta.ws.rs:jakarta.ws.rs-api:3.1.0 - https://github.com/eclipse-ee4j/jaxrs-api)
@@ -356,6 +364,13 @@ List of third-party dependencies grouped by their license type.

* Jakarta Annotations API (jakarta.annotation:jakarta.annotation-api:1.3.5 - https://projects.eclipse.org/projects/ee4j.ca)

Eclipse Public License 2.0, GNU General Public License, version 2 with the GNU Classpath Exception

* Eclipse Parsson (org.eclipse.parsson:parsson:1.1.7 - https://github.com/eclipse-ee4j/parsson/parsson)
* Jakarta JSON Processing API (jakarta.json:jakarta.json-api:2.1.3 - https://github.com/eclipse-ee4j/jsonp)
* JSON-B API (jakarta.json.bind:jakarta.json.bind-api:2.0.0 - https://eclipse-ee4j.github.io/jsonb-api)
* JSON-P Default Provider (org.glassfish:jakarta.json:2.0.0 - https://github.com/eclipse-ee4j/jsonp)

GENERAL PUBLIC LICENSE, version 3 (GPL-3.0), GNU LESSER GENERAL PUBLIC LICENSE, version 3 (LGPL-3.0), Mozilla Public License Version 1.1

* juniversalchardet (com.github.albfernandez:juniversalchardet:2.5.0 - https://github.com/albfernandez/juniversalchardet)
63 changes: 63 additions & 0 deletions external/opensearch-java/README.md
@@ -0,0 +1,63 @@
stormcrawler-opensearch-java
===========================

A collection of resources for [OpenSearch](https://opensearch.org/) built on the
[OpenSearch Java Client 3.x](https://opensearch.org/docs/latest/clients/java/) and
Apache HttpClient 5:

* [IndexerBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch-java/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java) for indexing documents crawled with StormCrawler
* [Spouts](https://github.com/apache/stormcrawler/blob/master/external/opensearch-java/src/main/java/org/apache/stormcrawler/opensearch/persistence/AggregationSpout.java) and [StatusUpdaterBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch-java/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java) for persisting URL information in recursive crawls
* [MetricsConsumer](https://github.com/apache/stormcrawler/blob/master/external/opensearch-java/src/main/java/org/apache/stormcrawler/opensearch/metrics/MetricsConsumer.java)
* [StatusMetricsBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch-java/src/main/java/org/apache/stormcrawler/opensearch/metrics/StatusMetricsBolt.java) for sending the breakdown of URLs per status as metrics and displaying its evolution over time.

This module is largely functionally equivalent to the legacy `external/opensearch`
module (which is based on the deprecated `RestHighLevelClient` and HttpClient 4),
but uses the typed `OpenSearchClient` and the `ApacheHttpClient5TransportBuilder`
transport. The few behavioural differences are listed at the end of this
document. Unlike the legacy client, the Java Client 3.x ships neither a sniffer
nor a built-in `BulkProcessor`; this module provides an internal
`AsyncBulkProcessor` that preserves the same semantics (size-, count- and
time-based flushing, back-pressure, listener callbacks).
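
To illustrate what count-based flushing combined with back-pressure means, here is a
minimal sketch. It is not the module's `AsyncBulkProcessor`: the class
`SketchBulkBuffer`, its constructor parameters and the `sender` function are
hypothetical names chosen for this example, and size- and time-based flushing are
left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical sketch; NOT the module's actual AsyncBulkProcessor.
final class SketchBulkBuffer<T> {
    private final int maxActions;
    private final Semaphore inFlight;
    // Assumed stand-in for "send this batch to OpenSearch asynchronously".
    private final Function<List<T>, CompletableFuture<Void>> sender;
    private List<T> pending = new ArrayList<>();

    SketchBulkBuffer(int maxActions, int maxConcurrentBatches,
            Function<List<T>, CompletableFuture<Void>> sender) {
        this.maxActions = maxActions;
        this.inFlight = new Semaphore(maxConcurrentBatches);
        this.sender = sender;
    }

    // Buffers one action and flushes once the count threshold is reached.
    synchronized void add(T action) {
        pending.add(action);
        if (pending.size() >= maxActions) {
            flush();
        }
    }

    // Hands the buffered actions to the sender as a single batch.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        // Back-pressure: block the caller while too many batches are in flight.
        inFlight.acquireUninterruptibly();
        try {
            // A listener callback would be notified here, on success or failure.
            sender.apply(batch).whenComplete((ok, err) -> inFlight.release());
        } catch (RuntimeException e) {
            inFlight.release();
            throw e;
        }
    }
}
```

A time-based flush could be layered on top by calling `flush()` periodically from a
scheduled executor. Because `add` blocks while the permit limit is reached, a slow
cluster slows down the calling bolt instead of letting the buffer grow without bound.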

Getting started
---------------------

Add the dependency to your crawler project:

```xml
<dependency>
  <groupId>org.apache.stormcrawler</groupId>
  <artifactId>stormcrawler-opensearch-java</artifactId>
  <version>${stormcrawler.version}</version>
</dependency>
```

You will of course need to have both Storm and OpenSearch installed. For the
latter, see the [OpenSearch documentation](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/)
for Docker-based setups.

Schemas are automatically created by the bolts on first use; you can override
them by providing your own index definitions before starting the topology.

Configuration and dashboards
---------------------

For a ready-to-use crawler configuration, example Flux topologies, index
initialization scripts and OpenSearch Dashboards exports, refer to the
[`external/opensearch`](../opensearch) module: all of those resources are
compatible with this module and have not been duplicated here.

Differences from the legacy `external/opensearch` module
---------------------

* `opensearch.<bolt>.responseBufferSize` is no longer supported. The legacy
module used the HC4-based low-level REST client and set a heap response
buffer via `HeapBufferedResponseConsumerFactory`. The HC5-based async
transport used here does not expose an equivalent per-request override, so
the key is ignored. A `WARN` is logged at startup if it is found in the
configuration; remove it when migrating.
* `opensearch.<bolt>.sniff` is no longer supported. The legacy module enabled
node auto-discovery by default via the low-level REST client `Sniffer`. The
OpenSearch Java Client 3.x does not ship a sniffer equivalent, so this
feature is dropped. Keep the `addresses` list up to date manually or put a
load balancer in front of the cluster. A `WARN` is logged at startup if the
key is found in the configuration; remove it when migrating.
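
When migrating, it can help to scan an existing configuration for the two keys
above before deploying. The following is a small sketch of such a check, not the
module's own startup code: the class `LegacyKeyCheck` and its method are
hypothetical, and the bolt type passed in (for example `status`) is an assumed
value standing in for `<bolt>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the module performs its own check and logs a WARN.
final class LegacyKeyCheck {
    // Key suffixes documented above as no longer supported.
    private static final List<String> UNSUPPORTED_SUFFIXES =
            List.of("responseBufferSize", "sniff");

    // Returns the unsupported keys present in conf for the given bolt type.
    static List<String> findUnsupported(Map<String, Object> conf, String boltType) {
        List<String> found = new ArrayList<>();
        for (String suffix : UNSUPPORTED_SUFFIXES) {
            String key = "opensearch." + boltType + "." + suffix;
            if (conf.containsKey(key)) {
                found.add(key);
            }
        }
        return found;
    }
}
```

Calling `LegacyKeyCheck.findUnsupported(conf, "status")` on a loaded configuration
map returns the keys to delete; an empty list means nothing needs to change for
that bolt.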
114 changes: 114 additions & 0 deletions external/opensearch-java/pom.xml
@@ -0,0 +1,114 @@
<?xml version="1.0" encoding="UTF-8"?>

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache.stormcrawler</groupId>
    <artifactId>stormcrawler-external</artifactId>
    <version>3.5.2-SNAPSHOT</version>
    <relativePath>../pom.xml</relativePath>
  </parent>

  <properties>
    <opensearch.server.version>3.5.0</opensearch.server.version>
    <opensearch.java.version>3.8.0</opensearch.java.version>
    <jacoco.haltOnFailure>true</jacoco.haltOnFailure>
    <jacoco.classRatio>0.27</jacoco.classRatio>
    <jacoco.instructionRatio>0.27</jacoco.instructionRatio>
    <jacoco.methodRatio>0.25</jacoco.methodRatio>
    <jacoco.branchRatio>0.17</jacoco.branchRatio>
    <jacoco.lineRatio>0.29</jacoco.lineRatio>
    <jacoco.complexityRatio>0.13</jacoco.complexityRatio>
  </properties>

  <artifactId>stormcrawler-opensearch-java</artifactId>
  <packaging>jar</packaging>

  <name>stormcrawler-opensearch-java</name>
  <url>
    https://github.com/apache/stormcrawler/tree/master/external/opensearch-java</url>
  <description>OpenSearch module for Apache StormCrawler using the new opensearch-java client</description>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <executions>
          <execution>
            <id>default-test</id>
            <phase>test</phase>
            <goals>
              <goal>test</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <systemPropertyVariables>
            <opensearch-version>${opensearch.server.version}</opensearch-version>
          </systemPropertyVariables>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.opensearch.client</groupId>
      <artifactId>opensearch-java</artifactId>
      <version>${opensearch.java.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.stormcrawler</groupId>
      <artifactId>stormcrawler-core</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.testcontainers</groupId>
      <artifactId>testcontainers</artifactId>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.testcontainers</groupId>
      <artifactId>junit-jupiter</artifactId>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.awaitility</groupId>
      <artifactId>awaitility</artifactId>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
      <scope>test</scope>
    </dependency>

  </dependencies>
</project>