Remove addToResult() and multiple loaders #158

Merged
merged 2 commits into from
Aug 7, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -8,7 +8,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [2.0.0] - 2024-x-x
### Changed
* __BREAKING__: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()` and `BaseStep::keepInputData()`. They were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()` and `Step::keepInputAs()`.
* __BREAKING__: As the `addToResult()` method was removed, the library no longer uses `toArrayForAddToResult()` methods on output objects. Please use `toArrayForResult()` instead. Accordingly, `RespondedRequest::toArrayForAddToResult()` is renamed to `RespondedRequest::toArrayForResult()`.
* __BREAKING__: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). They were part of the `addToResult` feature. Kept data is now added to the `keep` property instead.
* __BREAKING__: The return type of the `Crawler::loader()` method no longer allows `array`, so it is no longer possible to provide multiple loaders from the crawler. Instead, provide a custom loader directly to a step, as described in the next entry.
* __BREAKING__: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now just extend the `Step` class and use the trait. As it is no longer possible to have multiple loaders, the `addLoader` method was renamed to `setLoader`. For the same reason, the methods `useLoader()` and `usesLoader()`, which chose one of multiple loaders from the crawler by key, are removed. Instead, you can now provide a different loader directly to a single step (instead of to the crawler), using the trait's new `withLoader()` method (e.g. `Http::get()->withLoader($loader)`).
* __BREAKING__: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class, providing the methods `only()` and `except()` that can be used to restrict retries to certain HTTP response status codes. Previously the method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and call other loader methods after it, you need to refactor this.
* __BREAKING__: Removed the `Microseconds` class from this package. It was moved to the `crwlr/utils` package that you can use instead.

## [1.10.0] - 2024-08-05
### Added
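A note on the `retryCachedErrorResponses()` entry above: the chaining break is easy to miss, so here is a minimal, self-contained sketch of it. `SketchLoader`, `SketchRetryManager`, `otherSetting()` and `shallBeRetried()` are hypothetical stand-ins, not the library's classes; only the method names `retryCachedErrorResponses()`, `only()` and `except()` come from the changelog, and passing the status codes as an array is an assumption.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the retry manager. Only the method names
// only() and except() are taken from the changelog entry.
final class SketchRetryManager
{
    /** @var int[]|null */
    private ?array $only = null;

    /** @var int[] */
    private array $except = [];

    /** @param int[] $statusCodes */
    public function only(array $statusCodes): static
    {
        $this->only = $statusCodes;

        return $this;
    }

    /** @param int[] $statusCodes */
    public function except(array $statusCodes): static
    {
        $this->except = $statusCodes;

        return $this;
    }

    // Assumed helper: decides whether a cached error response is retried.
    public function shallBeRetried(int $statusCode): bool
    {
        if ($statusCode < 400) {
            return false; // not an error response
        }

        if ($this->only !== null) {
            return in_array($statusCode, $this->only, true);
        }

        return !in_array($statusCode, $this->except, true);
    }
}

// Hypothetical stand-in for a loader with one unrelated fluent setter.
final class SketchLoader
{
    public int $setting = 0;

    private ?SketchRetryManager $retryManager = null;

    public function otherSetting(int $value): static
    {
        $this->setting = $value;

        return $this;
    }

    // Returns the retry manager, not $this, so the loader chain ends here.
    public function retryCachedErrorResponses(): SketchRetryManager
    {
        return $this->retryManager ??= new SketchRetryManager();
    }

    public function retries(int $statusCode): bool
    {
        return $this->retryManager?->shallBeRetried($statusCode) ?? false;
    }
}

$loader = new SketchLoader();

// A v1-style chain like the following no longer works, because
// otherSetting() does not exist on the retry manager:
// $loader->retryCachedErrorResponses()->otherSetting(5);

$loader->otherSetting(5);
$loader->retryCachedErrorResponses()->only([429, 503]);
```

In short: call other loader methods first or in a separate statement, and put `retryCachedErrorResponses()` at the end of the chain.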
4 changes: 2 additions & 2 deletions composer.json
@@ -68,8 +68,8 @@
}
},
"scripts": {
"test": "pest --exclude-group integration --display-warnings",
"test-integration": "pest --group integration --display-warnings",
"test": "pest --exclude-group integration --display-warnings --bail",
"test-integration": "pest --group integration --display-warnings --bail",
"stan": "@php -d memory_limit=4G vendor/bin/phpstan analyse",
"cs": "php-cs-fixer fix -v --dry-run",
"cs-fix": "php-cs-fixer fix -v",
126 changes: 21 additions & 105 deletions src/Crawler.php
@@ -3,14 +3,11 @@
namespace Crwlr\Crawler;

use Closure;
use Crwlr\Crawler\Exceptions\UnknownLoaderKeyException;
use Crwlr\Crawler\Loader\AddLoadersToStepAction;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Logger\CliLogger;
use Crwlr\Crawler\Steps\BaseStep;
use Crwlr\Crawler\Steps\Exceptions\PreRunValidationException;
use Crwlr\Crawler\Steps\Group;
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepInterface;
use Crwlr\Crawler\Stores\StoreInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
@@ -24,9 +21,9 @@ abstract class Crawler
protected UserAgentInterface $userAgent;

/**
* @var LoaderInterface|array<string, LoaderInterface>
* @var LoaderInterface
*/
protected LoaderInterface|array $loader;
protected LoaderInterface $loader;

protected LoggerInterface $logger;

@@ -68,9 +65,9 @@ abstract protected function userAgent(): UserAgentInterface;
/**
* @param UserAgentInterface $userAgent
* @param LoggerInterface $logger
* @return LoaderInterface|array<string, LoaderInterface>
* @return LoaderInterface
*/
abstract protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array;
abstract protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface;

public static function group(): Group
{
@@ -146,24 +143,17 @@ public function inputs(array $inputs): static
}

/**
* @param string|StepInterface $stepOrResultKey
* @param StepInterface|null $step
* @param StepInterface $step
* @return $this
* @throws InvalidArgumentException|UnknownLoaderKeyException
* @throws InvalidArgumentException
*/
public function addStep(string|StepInterface $stepOrResultKey, ?StepInterface $step = null): static
public function addStep(StepInterface $step): static
{
if (is_string($stepOrResultKey) && $step === null) {
throw new InvalidArgumentException('No StepInterface object provided');
} elseif (is_string($stepOrResultKey)) {
$step->addToResult($stepOrResultKey);
} else {
$step = $stepOrResultKey;
}

$step->addLogger($this->logger);

(new AddLoadersToStepAction($this->loader, $step))->invoke();
if (method_exists($step, 'setLoader')) {
$step->setLoader($this->loader);
}

if ($step instanceof BaseStep) {
$step->setParentCrawler($this);
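For readers of this hunk: the `method_exists($step, 'setLoader')` check replaces the removed `AddLoadersToStepAction`. Below is a minimal, self-contained sketch of how a trait-based loading step could combine the crawler's loader with a per-step one. All `Sketch*` names, `NamedLoader`, `getLoader()` and `attachCrawlerLoader()` are hypothetical; only `setLoader()` and `withLoader()` are named in this PR. That the loader passed to `withLoader()` takes precedence over the crawler's loader is an assumption, since the trait itself is not part of this diff.

```php
<?php

declare(strict_types=1);

interface SketchLoaderInterface
{
    public function load(string $url): string;
}

final class NamedLoader implements SketchLoaderInterface
{
    public function __construct(private string $name) {}

    public function load(string $url): string
    {
        return $this->name . ' loaded ' . $url;
    }
}

// Hypothetical version of a loading-step trait.
trait SketchLoadingStep
{
    private ?SketchLoaderInterface $loader = null;

    private ?SketchLoaderInterface $customLoader = null;

    // Called by the crawler with its own (single) loader.
    public function setLoader(SketchLoaderInterface $loader): static
    {
        $this->loader = $loader;

        return $this;
    }

    // Called by the user to give this one step a different loader.
    public function withLoader(SketchLoaderInterface $loader): static
    {
        $this->customLoader = $loader;

        return $this;
    }

    // Assumption: the step-level loader wins over the crawler's loader.
    protected function getLoader(): SketchLoaderInterface
    {
        return $this->customLoader
            ?? $this->loader
            ?? throw new LogicException('No loader set');
    }
}

final class SketchGetStep
{
    use SketchLoadingStep;

    public function run(string $url): string
    {
        return $this->getLoader()->load($url);
    }
}

// A step that does not load anything and therefore has no setLoader().
final class SketchPlainStep {}

// Mirrors the check in addStep(): only steps exposing setLoader() get one.
function attachCrawlerLoader(object $step, SketchLoaderInterface $crawlerLoader): void
{
    if (method_exists($step, 'setLoader')) {
        $step->setLoader($crawlerLoader);
    }
}

$crawlerLoader = new NamedLoader('crawler');

$default = new SketchGetStep();
attachCrawlerLoader($default, $crawlerLoader);

$custom = (new SketchGetStep())->withLoader(new NamedLoader('custom'));
attachCrawlerLoader($custom, $crawlerLoader);

// No-op: the plain step has no setLoader() method.
attachCrawlerLoader(new SketchPlainStep(), $crawlerLoader);
```

The duck-typed check keeps `addStep()` independent of any loading-step interface, which fits the removal of `LoadingStepInterface` in this PR.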
@@ -266,8 +256,8 @@ protected function invokeStepsRecursive(Input $input, StepInterface $step, int $

$nextStep = $this->nextStep($stepIndex);

if (!$nextStep && $input->result === null) {
yield from $this->storeAndReturnResults($outputs, $step->createsResult() === true, true);
if (!$nextStep) {
yield from $this->storeAndReturnOutputsAsResults($outputs);

return;
}
@@ -279,85 +269,22 @@ protected function invokeStepsRecursive(Input $input, StepInterface $step, int $

$this->outputHook?->call($this, $output, $stepIndex, $step);

if ($nextStep) {
if ($input->result === null && $step->createsResult()) {
$childOutputs = $this->invokeStepsRecursive(
new Input($output),
$nextStep,
$stepIndex + 1,
);

/** @var Generator<Output> $childOutputs */

yield from $this->storeAndReturnResults($childOutputs, true);
} else {
yield from $this->invokeStepsRecursive(
new Input($output),
$nextStep,
$stepIndex + 1,
);
}
} else {
yield $output;
}
}
}

/**
* @param Generator<Output> $outputs
* @return Generator<Result>
*/
protected function storeAndReturnResults(
Generator $outputs,
bool $manuallyDefinedResults = false,
bool $callOutputHook = false,
): Generator {
if ($manuallyDefinedResults || $this->anyResultKeysDefinedInSteps()) {
yield from $this->storeAndReturnDefinedResults($outputs, $callOutputHook);
} else {
yield from $this->storeAndReturnOutputsAsResults($outputs, $callOutputHook);
}
}

/**
* @param Generator<Output> $outputs
* @return Generator<Result>
*/
protected function storeAndReturnDefinedResults(Generator $outputs, bool $callOutputHook = false): Generator
{
$results = [];

foreach ($outputs as $output) {
if ($callOutputHook) {
$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));
}

if ($output->result !== null && !in_array($output->result, $results, true)) {
$results[] = $output->result;
} elseif ($output->addLaterToResult !== null && !in_array($output->addLaterToResult, $results, true)) {
$results[] = new Result($output->addLaterToResult);
}
}

// yield results only after iterating over final outputs, because that could still add properties to result
// resources.
foreach ($results as $result) {
$this->store?->store($result);

yield $result;
yield from $this->invokeStepsRecursive(
new Input($output),
$nextStep,
$stepIndex + 1,
);
}
}

/**
* @param Generator<Output> $outputs
* @return Generator<Result>
*/
protected function storeAndReturnOutputsAsResults(Generator $outputs, bool $callOutputHook = false): Generator
protected function storeAndReturnOutputsAsResults(Generator $outputs): Generator
{
foreach ($outputs as $output) {
if ($callOutputHook) {
$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));
}
$this->outputHook?->call($this, $output, count($this->steps) - 1, end($this->steps));

$result = new Result();
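To summarize the control flow that remains after removing the result-key branches: each output of a step becomes the input of the next step, and only the outputs of the last step are turned into results. A minimal sketch of that recursion with plain callables and generators follows. `invokeStepsSketch()` and the array-shaped "result" are hypothetical simplifications; the real method also calls the output hook and the store, which are left out here.

```php
<?php

declare(strict_types=1);

/**
 * Runs $steps[$stepIndex] on $input and cascades every output down to the
 * following steps. Only outputs of the last step are yielded as results.
 *
 * @param array<int, callable> $steps each step maps one input to an iterable of outputs
 * @return Generator<int, array{value: mixed}>
 */
function invokeStepsSketch(mixed $input, array $steps, int $stepIndex = 0): Generator
{
    $outputs = $steps[$stepIndex]($input);

    if (!isset($steps[$stepIndex + 1])) {
        // Last step: every output becomes a result.
        foreach ($outputs as $output) {
            yield ['value' => $output];
        }

        return;
    }

    foreach ($outputs as $output) {
        // Not the last step: the output is the next step's input.
        yield from invokeStepsSketch($output, $steps, $stepIndex + 1);
    }
}

$steps = [
    fn(int $n): array => [$n, $n + 1], // emits two outputs per input
    fn(int $n): array => [$n * 10],    // emits one output per input
];

// preserve_keys is false because every nested generator restarts its keys at 0.
$results = iterator_to_array(invokeStepsSketch(1, $steps), false);
```

Because everything is a generator, results are produced lazily, one branch of the step tree at a time.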

@@ -420,17 +347,6 @@ protected function prepareInput(): array
}, $this->inputs);
}

protected function anyResultKeysDefinedInSteps(): bool
{
foreach ($this->steps as $step) {
if ($step->addsToOrCreatesResult()) {
return true;
}
}

return false;
}

protected function logMemoryUsage(): void
{
$memoryUsage = memory_get_usage();
@@ -445,11 +361,11 @@ protected function firstStep(): ?StepInterface
return $this->steps[0] ?? null;
}

protected function lastStep(): ?Step
protected function lastStep(): ?BaseStep
{
$lastStep = end($this->steps);

if (!$lastStep instanceof Step) {
if (!$lastStep instanceof BaseStep) {
return null;
}

7 changes: 0 additions & 7 deletions src/Exceptions/UnknownLoaderKeyException.php

This file was deleted.

4 changes: 2 additions & 2 deletions src/HttpCrawler.php
@@ -15,9 +15,9 @@
abstract class HttpCrawler extends Crawler
{
/**
* @return LoaderInterface|array<string, LoaderInterface>
* @return LoaderInterface
*/
protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
{
return new HttpLoader($userAgent, logger: $logger);
}
4 changes: 2 additions & 2 deletions src/HttpCrawler/AnonymousHttpCrawlerBuilder.php
@@ -13,7 +13,7 @@ public function __construct() {}

public function withBotUserAgent(string $productToken): HttpCrawler
{
$instance = new class () extends HttpCrawler {
$instance = new class extends HttpCrawler {
protected function userAgent(): UserAgentInterface
{
return new UserAgent('temp');
@@ -27,7 +27,7 @@ protected function userAgent(): UserAgentInterface

public function withUserAgent(string|UserAgentInterface $userAgent): HttpCrawler
{
$instance = new class () extends HttpCrawler {
$instance = new class extends HttpCrawler {
protected function userAgent(): UserAgentInterface
{
return new UserAgent('temp');
8 changes: 1 addition & 7 deletions src/Io.php
@@ -13,24 +13,18 @@ class Io
*/
final public function __construct(
protected mixed $value,
public ?Result $result = null,
public ?Result $addLaterToResult = null,
public array $keep = [],
) {
if ($value instanceof self) {
$this->value = $value->value;

$this->result ??= $value->result;

$this->addLaterToResult ??= $value->addLaterToResult;

$this->keep = $value->keep;
}
}

public function withValue(mixed $value): static
{
return new static($value, $this->result, $this->addLaterToResult, $this->keep);
return new static($value, $this->keep);
}

public function withPropertyValue(string $key, mixed $value): static
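With `result` and `addLaterToResult` gone, `keep` is the only extra state an `Io` object carries. The following self-contained sketch mirrors the constructor and `withValue()` from this hunk to show how kept data travels along when a value is replaced or an `Io` object is wrapped. `SketchIo`, `SketchOutput` and `get()` are hypothetical names used for illustration only.

```php
<?php

declare(strict_types=1);

class SketchIo
{
    /**
     * @param mixed[] $keep
     */
    final public function __construct(
        protected mixed $value,
        public array $keep = [],
    ) {
        // Wrapping another Io object unwraps its value and takes over its
        // kept data (a $keep argument passed alongside is overwritten).
        if ($value instanceof self) {
            $this->value = $value->value;

            $this->keep = $value->keep;
        }
    }

    public function get(): mixed
    {
        return $this->value;
    }

    // Returns a new object of the same class with the same kept data.
    public function withValue(mixed $value): static
    {
        return new static($value, $this->keep);
    }
}

final class SketchOutput extends SketchIo {}

$first = new SketchOutput('https://example.com', ['title' => 'Example']);

$next = $first->withValue('some body');

$wrapped = new SketchOutput($first);
```

Both `$next` and `$wrapped` carry `['title' => 'Example']`, so data kept in an early step is still available when the last step's outputs are turned into results.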
63 changes: 0 additions & 63 deletions src/Loader/AddLoadersToStepAction.php

This file was deleted.

5 changes: 4 additions & 1 deletion src/Loader/Http/Messages/RespondedRequest.php
@@ -2,6 +2,7 @@

namespace Crwlr\Crawler\Loader\Http\Messages;

use Crwlr\Crawler\Cache\Exceptions\MissingZlibExtensionException;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Utils\RequestKey;
use Crwlr\Url\Url;
@@ -57,6 +58,7 @@ public static function cacheKeyFromRequest(RequestInterface $request): string

/**
* @return mixed[]
* @throws MissingZlibExtensionException
*/
public function __serialize(): array
{
@@ -74,8 +76,9 @@ public function __serialize(): array

/**
* @return mixed[]
* @throws MissingZlibExtensionException
*/
public function toArrayForAddToResult(): array
public function toArrayForResult(): array
{
$serialized = $this->__serialize();
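For custom output classes affected by this rename, a minimal sketch of the convention: an output object exposes `toArrayForResult()`, and the code that builds results prefers it over other array conversions. `SketchResponse` and `objectToResultArray()` are hypothetical, and both the fallback order and dropping the `body` key are assumptions made for illustration, not what `RespondedRequest` or the library does.

```php
<?php

declare(strict_types=1);

final class SketchResponse
{
    public function __construct(
        private string $url,
        private int $status,
        private string $body,
    ) {}

    /**
     * @return mixed[]
     */
    public function __serialize(): array
    {
        return ['url' => $this->url, 'status' => $this->status, 'body' => $this->body];
    }

    /**
     * Formerly this would have been named toArrayForAddToResult().
     *
     * @return mixed[]
     */
    public function toArrayForResult(): array
    {
        $serialized = $this->__serialize();

        // Hypothetical trimming, just to make the result form differ
        // from the serialized form.
        unset($serialized['body']);

        return $serialized;
    }
}

/**
 * Assumed conversion order: toArrayForResult(), then toArray(), then a cast.
 *
 * @return mixed[]
 */
function objectToResultArray(object $output): array
{
    if (method_exists($output, 'toArrayForResult')) {
        return $output->toArrayForResult();
    }

    if (method_exists($output, 'toArray')) {
        return $output->toArray();
    }

    return (array) $output;
}

$asResult = objectToResultArray(new SketchResponse('https://example.com', 200, '<html></html>'));
```

Classes that still only define `toArrayForAddToResult()` would silently fall through to the later branches in a setup like this, so the rename is worth a project-wide search.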

8 changes: 0 additions & 8 deletions src/Loader/Http/Politeness/TimingUnits/Microseconds.php

This file was deleted.
