Skip to content

Conversation

@zeroSteiner
Copy link

@zeroSteiner zeroSteiner commented Jul 19, 2025

Changed

  • Changed: Changed PDF parser exception messages to state what type was received in addition to what was expected

Fixed

  • Fixed: Fixed an issue with the PDF parser that prevented it from loading files that used a ObjectRef for the Length

Thanks!

@majora2007
Copy link
Member

Can you explain a bit more about the fix and is there a way to write a unit test to trigger?

@zeroSteiner
Copy link
Author

zeroSteiner commented Jul 19, 2025

Can you explain a bit more about the fix

Sure thing, I hope this helps. So the error I ran into was while importing multiple files in my Library. I tracked down the issue to the parser receiving an ObjectRef when it was expecting an int. To address it, I referred to some that was already in the file, specifically these two locations:

  • if (value.Type == PdfLexer.TokenType.ObjectRef) {
    indirectObjects[key] = (long)value.Value;
    }
  • foreach(var key in indirectObjects.Keys) {
    _stream.Seek(_objectOffsets[indirectObjects[key]], SeekOrigin.Begin);
    _lexer.ResetBuffer();
    token = _lexer.NextToken();
    if (token.Type != PdfLexer.TokenType.ObjectStart) {
    throw new PdfMetadataExtractorException("Expected object here");
    }
    token = _lexer.NextToken();
    if (token.Type != PdfLexer.TokenType.String) {
    throw new PdfMetadataExtractorException("Expected string");
    }
    _metadata[key] = (string) token.Value;
    }

I implemented a private method that would use the existing stream to read the object reference using a temporary Lexer instance. Using the temporary instance ensured that the state of the primary _lexer is maintained since I didn't see a way to back up and restore it. Once the object is resolved, it's checked again to see if it's an integer and then things carry on as they did before.

As for the other log messages, I added their types because it was really useful while debugging to see not only what it was expecting but what it received instead. It's possible that there are other areas that could benefit from resolving ObjectRefs, but without PDF test cases on hand, I only updated the processing of the Length field.

and is there a way to write a unit test to trigger?

That might take me a little while. I don't see any existing tests referring to the PdfMetadataExtractor class so I'd need to do it from scratch I think and I'm not super familiar with .NET unit tests. Is there anything you could point me to as a good example or a place for me to get started?

@majora2007
Copy link
Member

BookServiceTests.SeriesFallBackToMetadataTitle might be a good starting point. It uses a pdf packed with the repo to test. You can also use https://github.com/Kareadita/Kavita/pull/3912/files#diff-416981523a2401e533a3b63d10f5153322f77bee20d25af37521780073f7f778 which has unit tests around expecting the code to throw.

It would be helpful if you could also send me (DM/email) a file that caused this issue, so I can test it out on my end as well.

Feel free to ask for help here or discord.

@majora2007
Copy link
Member

majora2007 commented Jul 21, 2025

Spoke with the dev and got the file (Murder Die Senpai 1-2.pdf). I'll be looking into reproducing and setting up unit tests later in the cycle. Thanks again @zeroSteiner for the initial contribution.

@majora2007 majora2007 added the PDF PDF Related label Jul 21, 2025
@majora2007 majora2007 requested a review from Copilot July 21, 2025 12:11
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the PDF metadata parser to handle cases where the Length field references an object reference instead of being a direct integer value, and improves error messaging throughout the parser. The changes include adding object reference resolution functionality and making exception messages more informative by indicating both expected and received types.

  • Adds support for resolving ObjectRef values when parsing the Length field in PDF metadata
  • Improves error messages to include the actual type received in addition to the expected type
  • Introduces a new ResolveObjectRef method to handle object reference resolution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PDF PDF Related

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants