Update the PDF Metadata Parser to Accept Lengths By ObjectRef #3933

zeroSteiner · 2025-07-19T20:09:11Z

Changed

Changed: Changed PDF parser exception messages to state what type was received in addition to what was expected

Fixed

Fixed: Fixed an issue with the PDF parser that prevented it from loading files that used a ObjectRef for the Length

Thanks!

majora2007 · 2025-07-19T20:15:57Z

Can you explain a bit more about the fix and is there a way to write a unit test to trigger?

zeroSteiner · 2025-07-19T20:54:47Z

Can you explain a bit more about the fix

Sure thing, I hope this helps. So the error I ran into was while importing multiple files in my Library. I tracked down the issue to the parser receiving an ObjectRef when it was expecting an int. To address it, I referred to some that was already in the file, specifically these two locations:

Kavita/API/Helpers/PdfMetadataExtractor.cs

Lines 1297 to 1299 in 152f7ad

    
           if (value.Type == PdfLexer.TokenType.ObjectRef) { 
        
               indirectObjects[key] = (long)value.Value; 
        
           }

Kavita/API/Helpers/PdfMetadataExtractor.cs

Lines 1317 to 1334 in 152f7ad

    
           foreach(var key in indirectObjects.Keys) { 
        
               _stream.Seek(_objectOffsets[indirectObjects[key]], SeekOrigin.Begin); 
        
               _lexer.ResetBuffer(); 
        
               token = _lexer.NextToken(); 
        
               if (token.Type != PdfLexer.TokenType.ObjectStart) { 
        
                   throw new PdfMetadataExtractorException("Expected object here"); 
        
               } 
        
               token = _lexer.NextToken(); 
        
               if (token.Type != PdfLexer.TokenType.String) { 
        
                   throw new PdfMetadataExtractorException("Expected string"); 
        
               } 
        
               _metadata[key] = (string) token.Value; 
        
           }

I implemented a private method that would use the existing stream to read the object reference using a temporary Lexer instance. Using the temporary instance ensured that the state of the primary _lexer is maintained since I didn't see a way to back up and restore it. Once the object is resolved, it's checked again to see if it's an integer and then things carry on as they did before.

As for the other log messages, I added their types because it was really useful while debugging to see not only what it was expecting but what it received instead. It's possible that there are other areas that could benefit from resolving ObjectRefs, but without PDF test cases on hand, I only updated the processing of the Length field.

and is there a way to write a unit test to trigger?

That might take me a little while. I don't see any existing tests referring to the PdfMetadataExtractor class so I'd need to do it from scratch I think and I'm not super familiar with .NET unit tests. Is there anything you could point me to as a good example or a place for me to get started?

majora2007 · 2025-07-19T21:00:43Z

BookServiceTests.SeriesFallBackToMetadataTitle might be a good starting point. It uses a pdf packed with the repo to test. You can also use https://github.com/Kareadita/Kavita/pull/3912/files#diff-416981523a2401e533a3b63d10f5153322f77bee20d25af37521780073f7f778 which has unit tests around expecting the code to throw.

It would be helpful if you could also send me (DM/email) a file that caused this issue, so I can test it out on my end as well.

Feel free to ask for help here or discord.

majora2007 · 2025-07-21T12:10:57Z

Spoke with the dev and got the file (Murder Die Senpai 1-2.pdf). I'll be looking into reproducing and setting up unit tests later in the cycle. Thanks again @zeroSteiner for the initial contribution.

Copilot

Pull Request Overview

This PR updates the PDF metadata parser to handle cases where the Length field references an object reference instead of being a direct integer value, and improves error messaging throughout the parser. The changes include adding object reference resolution functionality and making exception messages more informative by indicating both expected and received types.

Adds support for resolving ObjectRef values when parsing the Length field in PDF metadata
Improves error messages to include the actual type received in addition to the expected type
Introduces a new ResolveObjectRef method to handle object reference resolution

API/Helpers/PdfMetadataExtractor.cs

Parse pdf length object reference

016eb2a

majora2007 added this to v0.8.8 - Kavita+ / FR Jul 19, 2025

github-project-automation bot moved this to Todo in v0.8.8 - Kavita+ / FR Jul 19, 2025

majora2007 added the PDF PDF Related label Jul 21, 2025

majora2007 requested a review from Copilot July 21, 2025 12:11

Copilot AI reviewed Jul 21, 2025

View reviewed changes

API/Helpers/PdfMetadataExtractor.cs Show resolved Hide resolved

API/Helpers/PdfMetadataExtractor.cs Show resolved Hide resolved

API/Helpers/PdfMetadataExtractor.cs Show resolved Hide resolved

majora2007 added this to v0.8.9 - Kavita+ / Journaling Progress and Kavita Sep 28, 2025

majora2007 removed this from v0.8.8 - Kavita+ / FR Sep 28, 2025

github-project-automation bot moved this to In progress in Kavita Sep 28, 2025

github-project-automation bot moved this to Todo in v0.8.9 - Kavita+ / Journaling Progress Sep 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Update the PDF Metadata Parser to Accept Lengths By ObjectRef #3933

Update the PDF Metadata Parser to Accept Lengths By ObjectRef #3933

Uh oh!

zeroSteiner commented Jul 19, 2025 •

edited

Loading

Uh oh!

majora2007 commented Jul 19, 2025

Uh oh!

zeroSteiner commented Jul 19, 2025 •

edited

Loading

Uh oh!

majora2007 commented Jul 19, 2025

Uh oh!

majora2007 commented Jul 21, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Update the PDF Metadata Parser to Accept Lengths By ObjectRef #3933

Are you sure you want to change the base?

Update the PDF Metadata Parser to Accept Lengths By ObjectRef #3933

Uh oh!

Conversation

zeroSteiner commented Jul 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changed

Fixed

Uh oh!

majora2007 commented Jul 19, 2025

Uh oh!

zeroSteiner commented Jul 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

majora2007 commented Jul 19, 2025

Uh oh!

majora2007 commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zeroSteiner commented Jul 19, 2025 •

edited

Loading

zeroSteiner commented Jul 19, 2025 •

edited

Loading

majora2007 commented Jul 21, 2025 •

edited

Loading