Without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values -> rely on pdf.js??

[ill.pdf](https://github.com/modesty/pdf2json/files/14447093/ill.pdf)
![image](https://github.com/modesty/pdf2json/assets/130582247/b4a7ad54-798d-41d3-99e7-f5ac200df8c7)

PDF test files: minimal set with UTF-8 characters encoded
[ill01.pdf](https://github.com/py-pdf/pypdf/files/14459161/ill01.pdf)
[ill00.pdf](https://github.com/py-pdf/pypdf/files/14459246/ill00.pdf)

To start the text extraction, run:
`node pdf2json -cvf ill01.pdf`

**I expect true character mappings when UTF-8 characters are encoded; see the end for details.**
See the extracted text of ill01.pdf.
See the extracted text of ill00.pdf and search for terms that include 'ff' or 'ft' or "n's".

[ill01.pdf](https://github.com/modesty/pdf2json/files/14459393/ill01.pdf)
PDF file(s) that cause the issue. See top: ill01.pdf

**Content of the PDF file (seen at the end):**
```

/Encoding /Identity-H
/DescendantFonts [147 0 R]
/ToUnicode 148 0 R>>
endobj

```
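To check quickly whether a file references a ToUnicode CMap at all, the raw bytes can be scanned for `/ToUnicode` entries. This is only a minimal sketch: the helper name `tounicode_refs` is hypothetical, and the approach assumes the font dictionaries are stored uncompressed (not inside object streams), as the fragment above suggests for this file.

```python
import re

# Font dictionary fragment copied from the issue text above.
FONT_SNIPPET = b"""
/Encoding /Identity-H
/DescendantFonts [147 0 R]
/ToUnicode 148 0 R>>
endobj
"""

def tounicode_refs(pdf_bytes):
    """Return (object number, generation) pairs referenced by /ToUnicode.

    Hypothetical helper: a plain regex scan, so it only sees dictionaries
    that are not compressed inside object streams.
    """
    pattern = rb"/ToUnicode\s+(\d+)\s+(\d+)\s+R"
    return [(int(num), int(gen)) for num, gen in re.findall(pattern, pdf_bytes)]

print(tounicode_refs(FONT_SNIPPET))
```

On a real file the same function could be called as `tounicode_refs(open("ill01.pdf", "rb").read())`; an empty list would suggest that no uncompressed font dictionary carries a ToUnicode reference.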
What does the `/Encoding /Identity-H` entry tell us?
Each 2-byte character code is used directly as the CID (and, for a font with an identity CID-to-GID mapping, as the glyph index, GID), so there's **no need to remap** them.
### However, without a ToUnicode CMap, PDF viewers can't map glyphs to Unicode values. 
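To illustrate why the ToUnicode CMap matters, here is a minimal sketch of how `bfchar` and `bfrange` entries turn 2-byte Identity-H codes into Unicode text. The CMap body below is a made-up sample, not the content of object 148 in ill01.pdf, and the function names are hypothetical. The sketch handles only the simple `<lo> <hi> <dst>` form of `bfrange` and ignores the array form.

```python
import re

# Hypothetical ToUnicode CMap body; NOT extracted from ill01.pdf.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<00C0> <00660066>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"

def _utf16(hex_digits):
    # ToUnicode destinations are UTF-16BE strings written as hex.
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Build {character code: Unicode string} from bfchar/bfrange blocks."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = _utf16(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            base = _utf16(dst)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                # Consecutive codes map to consecutive final characters.
                mapping[code] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping

def decode_identity_h(raw, mapping):
    """Split raw string bytes into 2-byte codes and map each to Unicode."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Codes missing from the CMap become U+FFFD (replacement character).
    return "".join(mapping.get(code, "\ufffd") for code in codes)

cmap = parse_tounicode(SAMPLE_CMAP)
print(decode_identity_h(bytes.fromhex("002400C000030026"), cmap))
```

In this sample, code `00C0` stands for a single ligature glyph that maps to the two characters "ff", which is the kind of term mentioned above for ill00.pdf. A code that is missing from the CMap cannot be recovered from Identity-H alone and falls back to U+FFFD here.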
```
0000059775 00000 n 
0000000192 00000 n 
0000000392 00000 n 
0000000591 00000 n 
0000020809 00000 n 
0000006174 00000 n 
0000006481 00000 n 
0000006776 00000 n 
0000021054 00000 n 
0000012089 00000 n 
0000012374 00000 n 
0000012642 00000 n 

```
```
pdf2json@3.0.5 [https://github.com/modesty/pdf2json]
-------------
json2pdf-log:
Warning: Output file will be replaced - ill01.json
Info: Transcoding File ill01.pdf to - ill01.json
Info: about to load PDF file ill01.pdf
Info: Load OK: ill01.pdf
Warning: Setting up fake worker.
Info: PDF loaded. pagesCount = 1
Info: start to parse page:1
Warning: TT: complementing a missing function tail
Info: Skipped: tiny fill: 0 x 0
Info: Success: Page 1
Info: complete parsing page:1
Info: PDF parsing completed.
```

Note that both viewers tested, Chromium and Edge, are able to map the UTF-8 characters as given, whereas
**pdf.js does not** and
**pypdf does not**.


