openpreserve / jhove

File validation and characterisation.

Home Page: http://jhove.openpreservation.org

License: Other

Perl 0.06% Java 92.36% Shell 5.72% HTML 1.46% Batchfile 0.35% Dockerfile 0.05%

digital-preservation format-validation

jhove's Issues

Make JHove available to Maven Central

The README.md says "The 1.14 release artifacts will be published to Maven central", but I couldn't find them there.

CrossRefStream incorrectly assumes /Index value is a 2 element array

The "isValid" method of CrossRefStream is hard coded to assume that an Index element, if present, is an array of exactly 2 integers. According to the specification, the Index element is "an array containing a pair of integers for each subsection in this section." (http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference15_v6.pdf, page 83 and PDF versions 1.6 and 1.7) Documents that have more than one subsection fail validation as a result. PdfModule.readXRefStreams appears to incorporate this assumption and does not validate object numbers against the actual object ranges specified in the index, but instead looks for numbers between 0 and the (meaningless?) value of CrossRefStream.getNumObjects()
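
The pair-per-subsection rule quoted above can be sketched as follows. This is a minimal illustration, not JHOVE code: the class `XRefIndex` and its methods are hypothetical names, and the sketch only models the /Index array itself, not the rest of the cross-reference stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JHOVE: models the /Index entry of a
// cross-reference stream as a list of (firstObject, count) subsections.
final class XRefIndex {
    private final List<long[]> subsections = new ArrayList<>();

    /** Accepts any non-empty, even-length array of non-negative integers. */
    XRefIndex(long[] index) {
        if (index.length == 0 || index.length % 2 != 0) {
            throw new IllegalArgumentException("/Index must hold pairs of integers");
        }
        for (int i = 0; i < index.length; i += 2) {
            if (index[i] < 0 || index[i + 1] < 0) {
                throw new IllegalArgumentException("/Index values must not be negative");
            }
            subsections.add(new long[] {index[i], index[i + 1]});
        }
    }

    /** Total number of entries across all subsections. */
    long entryCount() {
        long total = 0;
        for (long[] s : subsections) {
            total += s[1];
        }
        return total;
    }

    /** True if objNum falls inside one of the declared subsections. */
    boolean contains(long objNum) {
        for (long[] s : subsections) {
            if (objNum >= s[0] && objNum < s[0] + s[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With an index of `[0 3 10 2]`, objects 0 to 2 and 10 to 11 are covered, so a check against a single range starting at 0 would misjudge object 10.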

Whitespace in name of target file

Is there a way to work on files with whitespace in their names?
e.g. on Linux

touch 'test file.zip'
jhove test\ file.zip # or even jhove 'test file.zip', I tried it in many different ways

The output from jhove is (it thinks I pass 2 files):

Jhove (Rel. 1.15.0-SNAPSHOT, 2016-08-29)
 Date: 2016-08-29 13:11:25 CEST
 RepresentationInformation: test\
  Status: Not well-formed
  ErrorMessage: file not found
 RepresentationInformation: file.zip
  Status: Not well-formed
  ErrorMessage: file not found

The only way it worked for me was to use wildcard

jhove 'test*file.zip'

Then I get the Output:

Jhove (Rel. 1.15.0-SNAPSHOT, 2016-08-29)
 Date: 2016-08-29 13:11:44 CEST
 RepresentationInformation: test file.zip
  ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
  LastModified: 2016-08-29 13:08:11 CEST
  Size: 0
  Format: bytestream
  Status: Well-Formed and valid
  SignatureMatches:
   WARC-kb
   GZIP-kb
  InfoMessage: Zero-length file
  MIMEtype: application/octet-stream

BUT THAT IS TOO DIRTY!!!

Is it possible to fix this, or is there another usable way to run jhove without such problems?

Review JHOVE for beginners guide

Address outstanding comments and add further information to the new JHOVE user guide: https://drive.google.com/open?id=1or8P5hI_BChnc1itr0KV2qOCZfdyBtRX6QMT0ucJzvU

Java exception Mac

Running jhove in command line on Mac gives the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: JHOVE
Caused by: java.lang.ClassNotFoundException: JHOVE
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

ICC Profile extraction

Hi,
I'm using JHove 1.9 and I want to validate the ICC profile of a TIFF image created by a scanner. Can JHove extract this information and add it to the metadata?
I don't know the TIFF standard very well. I want to extract CPP-CS-2012-1498 from the following data:

00000500 00 07 57 2c 00 07 f9 01 43 49 45 44 00 07 57 2c |..W,....CIED..W,|
00000510 00 07 f9 01 64 65 73 63 00 00 00 00 00 00 00 11 |....desc........|
00000520 43 50 50 2d 43 53 2d 32 30 31 32 2d 31 34 39 38 |CPP-CS-2012-1498|

Could you help me please ?

@gmcgath commented on SourceForge:

ICC Profile validation would certainly be a useful thing for JHOVE to do. It's a significant task, though, so it's not likely to happen unless it gets funding from somewhere.
If anyone does want to undertake this project, of course, they're welcome to contribute. JHOVE is open source, after all.

PDF 1.7

The PDF module currently doesn't support PDF 1.7 / ISO 32000. It would be very desirable to update it to 1.7.

JPEG/Exif image incorrectly "Not well-formed"

When a JPEG file begins with an APP1 segment and has an APP0 segment after it, the file is declared "Not well-formed" with the following message: "JFIF APP0 marker not at beginning of file"

Even though the file is indeed not conformant with the JPEG JFIF standard, it still conforms to the JPEG/Exif standard (also known as JEITA CP-3451), which Jhove is supposed to handle as specified in http://jhove.openpreservation.org/modules/jpeg/
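
To make the distinction concrete, here is a minimal sketch that looks at the first segment after the SOI marker (FF D8) and reports whether it is APP0 (FF E0, used by JFIF) or APP1 (FF E1, used by Exif). The class and method names are hypothetical and this is not how the JHOVE JPEG module is structured; it only illustrates that both orderings can be recognised.

```java
// Hypothetical sketch, not JHOVE code: report which application segment
// comes first after the JPEG SOI marker.
final class JpegFirstSegment {
    static String classify(byte[] head) {
        // A JPEG stream starts with the SOI marker FF D8.
        if (head.length < 4
                || (head[0] & 0xFF) != 0xFF
                || (head[1] & 0xFF) != 0xD8) {
            return "not JPEG";
        }
        // Every following marker starts with FF.
        if ((head[2] & 0xFF) != 0xFF) {
            return "unknown";
        }
        switch (head[3] & 0xFF) {
            case 0xE0:
                return "APP0 first"; // layout expected for JFIF
            case 0xE1:
                return "APP1 first"; // layout expected for Exif
            default:
                return "other";
        }
    }
}
```

A validator that accepts both layouts would then check the segment identifier ("JFIF" or "Exif") rather than reject the file outright when APP0 is not first.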

The following file shows this behaviour
20150213_140637.zip

WaveModule->FormatChunk: ArrayIndexOutOfBoundsException

Class: package edu.harvard.hul.ois.jhove.module.wave.FormatChunk

The setting of "compName" in readChunk() will give an ArrayIndexOutOfBoundsException for all "compressionCode" values greater than the length of WaveStrings.COMPRESSION_INDEX[] (e.g. 0xfffe).

Also, the calculated value will be wrong for all values of "compressionCode" that are greater than 0xB.
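
One way to avoid both problems is to search for the code instead of using it as an array index. The sketch below is an assumption about a possible fix, not the actual JHOVE change: `CompressionNames` is a hypothetical class, and its tables hold only a small illustrative subset of WAVE format codes rather than the real contents of WaveStrings.

```java
// Hypothetical sketch: map a WAVE compression code to a name by searching
// a code table, so unknown or large codes cannot index past the array.
final class CompressionNames {
    // Illustrative subset only; NOT the actual contents of WaveStrings.
    private static final int[] CODES = {0x0001, 0x0003, 0x0006, 0x0007, 0xFFFE};
    private static final String[] NAMES =
            {"PCM", "IEEE float", "A-law", "mu-law", "Extensible"};

    static String nameFor(int compressionCode) {
        for (int i = 0; i < CODES.length; i++) {
            if (CODES[i] == compressionCode) {
                return NAMES[i];
            }
        }
        // Fall back to a descriptive string instead of throwing.
        return String.format("Unknown (0x%04X)", compressionCode);
    }
}
```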

Maven build fails during Tests: parseInvalidWarcFileLonelyMonkeys

Failure in unit test assertions with expected not matching actual values.
Maven output attached
builderror.txt

Issues with JPEG2000 validation

Hello Folks,
sample tests have shown that JHOVE cannot cope with certain JPEG2000 files.
When the JPEG2000 module is selected, the JHOVE GUI version does not show any findings. For the JHOVE library, the error is a java.io.EOFException at the code line: jb.process(app, module, handler, files.get(i).toString());

I have example files:
The [jpylyzer test files] https://github.com/openpreserve/jpylyzer-test-files/blob/master/bitwiser-icc-corrupted-tagcount-1951.jp2 do not work with JHOVE.

The JPEG2000 files from the [Google image test suite] https://drive.google.com/file/d/0B9lJIDXo2oPYZlNnVnRKRFdwVDg/edit do work with JHOVE.

I have not yet found the difference between the two. Jpylyzer can cope with all of them so far.
Best, Yvonne

ReleaseDetailsTest.java fails when compiling during standard time (NZST)

Likely because the daylight-time name of the time zone is explicitly requested here. As per https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getDisplayName(boolean,%20int), passing true as the first argument returns the daylight saving time name:

TimeZone.getDefault().getDisplayName(true, TimeZone.SHORT)

Affected code here:

jhove/jhove-core/src/test/java/org/openpreservation/jhove/ReleaseDetailsTest.java

Line 70 in 1ca1f48

 assertEquals("ReleaseDetails [version=0.1.2-TESTER, buildDate=Sun Jul 31 00:00:00 " + TimeZone.getDefault().getDisplayName(true, TimeZone.SHORT) + " 2011]", instance.toString()); 
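
A possible way to make the expected string independent of the season is to ask the zone whether the build date itself falls in daylight time. The helper below is a sketch with a hypothetical name (`ZoneLabel.shortNameOn`); it assumes the expected value is compared against `Date.toString()`, which in OpenJDK prints the short zone name for the date being formatted using `Locale.US`.

```java
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for the test: pick the standard or daylight name
// according to the date being printed, not a hard-coded flag.
final class ZoneLabel {
    /** Short zone name as it applies on the given date. */
    static String shortNameOn(TimeZone tz, Date date) {
        return tz.getDisplayName(tz.inDaylightTime(date), TimeZone.SHORT, Locale.US);
    }
}
```

The assertion could then build its expected string with `ZoneLabel.shortNameOn(TimeZone.getDefault(), buildDate)` in place of the hard-coded `true`.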

JHOVE reporting PDF as v1.3 and as ISO PDF/A-1, Level B

Siegfried reports the file to be PDF v1.3 and not pdf/a.

JHOVE output snippet:

Jhove (Rel. 1.12.48, 2016-05-12)
 Date: 2016-08-08 10:09:34 BST
 RepresentationInformation: c:\Users\pmay\Downloads\281474990846918.pdf
  ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
  LastModified: 2016-08-05 14:49:21 BST
  Size: 205146
  Format: PDF
  Version: 1.3
  Status: Well-Formed, but not valid
  SignatureMatches:
   PDF-hul
  ErrorMessage: <snip...>
  MIMEtype: application/pdf
  Profile: ISO PDF/A-1, Level B
  PDFMetadata: <snip...>

ICCProfiles in JPEG are not extracted

When a JPEG file embeds an ICC profile in an APP2 data segment, this information doesn't appear in the associated MIX information (NisoImageMetadata): the information is supposed to be located in the IccProfile element.

Such an ICC profile can be validated by attempting to construct a java.awt.color.ICC_Profile with the getInstance() method in Java.
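
The validation idea mentioned above can be sketched like this. The class name `IccCheck` is hypothetical, and the sketch assumes the profile bytes have already been extracted from the APP2 segment (and reassembled if the profile spans several segments); it shows only the `ICC_Profile.getInstance(byte[])` check, not the extraction.

```java
import java.awt.color.CMMException;
import java.awt.color.ICC_Profile;

// Hypothetical sketch: treat profile bytes as valid if the JDK's colour
// management code can build an ICC_Profile from them.
final class IccCheck {
    static boolean looksLikeValidProfile(byte[] data) {
        try {
            ICC_Profile.getInstance(data);
            return true;
        } catch (IllegalArgumentException | CMMException e) {
            // getInstance rejects data that is not a valid ICC profile.
            return false;
        }
    }
}
```

This is a coarse check: it confirms that the JDK accepts the data as a profile, not that the profile is appropriate for the image.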

Installer expired warning directs to wrong web address

The warning that the installer has expired directs you to http://jhove.openpreserve.org.

This should be http://jhove.openpreservation.org.

Temp directory should never default to current directory, e.g. ".".

If no temporary directory is specified in the configuration file then this line sets it to the current directory. Using the JVM default might be a better idea.
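
A minimal sketch of the suggested fallback, assuming the configured value arrives as a possibly null or empty string. `TempDirResolver` is a hypothetical name; `java.io.tmpdir` is the standard system property that holds the JVM's default temporary directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: prefer the configured directory, otherwise fall
// back to the JVM default instead of the current directory.
final class TempDirResolver {
    static Path resolve(String configured) {
        if (configured != null && !configured.trim().isEmpty()) {
            return Paths.get(configured.trim());
        }
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }
}
```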

Empty message body in the validation error

JHOVE reports a status of "Well-Formed, but not valid" when processing this PDF: http://www.fcla.edu/daitss-test/files/Zheng_Liping_200512_PHD.pdf

<size>3649429</size>
<format>PDF</format>
<version>1.5</version>
<status>Well-Formed, but not valid</status>
<sigMatch>
  <module>PDF-hul</module>
</sigMatch>
<messages>
  <message offset="2098097" severity="error"></message>
 <message offset="2098153" severity="error"></message>
</messages>

Shouldn't there be a message body indicating what the validation error is?

@gmcgath 02-06-2013:

I've checked in a new version of PdfModule.java that fixes the problem. addDestination was failing to check whether it could safely get a page object number, and throwing a NullPointerException when it couldn't. The handler was assuming there would be a message string in the exception. I've fixed it on both of these points. For now it can be built with the updated source code; it should be fixed in JHOVE 1.10, whenever that happens.
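
The second half of that fix, not assuming that an exception carries a message string, can be illustrated as follows. This is a sketch with a hypothetical helper name (`ErrorText.describe`), not the code that was checked in.

```java
// Hypothetical sketch: never emit an empty message body for an exception.
final class ErrorText {
    static String describe(Throwable t) {
        String msg = t.getMessage();
        if (msg == null || msg.isEmpty()) {
            // Exceptions created without a message have a null getMessage();
            // fall back to the exception class name.
            return t.getClass().getName();
        }
        return msg;
    }
}
```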

Link to binary distributions hidden on website

Dev Effort

0.5D

Description

Both the readme and the JHOVE website repeatedly refer to the "JHOVE Distribution" (i.e. compiled JARs), but no direct link is provided! Link should point to e.g.:

https://github.com/openpreserve/jhove/releases/tag/v1.11

(Actually there's a link on the website but it's not easy to find)

Runs out of Java heap space while JHOVE processes a specific PDF against tag profiles

Dev Effort

Description

While JHOVE processes this PDF, http://www.fcla.edu/daitss-test/files/01471-213X-12-33-S2.pdf, it runs out of Java heap space. Is there an infinite loop during tag profile checking?

./jhove -c conf/jhove.conf -m pdf-hul ~/Workspace/describe/01471-213X-12-33-S2.pdf 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:68)
at edu.harvard.hul.ois.jhove.module.pdf.Tokenizer.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.isStructElem(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)


at edu.harvard.hul.ois.jhove.module.pdf.StructureElement.buildSubtree(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureTree.getChildren(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.StructureTree.<init>(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.TaggedProfile.satisfiesThisProfile(Unknown Source)
at edu.harvard.hul.ois.jhove.module.pdf.PdfProfile.satisfiesProfile(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)

Stefan Hein - 29-04-2013:

I have the same problem with the following pdf file: http://docserv.uni-duesseldorf.de/servlets/DerivateServlet/Derivate-25614 Are there any updates to that issue?

@gmcgath 29-04-2013:

I'm working on this issue. One user showed that with huge amounts of patience and memory, at least some PDF files that appear to be in an infinite loop are completed after several hours. The StructureTree object can take a huge amount of memory for some files, but once it's built only a couple of flags that were set during its construction are checked. This suggests that the whole tree doesn't have to be in memory at once. I hope to have a fix that takes this into account before too long.
Thanks for the additional example. I'll use it in testing.

Stefan Hein 20-08-2016:

Unfortunately this reported bug still exists in JHOVE 1.10.

Enhance JHOVE's checksumming capabilities

Dev Effort

See PR: #386

Description

The digest algorithms currently supported by JHOVE are:

CRC32
MD5
SHA1

Java provides native support for these additional algorithms:

SHA256
SHA384
SHA512

These could be added quite easily but this would also require a change to JHOVE's config to allow the user to select the algorithms they wanted to use.
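
As a sketch of how a user-selected list of algorithms could be applied, the helper below computes hex digests with `java.security.MessageDigest`. The class name `Digests` and the idea of passing the names as a list are assumptions for illustration; the JDK names for the algorithms listed above are "MD5", "SHA-1", "SHA-256", "SHA-384" and "SHA-512". CRC32 is not a MessageDigest algorithm (it lives in `java.util.zip.CRC32`) and is left out here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute one hex digest per requested algorithm.
final class Digests {
    static Map<String, String> compute(byte[] data, List<String> algorithms) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String algorithm : algorithms) {
            try {
                byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));
                }
                result.put(algorithm, hex.toString());
            } catch (NoSuchAlgorithmException e) {
                // Surface a bad configuration value as an unchecked error.
                throw new IllegalArgumentException("Unsupported digest: " + algorithm, e);
            }
        }
        return result;
    }
}
```

For large files a real implementation would feed the stream through `MessageDigest.update` in chunks rather than hold the whole file in a byte array.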

Problem with PDF annotation dictionaries

A file from the Open Planets Foundation format corpus, simple-annotated-in-adobe-x.pdf, is reported as well-formed but not valid, with the not very informative message "Invalid annotations." Setting breakpoints reveals that where an array is expected for the "Annots" array of annotation dictionaries, a keyword is being found instead. I can't immediately figure out why this is. Even if it's not in accordance with the spec, it's an Adobe-generated file.

Further comment from @gmcgath :

File from format corpus:
simple-annotated-in-adobe-x.pdf

and again from @gmcgath

A similar problem exists in the same file with the "Names" dictionary. This looks like an underlying feature of PDF that I've overlooked.

and again from @gmcgath 22-05-2013

I've posted a question at http://superuser.com/questions/589207/can-a-keyword-be-in-a-pdf-annots-array to see if anyone can explain what's going on. So far there have been no answers.

Please attach sources and javadoc to maven artifacts

When using JHove as a library it is nice if the sources are downloadable for an IDE to use (possibly also javadoc but that is not quite as important).

This must be explicitly configured in the pom.xml. See https://maven.apache.org/plugin-developers/cookbook/attach-source-javadoc-artifacts.html for instructions.

java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary

I'm getting an exception when running jhove on a PDF. This happens rarely.

jhove$ ./jhove -c ../cular/ingest/target/classes/jhove.conf ~/fulltext.pdf 
Sep 16, 2015 1:21:25 PM edu.harvard.hul.ois.jhove.JhoveBase init
SEVERE: Testing SEVERE level
java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)
Jhove (Rel. 1.11, 2013-09-29)
Date: 2015-09-16 13:21:26 EDT
RepresentationInformation: /users/bdc34/fulltext.pdf
ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
LastModified: 2015-09-16 13:08:10 EDT
Size: 938845
Format: bytestream
Status: Well-Formed and valid
SignatureMatches:
PDF-hul
MIMEtype: application/octet-stream

XMLHandler outputs wrong bitsPerSample in MIX

When using XMLHandler, JHOVE creates an incorrect number (2) of mix:bitsPerSampleValue MIX elements:

<mix:ImageColorEncoding>
  <mix:BitsPerSample>
    <mix:bitsPerSampleValue>8</mix:bitsPerSampleValue>
    <mix:bitsPerSampleValue>8</mix:bitsPerSampleValue>
    <mix:bitsPerSampleUnit>integer</mix:bitsPerSampleUnit>
  </mix:BitsPerSample>
  <mix:samplesPerPixel>3</mix:samplesPerPixel>
</mix:ImageColorEncoding>

The attached patch fixes iteration over the array of bitsPerSampleValue starting from 0. The accompanying patch is available from SourceForge: https://sourceforge.net/p/jhove/patches/_discuss/thread/5d7c7155/77d4/attachment/XmlHandler.patch
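
The patch itself is not reproduced here, so the following is only a sketch of the described behaviour, under the assumption that the missing element came from a loop that started at index 1. `BitsPerSampleXml` is a hypothetical class; it emits one mix:bitsPerSampleValue element per array entry, starting from index 0.

```java
// Hypothetical sketch: render every bitsPerSample entry, starting at 0.
final class BitsPerSampleXml {
    static String render(int[] bitsPerSample) {
        StringBuilder sb = new StringBuilder("<mix:BitsPerSample>\n");
        for (int i = 0; i < bitsPerSample.length; i++) {
            sb.append("  <mix:bitsPerSampleValue>")
              .append(bitsPerSample[i])
              .append("</mix:bitsPerSampleValue>\n");
        }
        sb.append("  <mix:bitsPerSampleUnit>integer</mix:bitsPerSampleUnit>\n");
        sb.append("</mix:BitsPerSample>");
        return sb.toString();
    }
}
```

For a three-sample image such as the one above, `render(new int[] {8, 8, 8})` yields three value elements, matching mix:samplesPerPixel.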

Why are there multiple pull requests that are not merged?

Example of using jhove within another java application

Hello,

I am working on incorporating jhove into another application, however there doesn't seem to be any documentation on how to do so. Could you please point me to some examples of how to validate the various module types in my own java application?

Thanks

Use of xsi:type in AES output

Dev Effort

Description

Both the WAVE and AIFF modules embed audio metadata in AES format without providing a schema. One of the produced elements makes use of xsi:type: <tcf:filmFraming tcf:framing="NOT_APPLICABLE" xsi:type="tcf:ntscFilmFramingType"/>.

Because the JHOVE schema does not validate embedded XML (processContents="skip"), the use of xsi:type does not cause a problem. However, the METS and PREMIS schemas will validate embedded XML if a sufficient definition is available (processContents="lax").

When we import this element into a PREMIS document, it is not valid because xsi:type references a Type Definition (http://www.w3.org/TR/xmlschema-1/#xsi_type), so explicit type validation is attempted.

The type tcf:ntscFilmFramingType cannot be resolved and causes validation to fail.
Looking into aes.org, we cannot find a schema describing the element in the namespace: http://www.aes.org/tcf.

It appears the AES X098B schema is not publicly available yet (according to Gary).

add Audit handler output explanation to documentation.

Suggesting someone write an explanation of the output from the Audit handler and add it to the documentation.

ICCProfiles in TIFF files are not extracted

When a TIFF file embeds an ICC profile in a TIFFTAG_ICCPROFILE tag (code 34675), this information doesn't appear in the associated MIX information (NisoImageMetadata): the information is supposed to be located in the IccProfile element.

Such an ICC profile can be validated by attempting to construct a java.awt.color.ICC_Profile with the getInstance() method in Java.

Why is jhoveHome needed?

Dev Effort

0.5D investigation

Description

Before running jhove one needs to set jhoveHome in the configuration file. I don't really understand why this variable is needed, since the relative locations of the launcher scripts and JARs are always fixed. So I think all launch scripts and JARs should be able to 'know' their dependencies without any user input. (Also, on each re-install or update the config gets overwritten by the default values, which makes things unnecessarily complex for a user.)
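
One way a launcher could work out its home directory without configuration is to start from the location of its own JAR. The sketch below is an illustration under an assumed layout of `<home>/bin/<jar>`, which may not match the real JHOVE distribution; `HomeLocator` and its methods are hypothetical names.

```java
import java.net.URISyntaxException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.CodeSource;

// Hypothetical sketch: derive the installation directory from the JAR path.
final class HomeLocator {
    /** Assumes the layout <home>/bin/<jar> and returns <home>. */
    static Path homeFromJar(Path jarFile) {
        Path binDir = jarFile.toAbsolutePath().getParent();
        if (binDir == null || binDir.getParent() == null) {
            return binDir;
        }
        return binDir.getParent();
    }

    /** JAR or class directory this class was loaded from, or null if unknown. */
    static Path ownLocation() {
        try {
            CodeSource src = HomeLocator.class.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return null;
            }
            return Paths.get(src.getLocation().toURI());
        } catch (URISyntaxException | RuntimeException e) {
            // Non-file locations (for example in-memory class loaders) end up here.
            return null;
        }
    }
}
```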

edu.harvard.hul.ois.jhove.ModuleBase: skipBytes() might not skip all requested bytes

For large WAVE files (>100MB) it can happen that not all bytes in the data chunk are skipped as expected in the method:

public long skipBytes(DataInputStream stream, long bytesToSkip, ModuleBase counted)

This seems to be because the call:

long n = stream.skip(bytesToSkip);

might actually skip fewer bytes than requested (this is also stated in the Java documentation). If this occurs it will most probably cause the parsing of the WAVE file to fail, since the pointer to the next chunk will be placed inside the data chunk.

To avoid this problem the "long n = stream.skip(bytesToSkip);" call could be placed inside a loop that continues until all the desired bytes are skipped, or no more bytes can be skipped (i.e. n=0).
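
The proposed loop can be sketched as below. `SkipFully` is a hypothetical standalone helper rather than a patch to `ModuleBase.skipBytes`, and it goes one step beyond the suggestion: because `skip()` may return 0 without the stream being at its end, the sketch probes with `read()` before giving up.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: keep skipping until the requested count is reached
// or the stream is exhausted.
final class SkipFully {
    /** Returns the number of bytes actually skipped. */
    static long skipFully(InputStream in, long bytesToSkip) throws IOException {
        long remaining = bytesToSkip;
        while (remaining > 0) {
            long n = in.skip(remaining);
            if (n <= 0) {
                // skip() made no progress: read one byte to tell a slow
                // stream apart from end of stream.
                if (in.read() == -1) {
                    break;
                }
                n = 1;
            }
            remaining -= n;
        }
        return bytesToSkip - remaining;
    }
}
```

A caller can compare the return value with the requested count and report a truncated chunk when they differ.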

incorrect validity report on image

I tested JHOVE (1.11) with an invalid image (https://bitbucket.org/tdar/tdar.src/src/9c2656809786e6a8730e57e3b71333b9aa5258fd/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif?at=default ). It won't open in Photoshop or Preview, and tiffinfo (libtiff) and identify (ImageMagick) both read it as invalid, but jhove seems to report it as "well formed and valid":

[abrin@dev jhove]$ ./jhove ~tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif 
Mar 15, 2016 9:02:57 AM edu.harvard.hul.ois.jhove.JhoveBase init
SEVERE: Testing SEVERE level
Jhove (Rel. 1.11, 2013-09-29)
 Date: 2016-03-15 09:02:57 MST
 RepresentationInformation: /home/tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif
  ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
  LastModified: 2015-08-30 16:06:33 MST
  Size: 50496
  Format: bytestream
  Status: Well-Formed and valid
  MIMEtype: application/octet-stream


[abrin@dev jhove]$ tiffinfo ~tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif 
/home/tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif: Not a TIFF or MDI file, bad magic number 24909 (0x614d).
[abrin@dev jhove]$ identify ~tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif 
identify.im6: Not a TIFF or MDI file, bad magic number 24909 (0x614d). `/home/tdar/tdar.src/test-resources/src/main/resources/images/sample_image_formats/grandcanyon_lzw_corrupt.tif' @ error/tiff.c/TIFFErrors/508.
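
The "bad magic number" reported by tiffinfo and identify refers to the first bytes of the file. A classic TIFF starts with "II" followed by 42 in little-endian order, or "MM" followed by 42 in big-endian order; the value 0x614d corresponds to a file that begins with the bytes 4D 61. The check can be sketched as follows (hypothetical class name, classic TIFF only, BigTIFF not covered):

```java
// Hypothetical sketch: test the 4-byte header of a classic TIFF file.
final class TiffMagic {
    static boolean hasTiffHeader(byte[] head) {
        if (head.length < 4) {
            return false;
        }
        // Little-endian: 'I' 'I' 0x2A 0x00
        boolean littleEndian =
                head[0] == 'I' && head[1] == 'I' && head[2] == 42 && head[3] == 0;
        // Big-endian: 'M' 'M' 0x00 0x2A
        boolean bigEndian =
                head[0] == 'M' && head[1] == 'M' && head[2] == 0 && head[3] == 42;
        return littleEndian || bigEndian;
    }
}
```

Note that the JHOVE output above comes from the BYTESTREAM module, which accepts any byte sequence, so "Well-Formed and valid" there does not mean the file passed TIFF validation.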

Installer has expired!

Trying to run the latest installer gives the message:
"This installer has expired. Please download a new one from http://jhove.openpreserve.org"

Running on Win7, 64bit.

jhove installer: more information in window of step 1

The step 1 info ("Please read the following information:") currently only includes the jhove logo and version number. Would be beneficial to have some more info here (such as github link - or a pointer towards the option to save the installation process into an auto-install script at the end).

Optimise packaging of shell and batch execution scripts.

The IzPack installer in the jhove-installer module currently batches the execution scripts in a single directory. The installer configuration then copies and templates the files one by one.

Split the script files into OS-specific directories, e.g. src\main\scripts\windows, etc., then use IzPack's filesets to copy and template by OS-specific batch.

Correct JhoveView version number currently, 1.12.48 (2016-05-12)

Just a minor thing, the JhoveView version number does not seem to be in alignment with the main application:

Additions to JPEG2000 MIX Output

Dev Effort

Description

I have two feature requests for the JPEG2000 module's MIX output that I think would be useful additions:

The default display resolution of the JPEG2000 file reported in the MIX output, preferably as dpi.
<jhove:property>
  <jhove:name>DefaultDisplayResolution</jhove:name>
  <jhove:values arity="Array" type="Property">
    <jhove:property>
      <jhove:name>HorizResolution</jhove:name>
      <jhove:values arity="List" type="Property">
        <jhove:property>
          <jhove:name>Numerator</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>3870</jhove:value>
          </jhove:values>
        </jhove:property>
        <jhove:property>
          <jhove:name>Denominator</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>32768</jhove:value>
          </jhove:values>
        </jhove:property>
        <jhove:property>
          <jhove:name>Exponent</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>5</jhove:value>
          </jhove:values>
        </jhove:property>
      </jhove:values>
    </jhove:property>
    <jhove:property>
      <jhove:name>VertResolution</jhove:name>
      <jhove:values arity="List" type="Property">
        <jhove:property>
          <jhove:name>Numerator</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>3870</jhove:value>
          </jhove:values>
        </jhove:property>
        <jhove:property>
          <jhove:name>Denominator</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>32768</jhove:value>
          </jhove:values>
        </jhove:property>
        <jhove:property>
          <jhove:name>Exponent</jhove:name>
          <jhove:values arity="Scalar" type="Integer">
            <jhove:value>5</jhove:value>
          </jhove:values>
        </jhove:property>
      </jhove:values>
    </jhove:property>
  </jhove:values>
</jhove:property>

<mix:SpatialMetrics>
  <mix:samplingFrequencyUnit>in.</mix:samplingFrequencyUnit>
  <mix:xSamplingFrequency>
    <mix:numerator>300</mix:numerator>
    <mix:denominator>1</mix:denominator>
  </mix:xSamplingFrequency>
  <mix:ySamplingFrequency>
    <mix:numerator>300</mix:numerator>
    <mix:denominator>1</mix:denominator>
  </mix:ySamplingFrequency>
</mix:SpatialMetrics>

MIX output of the used compression scheme Lossy / Lossless, like:

<mix:Compression>
  <mix:compressionScheme>JPEG 2000 Lossless</mix:compressionScheme>
</mix:Compression>

JHOVE Incorrectly reading beyond RIFF 'data' Chunk ID and calling it invalid...

I have received a 2GB wav file that I'm having difficulty validating in JHOVE. The tool tells me that I have an invalid character within a CHUNK ID.

Analyzing the file, however, it seems that JHOVE is reading beyond the CHUNK ID and returning an invalid result.

52 49 46 46 
Chunk ID: 'RIFF'

F8 DE A0 84 
Chunk Size: ~2GB

57 41 56 45 
Format: 'WAVE'

66 6D 74 20 
Sub Chunk 1 ID: 'fmt'

10 00 00 00 
Sub Chunk 1 Size: 16

01 00 
Audio Format: WAVE_FORMAT_PCM

01 00 
Number of Channels: 1

00 77 01 00 
Sample Rate: 96000

00 65 04 00 
Byte Rate: 288,000

03 00 
Block Align: 3

18 00 
Bits per sample: 24-bits

64 61 74 61 
Sub Chunk 2 ID: 'data'

80 C6 A0 84 
Sub Chunk 2 Size: ~2GB (0x84A0C680 = 2,225,129,088 bytes)

A7 *05 00* 70 04 00 6E F6 FF E9 FC FF F7 F4 FF B5 24 00
... data / payload ...

The error message seems to be returned from this part of the code:

https://github.com/gmcgath/jhove/blob/0dc774d98efa8c7581fe1602c3f6e713f499201d/src/main/java/edu/harvard/hul/ois/jhove/module/iff/ChunkHeader.java#L53

The byte causing the first issue is 0x05 at offset 46, I've starred offset 46 and 47. See also the screenshot.

The screenshot has been generated by looking at the following snippet from the 2GB file:

52 49 46 46 F8 DE A0 84 57 41 56 45 66 6D 74 20 10 00 00 00 01 00 01 00 00
77 01 00 00 65 04 00 03 00 18 00 64 61 74 61 80 C6 A0 84 A7 05 00 70 04 00 
6E F6 FF E9 FC FF F7 F4 FF B5 24 00 F2 FC FF 88 FC FF 2C E8 FF 1B 08 00 74 
03 00 26 EE FF 20 F6 FF 86 F6 FF 33 01 00 5F F3 FF C0 FC FF 47

The analysis shows that 0x05 is no longer in the CHUNK ID, nor is the preceding byte 0x00, which will also show up in error if one artificially turns 0x05 into a byte greater than 0x32.
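
The header walk-through above can be reproduced with a few lines of code. The sketch below uses hypothetical helper names and does not diagnose the JHOVE bug; it only decodes the bytes from the snippet. One point worth noting, stated here as an observation rather than a confirmed cause: both size fields exceed Integer.MAX_VALUE, so they turn negative if read into a signed 32-bit int.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decode RIFF chunk IDs and sizes from a byte buffer.
final class RiffHeader {
    /** Reads a 32-bit little-endian value as an unsigned number in a long. */
    static long readUInt32LE(byte[] buf, int offset) {
        return (buf[offset] & 0xFFL)
                | (buf[offset + 1] & 0xFFL) << 8
                | (buf[offset + 2] & 0xFFL) << 16
                | (buf[offset + 3] & 0xFFL) << 24;
    }

    /** Four-character chunk ID as ASCII text. */
    static String chunkId(byte[] buf, int offset) {
        return new String(buf, offset, 4, StandardCharsets.US_ASCII);
    }
}
```

Applied to the snippet, `chunkId(buf, 36)` gives "data" and `readUInt32LE(buf, 40)` gives 0x84A0C680, so the sample payload starts at zero-based offset 44.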

Screenshot:

JHOVE Version: 1.11
Java: 1.7
Platform: Windows XP SP3
Creating Application (WAV): Adobe Audition CS6 (Macintosh)

Create new PDF/A validation module for JHOVE based on the veraPDF library.

Enable generation of textMD property for text files

Here is a patch against version 1.4 so that Jhove can generate a property conformant with the textMD schema (see http://www.loc.gov/standards/textMD) for textual files.
The initial thought was to apply a simple XSLT transform to the output of jhove in order to generate this information, but this doesn't work well because:

not all the needed information is generated by jhove, or the output information is already bundled, and
the correct management of the charset and the language needs to be programmatically verified.

This patch modifies 4 modules:

ASCII-hul
UTF8-hul
HTML-hul
XML-hul (the version number has been modified appropriately).

A parameter withTextMD=true activates the generation of the property for each module (see jhove-withTextMD.conf for an example).
The default is not to generate it, so that the behaviour is the same as before.
I added the determination of the line ending in HTML and XML to be able to generate the required element:

there is no performance penalty since the stream classes have been modified using the same algorithm as the one in the ASCII module.
I decided NOT to add a TextMDMetadata property type so that the schema jhove.xsd will be unchanged.

So the TextMDMetadata property is of OBJECT type.
The TextHandler and XmlHandler are modified to generate the information (the version number has been modified appropriately).
I hope this patch can be added to Jhove to enhance its handling of textual files.
Thanks for your attention.

The accompanying patch is available from SourceForge: https://sourceforge.net/p/jhove/patches/_discuss/thread/ef9d4da0/52ff/attachment/withTextMD.patch

TIFF module should check for overlapping tag data

Dev Effort

Description

The TIFF specification says: "No data should be referenced from more than one place. TIFF readers and editors are under no obligation to detect this condition and handle it properly. This would not be a problem if TIFF files were read-only entities, but they are not. This warning covers both TIFF field value offsets and fields that are defined as offsets, such as StripOffsets."

The TIFF module doesn't currently check this, and some TIFF files cheat on this point, e.g., by using the same data storage for X and Y resolution if they're the same. Since this is a violation of the spec with regard to file structure, this should really be checked. We have a request for this check.

Java exception under Windows; seems to be config related

Dev Effort

0.5D

Description

After installing JHOVE on Windows and configuring it as described in the readme, execution results in:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/jhove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultConfigFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.ConfigWindow
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more

This happens under Windows 7; java version is 1.8.0_51. Going back through my files I see this is the same error I got almost 2 years ago:

http://openpreservation.org/blog/2014/01/31/why-cant-we-have-digital-preservation-tools-just-work/

And I remember this started happening after some changes to the way JHOVE looks for its configuration, which was prompted by this long-running issue:

http://sourceforge.net/p/jhove/bugs/53/

If I explicitly specify the location of the config file with the -c switch, e.g.:

jhove -c C:\jhove\conf\jhove.conf

then JHOVE does run normally.

Menus get lost when closing document window in GUI on OS X

If I "Open" a file with the GUI version of JHOVE version 1.12 beta and then close it again, bringing the focus back to the main window, the File, Edit, and Help menus disappear. They can be restored by bringing up "About JHOVE" from the JhoveView menu and then closing the resulting window. This was on OS X 10.10.5 (Yosemite).

After noticing this in 1.12 beta, I tried it with 1.9 and got the same result. It seems I should have noticed if it was happening all along. I tried it on Linux, and the problem doesn't occur there. I suspect it's something that's turned up in recent versions of OS X.

JhoveView: Markup Parsing Error: dynPolLoginRedirect.html

Dev Effort

Description

Ubuntu 10.04.4 LTS
JHOVE 1.10

Running JhoveView I get an error; however, it proceeds to start as expected. Command and trace below.

Command: java -jar JhoveView.jar

[Warning] jhove.conf:6:73: schema_reference.4: Failed to read schema document 'http://hul.harvard.edu/ois/xml/xsd/jhove/jhoveConfig.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
[Error] jhove.conf:6:73: cvc-elt.1: Cannot find the declaration of element 'jhoveConfig'.
[Fatal Error] dynPolLoginRedirect.html:1:3: The markup in the document preceding the root element must be well-formed.
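
The warning suggests that the parser tried to fetch the schema named in jhove.conf over the network and received something other than an XSD. As an illustration only (this is not JHOVE's actual configuration loader, and the class name, method name, and sample document are hypothetical), the Java sketch below parses a configuration document with validation switched off, so the xsi:schemaLocation hint is never dereferenced:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper; not part of JHOVE. */
final class OfflineConfigParse {

    /** Parses XML for well-formedness only; no schema is loaded or fetched. */
    static Document parseWithoutSchemaFetch(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // no DTD validation, no schema configured
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Configuration is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the schema URL below is never contacted.
        String xml = "<jhoveConfig xmlns=\"urn:example:config\""
                + " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
                + " xsi:schemaLocation=\"urn:example:config http://unreachable.invalid/config.xsd\">"
                + "<defaultEncoding>utf-8</defaultEncoding></jhoveConfig>";
        Document doc = parseWithoutSchemaFetch(xml);
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

The trade-off is that structural mistakes in the configuration are no longer caught by the parser, so a real loader would likely validate against a schema copy shipped with the application instead.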

README out of date

There is a v1.14 release, but the README still has a lot of outdated info, e.g. "The OPF is preparing to release a JHOVE 1.12.x-beta in September".

Installer hangs up on step 1, Windows 7

Tried to install on Windows 7 Enterprise with Service Pack 1, 64-bit OS. Installer 1.12.19 has been stuck on step 1 of 5 for almost 20 minutes now.

Broken website links

http://jhove.openpreservation.org/documentation/

The JavaDocs link under JHOVE API.
The JHOVE2 links.

http://jhove.openpreservation.org/modules/pdf/

All but the first three PDF reference links.

http://jhove.openpreservation.org/documentation/dev-module/

The first link on this page... and probably the rest.
Needs some general formatting love...

JhoveView taking ten minutes to initialise...

We've spotted this in our environments here at Archives New Zealand. Our IT vendor in the larger department has also found the same issue after considerable testing.

Here is their description of the issue:

I managed to install this on my Win7 PC and I get the same result, takes approx. 9min 50secs every time??!
I first tried installing it under my username, then under C:\Temp and got the same results with both locations
I then tried 2 different versions of Java – 706071 and 802518, still with the same result.
I then downloaded a Java decompile tool and decompiled all of the class files (thousands of them!!) I trawled through all of the files that would be the obvious culprit but found nothing (I’m no expert on Java mind you..)
I found heaps of LOOPS within the code but nothing that stood out, I could not find any code relating to a TIMEOUT either, I was thinking there was a 9:50 timeout somewhere??
I also tried the same test on a separate PC with the same result.
SO… in short, I do not know what is causing this? Is this supposed to be running on a particular version of Java, Is this software supported at all? Do you know if there is another method of using it to bypass this? (via cmd window or batch file..?) Sorry this is the first time I’ve seen this application and I’ve tried everything which seems logical to resolve it.

Any help appreciated as JhoveView is a useful tool for teaching, and also getting results quickly.

Thanks,

Ross

PDF module - Indirect objects in image dictionary not handled

We found an issue with scanned technical drawings (large bitmaps in CCITT G4 format) where the image width and height are indirect objects. JHOVE does not handle this case but tries to access them as PdfSimpleObjects, leading to a ClassCastException.
The fix is a few new lines in PdfModule.java.
Diff:

# This patch file was generated by NetBeans IDE
# It uses platform neutral UTF-8 encoding and \n newlines.
--- C:\usr\sw\jhove-1.11-original\classes\edu\harvard\hul\ois\jhove\module\PdfModule.java
+++ C:\usr\sw\jhove-1.11\classes\edu\harvard\hul\ois\jhove\module\PdfModule.java
@@ -1990,13 +1990,20 @@
                             imgList.add (new Property ("NisoImageMetadata",
                                     PropertyType.NISOIMAGEMETADATA, niso));
                             niso.setMimeType("application/pdf");
-                            PdfSimpleObject widObj = (PdfSimpleObject)
-                                xobdict.get ("Width");
+                            PdfSimpleObject widObj = null;
+                            PdfSimpleObject htObj = null;
+                            if (xobdict.get("Width") instanceof PdfIndirectObj) {
+                                PdfIndirectObj io = (PdfIndirectObj)xobdict.get("Width");
+                                widObj = (PdfSimpleObject)resolveIndirectObject(io);
+                                io = (PdfIndirectObj)xobdict.get("Height");
+                                htObj = (PdfSimpleObject)resolveIndirectObject(io);
+                            }
+                            else {
+                                widObj = (PdfSimpleObject)xobdict.get ("Width");
+                                htObj = (PdfSimpleObject)xobdict.get ("Height");
+                            }
                             niso.setImageWidth(widObj.getIntValue ());
-                            PdfSimpleObject htObj = (PdfSimpleObject)
-                                xobdict.get ("Height");
                             niso.setImageLength(htObj.getIntValue ());

                             // Check for filters to add to the filter list
                             Filter[] filters = ((PdfStream) xob).getFilters ();
                             String filt = extractFilters (filters, (PdfStream) xob);

/Håkan
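
The idea behind the patch can be shown in isolation. The following is a minimal sketch, not JHOVE code: `PdfObj`, `SimpleInt`, `IndirectRef`, and `ImageDimensions` are hypothetical stand-ins for the real PDF object classes, and the object table is a plain map. It resolves a dictionary entry through any indirect references before casting, so a direct and an indirect `Width` are read the same way:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins for a PDF object model; not the real JHOVE classes. */
interface PdfObj {}

/** A directly stored integer (plays the role of a simple object). */
final class SimpleInt implements PdfObj {
    final int value;
    SimpleInt(int value) { this.value = value; }
}

/** A reference such as "12 0 R" (plays the role of an indirect object). */
final class IndirectRef implements PdfObj {
    final int objectNumber;
    IndirectRef(int objectNumber) { this.objectNumber = objectNumber; }
}

final class ImageDimensions {

    /** Follows references until a direct object (or null) is reached; the hop limit guards against cycles. */
    static PdfObj resolve(PdfObj obj, Map<Integer, PdfObj> objectTable) {
        int hops = 0;
        while (obj instanceof IndirectRef) {
            if (++hops > 32) {
                throw new IllegalStateException("Too many levels of indirection");
            }
            obj = objectTable.get(((IndirectRef) obj).objectNumber);
        }
        return obj;
    }

    /** Reads an integer entry such as Width or Height, stored directly or by reference. */
    static int intEntry(Map<String, PdfObj> dict, String key, Map<Integer, PdfObj> objectTable) {
        PdfObj resolved = resolve(dict.get(key), objectTable);
        if (!(resolved instanceof SimpleInt)) {
            throw new IllegalArgumentException(key + " is missing or not an integer");
        }
        return ((SimpleInt) resolved).value;
    }

    public static void main(String[] args) {
        Map<Integer, PdfObj> objectTable = new HashMap<>();
        objectTable.put(12, new SimpleInt(9921));

        Map<String, PdfObj> imageDict = new HashMap<>();
        imageDict.put("Width", new IndirectRef(12));   // indirect, as in the scanned drawings
        imageDict.put("Height", new SimpleInt(7016));  // direct

        System.out.println(intEntry(imageDict, "Width", objectTable) + " x "
                + intEntry(imageDict, "Height", objectTable));
    }
}
```

Unlike the patch, which tests only `Width` and assumes `Height` is stored the same way, this sketch resolves each entry independently, so a file that mixes direct and indirect values is also covered.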

PDF module error with TeX-created documents

User Chris Yocum reports:
Anyway, here is the output that I am getting. You can try this on any TeX generated document and it should give you the same results.

java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)

Tomas Fischer 03-04-2013 :

I can confirm this bug, although the file is not TeX-generated, but from Acrobat Distiller. The file is attached. Here is my complete output:

Jhove (Rel. 1.9, 2012-12-17)
Date: 2013-03-04 13:59:26 CET
RepresentationInformation: b6c99639fc62e6a7430b78f6d8494931_http___www_bolagsverket_se_polopoly_fs_1_5530__Menu_general_column_content_file_p25_personinformation.pdf
ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
 LastModified: 2013-01-04 12:22:13 CET
 Size: 80219
 Format: PDF
 Version: 1.6
 Status: Not well-formed
 SignatureMatches:
  PDF-hul
 ErrorMessage: Unexpected error in findFonts: java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
  Offset: 1849
 MIMEtype: application/pdf
 PDFMetadata: 
  Objects: 0
  FreeObjects: 1
  IncrementalUpdates: 0
  DocumentCatalog: 
   PageLayout: SinglePage
   PageMode: UseNone
  Filters: 
   FilterPipeline: FlateDecode
  Fonts: 
   TrueType: 
    Font: 
     BaseFont: CBMFOF+Garamond
     FontSubset: true
     FirstChar: 32
     LastChar: 246
     FontDescriptor: 
      FontName: CBMFOF+Garamond
      Flags: Serif, Nonsymbolic
      FontBBox: -139, -307, 1063, 986
      FontFile2: true
     Encoding: WinAnsiEncoding
  XMP: <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26        ">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
     <rdf:Description rdf:about=""
           xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:format>application/pdf</dc:format>
        <dc:creator>
           <rdf:Seq>
              <rdf:li>Bolagsverket</rdf:li>
           </rdf:Seq>
        </dc:creator>
        <dc:title>
           <rdf:Alt>
              <rdf:li xml:lang="x-default">Produktbeskrivning P25_Personinformation</rdf:li>
           </rdf:Alt>
        </dc:title>
     </rdf:Description>
     <rdf:Description rdf:about=""
           xmlns:xmp="http://ns.adobe.com/xap/1.0/">
        <xmp:CreateDate>2008-10-13T15:55:07+02:00</xmp:CreateDate>
        <xmp:CreatorTool>PScript5.dll Version 5.2.2</xmp:CreatorTool>
        <xmp:ModifyDate>2012-08-17T15:56:07+02:00</xmp:ModifyDate>
        <xmp:MetadataDate>2012-08-17T15:56:07+02:00</xmp:MetadataDate>
     </rdf:Description>
     <rdf:Description rdf:about=""
           xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
        <pdf:Producer>Acrobat Distiller 8.1.0 (Windows)</pdf:Producer>
     </rdf:Description>
     <rdf:Description rdf:about=""
           xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
        <xmpMM:DocumentID>uuid:c90d60fd-280e-4af3-bf14-87f96badb896</xmpMM:DocumentID>
        <xmpMM:InstanceID>uuid:dde7d516-b11d-4d86-be2a-5cc56c489a1d</xmpMM:InstanceID>
     </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
  Pages: 
   Page: 
    Label: 1
   Page: 
    Label: 2
   Page: 
    Label: 3
   Page: 
    Label: 4
   Page: 
    Label: 5
   Page: 
    Label: 6
   Page: 
    Label: 7

b6c99639fc62e6a7430b78f6d8494931_http___www_bolagsverket_se_polopoly_fs_1_5530__Menu_general_column_content_file_p25_personinformation.pdf

@gmcgath replied

JHOVE is getting caught because it's seeing a keyword where it expects a font dictionary in a page node's resources. As far as I can tell from reading the spec, this is incorrect PDF. I've fixed it so that instead of throwing an exception it reports that it failed to see a font dictionary. This is in the checked-in PdfModule.java.
This seems to imply that many TeX-generated PDFs are broken. If there's something I've missed and a keyword object is valid in this context, please let me know. At least now the error message is more to the point, and there won't be a stack dump.
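
The kind of change described in this reply, checking the type and reporting a message instead of letting a cast fail, might look roughly like the sketch below. It is an assumption-laden illustration rather than the checked-in code: dictionaries are modelled as `java.util.Map`, a stray keyword as a `String`, and `FontResourceCheck` and the message wording are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not the actual PdfModule logic. */
final class FontResourceCheck {

    /**
     * Returns the font dictionary from a resources map, or null. When the Font
     * entry exists but is not a dictionary, a message is recorded instead of
     * a ClassCastException being thrown.
     */
    static Map<?, ?> fontDictionary(Map<String, Object> resources, List<String> messages) {
        Object font = resources.get("Font");
        if (font == null) {
            return null; // no Font entry: nothing to report here
        }
        if (!(font instanceof Map)) {
            messages.add("Expected font dictionary in resources, found "
                    + font.getClass().getSimpleName());
            return null;
        }
        return (Map<?, ?>) font;
    }

    public static void main(String[] args) {
        List<String> messages = new ArrayList<>();
        Map<String, Object> resources = new HashMap<>();
        resources.put("Font", "rstChar"); // a keyword where a dictionary is expected

        Map<?, ?> fonts = fontDictionary(resources, messages);
        System.out.println("fonts = " + fonts);
        for (String message : messages) {
            System.out.println("ErrorMessage: " + message);
        }
    }
}
```

Collecting messages in a list lets the caller keep processing the rest of the document and report every problem found, rather than stopping at the first unexpected object.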

Thomas Fischer replied 05-06-2013:

The fix doesn't seem to cover all cases. I was able to create a PDF file using pdfLaTeX which recreates the crash in 1.10b2. The crash is triggered as soon as I include the MinionPro font (i.e. commenting out the MinionPro package makes JHOVE run OK):
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[lf]{MinionPro}
\begin{document}
ABC
\end{document}
The output looks like this:

java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)
Jhove (Rel. 1.9, 2013-05-28)
Date: 2013-06-05 10:08:04 CEST
RepresentationInformation: /tmp/test.pdf
ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
LastModified: 2013-06-05 10:00:09 CEST
Size: 42554
Format: PDF
Status: Not well-formed
SignatureMatches:
PDF-hul
ErrorMessage: No document catalog dictionary
Offset: 0
MIMEtype: application/pdf

BTW, both the version from CVS and the tar-ball report version number 1.9 instead of 1.10b2 or something else.

@gmcgath replied:

Re Thomas Fischer: I'm not getting a crash, and it looks from the output you've posted as if JHOVE is in fact running to completion after writing out a stack dump. However, JHOVE isn't processing the file properly, or else it's broken and Acrobat is able to open it anyway. (This may hinge on fine points of what "broken" means.) I'm seeing that in trying to read the document catalog dictionary, JHOVE is instead getting a keyword of "rstChar". This is most likely a fragment of a "FirstChar" keyword.
There is legitimately a bug, but I'm afraid it will have to stay open for version 1.10. Hopefully I or someone else will find a fix for it later.

Denis Bitouzé 03-11-2013:

Hi,
is this bug still present in the current version of JHOVE (1.11)?
Best regards.
