sheetjs / js-cfb Goto Github PK


:floppy_disk: OLE File Container Format

Home Page: https://sheetjs.com/cfb-editor

License: Apache License 2.0

JavaScript 95.68% Shell 0.57% Makefile 1.24% HTML 1.91% TypeScript 0.61%

cfb xls biff file-format storage mhtml

js-cfb's Introduction

Container File Blobs

Pure JS implementation of various container file formats, including ZIP and CFB.

Installation

In the browser:

<script src="dist/cfb.min.js" type="text/javascript"></script>

With npm:

$ npm install cfb

The xlscfb.js file is designed to be embedded in js-xlsx.

Library Usage

In node:

var CFB = require('cfb');

For example, to get the Workbook content from an Excel 2003 XLS file:

var cfb = CFB.read(filename, {type: 'file'});
var workbook = CFB.find(cfb, 'Workbook');
var data = workbook.content;

Command-Line Utility Usage

The cfb-cli module ships with a CLI tool for manipulating and inspecting supported files.

JS API

TypeScript definitions are maintained in types/index.d.ts.

The CFB object exposes the following methods and properties:

CFB.parse(blob) takes a nodejs Buffer or an array of bytes and returns a parsed representation of the data.

CFB.read(blob, opts) wraps parse.

CFB.find(cfb, path) performs a case-insensitive match for the path (or file name, if there are no slashes) and returns an entry object or null if not found.

CFB.write(cfb, opts) generates a file based on the container.

CFB.writeFile(cfb, filename, opts) creates a file with the specified name.

Parse Options

CFB.read takes an options argument. opts.type controls the behavior:

`type`	expected input
`"base64"`	string: Base64 encoding of the file
`"binary"`	string: binary string (byte `n` is `data.charCodeAt(n)`)
`"buffer"`	nodejs Buffer
`"file"`	string: path of file that will be read (nodejs only)
(default)	buffer or array of 8-bit unsigned int (byte `n` is `data[n]`)
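
To make the relationship between the input forms in the table concrete, here is a minimal sketch that converts the same bytes between the "buffer", "binary" and "base64" representations. It uses only the Node.js Buffer API and does not call the library. The sample bytes are the 8-byte CFB header signature, chosen only for illustration, and `binaryToBytes` is a hypothetical helper name.

```javascript
// The 8-byte CFB header signature, used here only as sample data.
var bytes = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

// "buffer": a nodejs Buffer holding the raw bytes.
var buf = Buffer.from(bytes);

// "binary": a string where byte n is data.charCodeAt(n).
var bin = buf.toString('latin1');

// "base64": the Base64 encoding of the same bytes.
var b64 = buf.toString('base64');

// Hypothetical helper: recover the byte array from a binary string.
function binaryToBytes(s) {
  var out = [];
  for (var i = 0; i < s.length; ++i) out.push(s.charCodeAt(i) & 0xFF);
  return out;
}

// Each form round-trips to the same bytes.
var fromBinary = binaryToBytes(bin);
var fromBase64 = Array.prototype.slice.call(Buffer.from(b64, 'base64'));
```

Per the table, each of these values would be paired with the matching `type` when handed to CFB.read, while the plain `bytes` array corresponds to the default case.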

Write Options

CFB.write and CFB.writeFile take an options argument.

opts.type controls the behavior:

`type`	output
`"base64"`	string: Base64 encoding of the file
`"binary"`	string: binary string (byte `n` is `data.charCodeAt(n)`)
`"buffer"`	nodejs Buffer
`"file"`	string: path of file that will be created (nodejs only)
(default)	buffer if available, array of 8-bit unsigned int otherwise

opts.fileType controls the output file type:

`fileType`	output
`'cfb'` (default)	CFB container
`'zip'`	ZIP file
`'mad'`	MIME aggregate document

opts.compression enables DEFLATE compression for ZIP file type.

Utility Functions

The utility functions are available in the CFB.utils object. Functions that accept a name argument strictly deal with absolute file names:

.cfb_new(?opts) creates a new container object.
.cfb_add(cfb, name, ?content, ?opts) adds a new file to the cfb. Set the option {unsafe:true} to skip existence checks (for bulk additions)
.cfb_del(cfb, name) deletes the specified file
.cfb_mov(cfb, old_name, new_name) moves the old file to new path and name
.use_zlib(require("zlib")) loads a nodejs zlib instance.

By default, the library uses a pure JS inflate/deflate implementation. NodeJS zlib.InflateRaw exposes the number of bytes read in versions after 8.11.0. If a supplied zlib does not support the required features, a warning will be displayed in the console and the pure JS fallback will be used.

Container Object Description

The objects returned by parse and read have the following properties:

.FullPaths is an array of the names of all of the streams (files) and storages (directories) in the container. The paths are properly prefixed from the root entry (so the entries are unique)
.FileIndex is an array, in the same order as .FullPaths, whose values are objects following the schema:

interface CFBEntry {
  name: string; /** Case-sensitive internal name */
  type: number; /** 1 = dir, 2 = file, 5 = root ; see [MS-CFB] 2.6.1 */
  content: Buffer | number[] | Uint8Array; /** Raw Content */
  ct?: Date; /** Creation Time */
  mt?: Date; /** Modification Time */
  ctype?: String; /** Content-Type (for MAD) */
}
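
As a rough illustration of how the two parallel arrays fit together, the sketch below walks a hand-built object with the same shape as the description above. The object is a stand-in, not something produced by the library, and `findEntry` is a hypothetical helper that only imitates the documented behavior of CFB.find (case-insensitive match on the path, or on the file name when there are no slashes). It is not the library's implementation.

```javascript
// Hand-built stand-in for a parsed container (assumed shape, not library output).
var container = {
  FullPaths: ['Root Entry/', 'Root Entry/Workbook', 'Root Entry/VBA/', 'Root Entry/VBA/dir'],
  FileIndex: [
    { name: 'Root Entry', type: 5, content: [] },
    { name: 'Workbook', type: 2, content: [1, 2, 3] },
    { name: 'VBA', type: 1, content: [] },
    { name: 'dir', type: 2, content: [4, 5] }
  ]
};

// Hypothetical helper: case-insensitive lookup by full path,
// or by bare entry name when the query contains no slash.
function findEntry(cfb, path) {
  var want = path.toUpperCase();
  var byName = path.indexOf('/') === -1;
  for (var i = 0; i < cfb.FullPaths.length; ++i) {
    var have = byName ? cfb.FileIndex[i].name : cfb.FullPaths[i];
    if (have.toUpperCase() === want) return cfb.FileIndex[i];
  }
  return null;
}

// FileIndex[i] describes the entry whose full path is FullPaths[i].
// Collect the streams (type 2) together with their content sizes.
var streams = [];
for (var i = 0; i < container.FullPaths.length; ++i) {
  var entry = container.FileIndex[i];
  if (entry.type === 2) {
    streams.push(container.FullPaths[i] + ' (' + entry.content.length + ' bytes)');
  }
}
```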

License

Please consult the attached LICENSE file for details. All rights not explicitly granted by the Apache 2.0 License are reserved by the Original Author.

References

MS-CFB: Compound File Binary File Format
ZIP APPNOTE.TXT: .ZIP File Format Specification
RFC1951: https://www.ietf.org/rfc/rfc1951.txt
RFC2045: https://www.ietf.org/rfc/rfc2045.txt
RFC2557: https://www.ietf.org/rfc/rfc2557.txt

js-cfb's People

Forkers

steveyen jokerslab watmough rossj kangkang721 yang123vc aeppic albanm yuryalkevich garrettluu stof isabella232 shearer12345 jimyzzp rinzler17

js-cfb's Issues

build fail when including cfb 1.0.3

I'm trying to import and use xlsx which depends on cfb. When I build my angular 5 project I get the following error:

ERROR in node_modules/cfb/types/index.d.ts(38,24): error TS2304: Cannot find name 'Buffer'.
src/app/components/purchase-order/register/register.component.ts(98,12): error TS2304: Cannot find name 'Buffer'.

I have no problem resolving this base type in my main project but it fails inside of cfb. Am I using an incompatible module or is this possibly a bug in cfb? My package.json file contents:

{
  "name": "venus-app-client",
  "version": "0.0.0",
  "license": "MIT",
  "scripts": {
    "ng": "ng",
    "start": "ng serve --proxy-config proxy-config.json",
    "build": "ng build",
    "test": "ng test",
    "lint": "ng lint",
    "e2e": "ng e2e"
  },
  "private": true,
  "dependencies": {
    "@angular/animations": "^5.2.3",
    "@angular/cdk": "^5.1.1",
    "@angular/common": "^5.2.3",
    "@angular/compiler": "^5.2.3",
    "@angular/core": "^5.2.3",
    "@angular/forms": "^5.2.3",
    "@angular/http": "^5.2.3",
    "@angular/platform-browser": "^5.2.3",
    "@angular/platform-browser-dynamic": "^5.2.3",
    "@angular/router": "^5.2.3",
    "@ngx-translate/core": "^9.1.1",
    "@types/lodash": "^4.14.100",
    "cfb": "^1.0.3",
    "codelyzer": "^4.1.0",
    "commander": "^2.14.1",
    "core-js": "^2.5.3",
    "font-awesome": "^4.7.0",
    "hammerjs": "^2.0.8",
    "lodash": "^4.17.4",
    "primeng": "^5.2.0",
    "printj": "^1.1.1",
    "rxjs": "^5.5.6",
    "typescript": "~2.4.2",
    "xlsx": "^0.12.1",
    "zone.js": "^0.8.20"
  },
  "devDependencies": {
    "@angular/cli": "^1.6.7",
    "@angular/compiler-cli": "^5.2.3",
    "@angular/language-service": "^5.2.3",
    "@types/jasmine": "^2.8.6",
    "@types/jasminewd2": "^2.0.2",
    "@types/node": "^6.0.96",
    "jasmine-core": "^2.9.1",
    "jasmine-spec-reporter": "^4.2.1",
    "karma": "^1.7.1",
    "karma-chrome-launcher": "^2.2.0",
    "karma-cli": "^1.0.1",
    "karma-coverage-istanbul-reporter": "^1.4.1",
    "karma-jasmine": "^1.1.1",
    "karma-jasmine-html-reporter": "^0.2.2",
    "protractor": "^5.3.0",
    "ts-node": "^3.3.0",
    "tslint": "^5.9.1"
  }
}

TypeError: file.slice is not a function

.../node_modules/cfb/cfb.js:370
var blob = file.slice(0,512);
                ^

TypeError: file.slice is not a function
    at parse (.../node_modules/cfb/cfb.js:370:17)
    at Object.read (.../node_modules/cfb/cfb.js:688:9)

My code is this:

const cfb = CFB.read(file, { type: 'file' })
// const cfb = CFB.parse(data)
const vbaDirEntry = CFB.find(cfb, 'VBA')
if (!vbaDirEntry) {
	throw new Error('VBA root not found')
}

const vbaDir = CFB.read(cfb, vbaDirEntry)
const modules = {}
for (const entry of vbaDir.FullPaths) {
	console.log(entry)
}

Infinite loop in get_sector_list with damaged .doc file

Hi there. I've come across a problematic .doc file that is causing an infinite loop in get_sector_list.

It looks like the 2nd half of this .doc file is all null, so it is definitely damaged & invalid, but it would be nice to avoid the infinite loop.

In this specific case, the loop starts off with j = 0, which results in the next j value being read from sectors[312], which is all null bytes due to the file corruption. This results in an infinite loop with j = 0.

I noticed that the chkd array is not being checked. Adding if (chkd[j]) break; at the top of the loop avoids the infinite loop and results in a later exception. Perhaps it's better to throw immediately inside the loop?
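
The "throw immediately" option could look roughly like the following. This is a standalone sketch, not the library's get_sector_list: `followChain` is a hypothetical name, `fat` is assumed to be a plain array in which `fat[j]` holds the sector that follows `j`, and `-2` is assumed as the end-of-chain sentinel.

```javascript
var ENDOFCHAIN = -2; // assumed end-of-chain sentinel

// Follow a FAT-style chain where fat[j] gives the sector after j.
// Throws on a revisited or out-of-range sector instead of looping forever.
function followChain(fat, start) {
  var seen = Object.create(null); // sectors already visited in this chain
  var chain = [];
  for (var j = start; j !== ENDOFCHAIN; j = fat[j]) {
    if (j < 0 || j >= fat.length) throw new Error('Sector ' + j + ' out of range');
    if (seen[j]) throw new Error('Sector chain loops back to sector ' + j);
    seen[j] = true;
    chain.push(j);
  }
  return chain;
}
```

With a table such as `[0]`, where sector 0 points back at itself as in the damaged file described above, the second visit to sector 0 raises an error rather than spinning.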

stream names should be limited to 32 UTF-16 code points, including the terminating null character

Thanks for js-cfb, and for the changes to truncate stream names introduced in 0e33eb6.

The MS-CFB spec says

storage and stream names are limited to 32 UTF-16 code points, including the terminating null character.

Currently, cfb.js truncates stream name to 32 characters, but as the name has to be null terminated, it should be truncated to 31, allowing WriteShift to pad the rest with 0.

So, in cfb.js#894

		if(_nm.length > 32) {
			console.error("Name " + _nm + " will be truncated to " + _nm.slice(0,32));
			_nm = _nm.slice(0, 32);
		}

all the 32s should be 31.
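
For illustration, here is a standalone sketch of the proposed limit. It is not the code in cfb.js: `encodeEntryName` is a hypothetical helper that fills a 64-byte name field with UTF-16LE code units and keeps at most 31 of them, so the final two bytes always remain a null terminator. It counts JavaScript string code units, so a name containing surrogate pairs could be cut in the middle of a pair.

```javascript
// Encode a directory entry name into a 64-byte UTF-16LE name field.
// At most 31 code units are kept so the last two bytes stay zero.
function encodeEntryName(name) {
  var MAX_UNITS = 31;
  var kept = name.length > MAX_UNITS ? name.slice(0, MAX_UNITS) : name;
  var field = [];
  for (var i = 0; i < 64; ++i) field.push(0); // zero padding doubles as the terminator
  for (var k = 0; k < kept.length; ++k) {
    var cu = kept.charCodeAt(k);
    field[2 * k] = cu & 0xFF;     // low byte first (little-endian)
    field[2 * k + 1] = cu >>> 8;  // high byte
  }
  // Name length in bytes, counting the terminating null character.
  return { name: kept, field: field, byteLength: 2 * (kept.length + 1) };
}
```

A 40-character name comes back truncated to 31 characters with `byteLength` equal to 64, which stays within the field size.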

In my testing this doesn't break any of my tools, but some throw warnings:

Python compoundfiles - warns missing NULL terminator in name
P7Zip - no warnings, displays the character that should be null
olebrowse - no warnings, doesn't display the character that should be null
olefile - detects no fatal parsing issue: "incorrect DirEntry name length >64 bytes"

I can submit a PR for this if you'd like

Infinite loop in build_full_paths with some files

I've run into a file that enters an infinite loop with my changes in PR #6, even with your additional loop fix in commit 8d85fb6. The infinite loop doesn't happen in previous version 1.1.0.

I believe the dad tree for this file is constructed a bit incorrectly, thus it has a loop in it. I haven't looked into fixing the dad tree, but I did manage to avoid the infinite loop by slightly changing the final naming loop to:

	for(i=1; i < pl; ++i) {
		if(FI[i].type === 0 /* unknown */) continue;
		if (i !== dad[i]) {
			j = i;
			do {
				j = dad[j];
				FP[i] = FP[j] + "/" + FP[i];
			} while (j !== 0 && -1 !== dad[j] && j != dad[j]);
		}
		dad[i] = -1;
	}

I'm unable to share the file publicly, but I can email it if you would like.

Weird entry names in .msi file

MSI files also use the MS-CFB format.

I opened https://cmake.org/files/v3.11/cmake-3.11.1-win64-x64.msi with http://sheetjs.com/cfb-editor

The file can be opened and the content of entries seems to be correct but file names seem wrong:

fs.readFileSync is not a function

How do I use "fs" in my browser?

Write performance issue & fix (rebuild_cfb)

Hi there,

I discovered today some unusually slow operations when trying to create a rather large .msg file with roughly 63k data nodes. I tracked the slowness down to a nested loop in rebuild_cfb() that is ensuring that each stream node has a parent storage node.

I made some modifications to track the names in a JS object instead, and the rebuild_cfb() time dropped from about 1 minute to only 30 ms. I decided to use a plain object instead of a Set just to maintain maximum compatibility with old browsers.

Below are my changes with the added fullPaths object and the removed nested loop.

	// Used to track which names exist
	var fullPaths = Object.create(null);
	var data/*:Array<[string, CFBEntry]>*/ = [];
	for(i = 0; i < cfb.FullPaths.length; ++i) {
		fullPaths[cfb.FullPaths[i]] = true;
		if(cfb.FileIndex[i].type === 0) continue;
		data.push([cfb.FullPaths[i], cfb.FileIndex[i]]);
	}
	for(i = 0; i < data.length; ++i) {
		var dad = dirname(data[i][0]);
		s = fullPaths[dad];
		if(!s) {
			data.push([dad, ({
				name: filename(dad).replace("/",""),
				type: 1,
				clsid: HEADER_CLSID,
				ct: now, mt: now,
				content: null
			}/*:any*/)]);
			// Add name to set
			fullPaths[dad] = true;
		}
	}

build_full_paths does not prepend "root_entry/" if child is before parent

I came across an issue where the FullPaths array returned from a cfb.parse call does not properly prepend the root name (root_entry/) to child nodes (with depth greater than 1) that appear before their parent node.

It seems that in such a situation, build_full_paths properly constructs the dad tree, but terminates the final path construction loop too early, before it has prepended root_entry/.

I have created and attached a sample file that demonstrates the issue.

The returned FullPaths array is:

[
    "Root Entry/",
    "some folder/some child",
    "Root Entry/some folder/"
]

Despite "some child" being a child of "some folder"
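
The expected result can be sketched independently of entry order. The following is not the library's build_full_paths: `buildFullPaths` and its three input arrays are assumptions for illustration, with `dad[i]` holding the parent index of entry `i` and the root at index 0. Each path is rebuilt from the original names by walking parent links all the way to the root, so a child listed before its parent still receives the full prefix. A step counter guards against cycles in the parent links.

```javascript
// names[i] is the entry name, dad[i] its parent index (the root, index 0, is its own parent),
// and isDir[i] marks storages, which get a trailing slash.
function buildFullPaths(names, dad, isDir) {
  var paths = [];
  for (var i = 0; i < names.length; ++i) {
    var path = names[i] + (isDir[i] ? '/' : '');
    var j = i, steps = 0;
    // Walk up until the root is reached, prepending each ancestor's name.
    while (j !== 0 && dad[j] !== j) {
      if (++steps > names.length) throw new Error('Cycle in parent links at entry ' + i);
      j = dad[j];
      path = names[j] + '/' + path;
    }
    paths.push(path);
  }
  return paths;
}

// Same layout as the report: the child (index 1) appears before its parent (index 2).
var paths = buildFullPaths(
  ['Root Entry', 'some child', 'some folder'],
  [0, 2, 0],
  [true, false, true]
);
```

For this input the second entry resolves to "Root Entry/some folder/some child", which is the path the report expects.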

Here is the view from DFVIEW.EXE

path-test.zip

Bad parenting / hierarchy construction when parent is after R / L sibling tree

I've noticed some files that show an incorrect folder structure with this library. I believe the dad array is not filled correctly when a node comes before its parent and has R or L sibling nodes. I've attached a sample file that demonstrates the issue.

In the attached .cfb file, both some file 1 and some file 2 are sibling files under some folder, but js-cfb shows some file 2 as a root level file.

file.zip

cfb_add and write performance issues

Hi there,

I'm working on a program which converts .pst files to .msg files, primarily in Node but also in the browser, and it uses this library in a very write-heavy way for saving the .msg files. Through testing and profiling, I've noticed a couple write-related performance issues that I wanted to share.

With some modifications, I've been able to reduce the output generation time of my primary "large" test case (4300 .msg files from 1 .pst) by a factor of 8 from about 16 minutes to 2 minutes (running on Node).

The 1st issue, which may just be a matter of documentation, is that using cfb_add repeatedly to add all streams to a new doc is very slow, as it calls cfb_gc and cfb_rebuild every time. We switched from using cfb_add to directly pushing to cfb.FileIndex and cfb.FullPaths (and then calling cfb_rebuild once at the end) which reduced the output time from 16 minutes to 3.5 minutes.

The 2nd issue is that the _write and WriteShift functions do not utilize Buffer capabilities when it is available. By using Buffer.alloc() for the initial creation, which guarantees a 0-filled initialization, along with Buffer.copy for content streams, Buffer.write for hex / utf16le strings, and Buffer's various write int / uint methods, we were able to further reduce the output time from 3.5 minutes to 2 minutes.

If you wish, I would be happy to share my changes, or to work on a pull request which uses Buffer functions when available. My current changes don't do any feature detection, and rather just rely on Buffer being available, as even in the browser we use feross/buffer, so it would need some more work to maintain functionality in non-Buffer environments.
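
The feature detection mentioned above might be sketched as follows. These are not the reporter's changes or the library's _write and WriteShift functions; all helper names here are hypothetical. Each helper uses the Buffer method when the output is a Buffer and falls back to per-byte array writes otherwise.

```javascript
var has_buf = typeof Buffer !== 'undefined' && typeof Buffer.alloc === 'function';

// Zero-filled output of the requested size: Buffer when available, plain array otherwise.
function newOutput(size) {
  if (has_buf) return Buffer.alloc(size);
  var arr = new Array(size);
  for (var i = 0; i < size; ++i) arr[i] = 0;
  return arr;
}

// Write a 32-bit little-endian signed integer at offset.
function writeInt32LE(out, val, offset) {
  if (has_buf && Buffer.isBuffer(out)) { out.writeInt32LE(val, offset); return; }
  out[offset] = val & 0xFF;
  out[offset + 1] = (val >>> 8) & 0xFF;
  out[offset + 2] = (val >>> 16) & 0xFF;
  out[offset + 3] = (val >>> 24) & 0xFF;
}

// Write a string as UTF-16LE at offset.
function writeUtf16LE(out, str, offset) {
  if (has_buf && Buffer.isBuffer(out)) { out.write(str, offset, 'utf16le'); return; }
  for (var i = 0; i < str.length; ++i) {
    var cu = str.charCodeAt(i);
    out[offset + 2 * i] = cu & 0xFF;
    out[offset + 2 * i + 1] = cu >>> 8;
  }
}

// Copy raw content bytes into out at offset.
function writeBytes(out, bytes, offset) {
  if (has_buf && Buffer.isBuffer(out) && Buffer.isBuffer(bytes)) { bytes.copy(out, offset); return; }
  for (var i = 0; i < bytes.length; ++i) out[offset + i] = bytes[i];
}
```

Checking `Buffer.isBuffer(out)` on each call, rather than relying on the global flag alone, keeps the helpers usable when a caller passes a plain array in an environment that also has Buffer.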

Thanks

process out of memory Exception

When xlsjs converts a big Excel file (.xls) via readFile(), node fails with the error below:

FATAL ERROR: JS Allocation failed - process out of memory

Can you help me? How can I solve it?

I tried the setting below:

$ node --max-old-space-size=8192 my-node-script.js

If anyone has a solution, please respond. : )

XLSX broken file

/** Chase down the rest of the DIFAT chain to build a comprehensive list
    DIFAT chains by storing the next sector number as the last 32 bits */
function sleuth_fat(idx, cnt, sectors, ssz, fat_addrs) {
        var q = ENDOFCHAIN;
        if(idx === ENDOFCHAIN) {
                // if(cnt !== 0) throw new Error("DIFAT chain shorter than expected");
                if(cnt !== 0) console.log("DIFAT chain shorter than expected");
        } else if(idx !== -1 /*FREESECT*/) {
                var sector = sectors[idx], m = (ssz>>>2)-1;
                if(!sector) return;
                for(var i = 0; i < m; ++i) {
                        if((q = __readInt32LE(sector,i*4)) === ENDOFCHAIN) break;
                        fat_addrs.push(q);
                }
                sleuth_fat(__readInt32LE(sector,ssz-4),cnt - 1, sectors, ssz, fat_addrs);
        }
}

How to update xml file in CFB file?

I need to update an XML file inside a CFB file, just like the CFB Editor does. How can I implement it?

parsing files in container.content

Hi I'm hoping you can help me on this... for this project:
https://github.com/SuddenDevelopment/ScanWordDoc

I'm trying to extract the macro as a string. I can detect that it's there, but the cfb file.content is a buffer. I can toString('utf8') that buffer, but the result is still a long way from workable: I get a bunch of unreadable characters with the macro somewhere in there. In this format I can't treat it like a string; I can't get an indexOf or regex match on anything except for an attribute in .content that is in quotes.

how can I parse the .content buffer in a word doc file to work with it further?

I have also tried passing the resulting .content buffers to cfb.parse and cfb.read with every option I could find :)

thanks

Problematic file giving infinite loop / OOM in make_sector_list

Hi there,

I have a .doc file that is not exiting the inner loop of make_sector_list, leading to an OOM crash. There is some issue with the file itself, as Word gives an error opening it; however, DFVIEW.EXE opens the file and will display all the contained streams without complaining.

Even if the file is bad and can't be read, hopefully there is some condition that can throw an exception rather than OOM.

Able to remove Seed file item

This might be a good item to add to allow for the file to be created without the seed.

function _write(cfb, options) {
	var _opts = options || {};
	/* MAD is order-sensitive, skip rebuild and sort */
	if(_opts.fileType == 'mad') return write_mad(cfb, _opts);
	rebuild_cfb(cfb);
	/* Added: with the proposed noseed option, delete the seed entry so it is not written */
	if(_opts.noseed) cfb_del(cfb, "/\u0001Sh33tJ5");
	switch(_opts.fileType) {
		case 'zip': return write_zip(cfb, _opts);
		//case 'mad': return write_mad(cfb, _opts);
	}
	/* ... remainder of _write unchanged ... */
}
