Web Packager

Web Packager is a command-line tool to "package" websites in accordance with the specifications proposed at WICG/webpackage. It may look like gen-signedexchange, but it is built on top of gen-signedexchange and focuses on automating the generation of Signed HTTP Exchanges (a.k.a. SXGs) and optimizing page loading.

Web Packager HTTP Server is an HTTP server built on top of Web Packager. It functions like a reverse-proxy, receiving signing requests over HTTP. For more detail, see cmd/webpkgserver/README.md. This README focuses on the command-line tool.

Web Packager retrieves HTTP responses from servers and turns them into signed exchanges. Those signed exchanges are written to files in a way that preserves the URL path structure, so they can be deployed easily in some typical cases. In addition, Web Packager applies some optimizations to the signed exchanges to help the content render more quickly.

Web Packager is intended primarily as a showcase of how to speed up page loading with privacy-preserving prefetch. Web developers may port the logic from this codebase to their systems or integrate Web Packager into their systems. Web Packager's code is designed to allow custom logic to be injected; see the GoDoc comments for details. Note, however, that Web Packager is still at an early stage: see Limitations below.

Web Packager is not related to webpack.

Prerequisite

Web Packager is written in Go and thus requires a Go installation to build and run. See Getting Started on golang.org for how to install Go on your computer.

You will also need a certificate and private key pair to use for signing the exchanges. Note that the certificate must:

(For example, DigiCert offers the right kind of certificates.)

Then you will need to convert your certificate into the application/cert-chain+cbor format, which you can do using the instructions at:

Limitations

In this early phase, we may make backward-incompatible changes to the command line or API.

Web Packager aims to automatically meet most, but not all, Google SXG Cache requirements. In particular, pages that do not use responsive design should specify a supported-media annotation.

Web Packager does not handle request matching correctly. This should not matter unless your web server implements content negotiation using the Variants and Variant-Key headers (not the Vary header). We plan to support request matching in the future, but there is no ETA (estimated time of availability) at this moment.

Note: The above limitation is not expected to be a big deal even if your server serves signed exchanges conditionally using content negotiation: if you already have signed exchanges, you should not need Web Packager.

Install

go get -u github.com/google/webpackager/cmd/...

Usage

The simplest command looks like:

webpackager \
    --cert_cbor=cert.cbor \
    --private_key=priv.key \
    --cert_url=https://example.com/cert.cbor \
    --url=https://example.com/hello.html

It will retrieve an HTTP response from https://example.com/hello.html, generate a signed exchange with the given pair of certificate (cert.cbor) and private key (priv.key), then write it to ./sxg/hello.html.sxg. If hello.html has subresources that can be preloaded together, webpackager also retrieves those resources and generates their signed exchanges under ./sxg. Web Packager recognizes <link rel="preload"> and equivalent Link HTTP headers. It also adds preload links for CSS (stylesheets) used in HTML, and may use more heuristics in the future. See the defaultproc package to see exactly how the HTTP response is processed.

--cert_url specifies where the client will expect to find the CBOR-format certificate chain. --cert_cbor is optional when it can be fetched from --cert_url. Note the reverse is not true: --cert_url is always required.

The --url flag can be repeated as many times as you want. For example:

webpackager \
    --cert_cbor=cert.cbor \
    --private_key=priv.key \
    --cert_url=https://example.com/cert.cbor \
    --url=https://example.com/foo/ \
    --url=https://example.com/bar/ \
    --url=https://example.com/baz/

would generate the following three files:

  • ./sxg/foo/index.html.sxg for https://example.com/foo/
  • ./sxg/bar/index.html.sxg for https://example.com/bar/
  • ./sxg/baz/index.html.sxg for https://example.com/baz/

Note: webpackager expects all target URLs to have the same origin. In particular, the output files collide if you specify more than one URL that has the same path but a different domain.
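The mapping from URL to output file described above can be sketched in Go. This is only an illustration of the behavior the examples show, not Web Packager's actual code: the helper name sxgPath is hypothetical, and the sketch assumes that the host and query string are ignored and that a path ending in "/" gets index.html appended.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// sxgPath maps a target URL to an output file path under sxgDir.
// Hypothetical helper for illustration only; it ignores the host and
// query string, which is an assumption based on the README examples.
func sxgPath(sxgDir, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Path
	if p == "" {
		p = "/"
	}
	if strings.HasSuffix(p, "/") {
		p += "index.html" // directory-style URLs map to index.html
	}
	return path.Join(sxgDir, p) + ".sxg", nil
}

func main() {
	for _, raw := range []string{
		"https://example.com/foo/",
		"https://example.com/hello.html",
		"https://other.example/hello.html", // same path, different domain
	} {
		out, err := sxgPath("sxg", raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", raw, out)
	}
}
```

Because the host plays no part in the result, the last two URLs map to the same file, which is the collision the note warns about.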

Using URL File

webpackager also accepts --url_file=FILE. FILE is a plain text file with one URL on each line. For example, you could create urls.txt with:

# This is a comment.
https://example.com/foo/
https://example.com/bar/
https://example.com/baz/

then run:

webpackager \
    --cert_cbor=cert.cbor \
    --private_key=priv.key \
    --cert_url=https://example.com/cert.cbor \
    --url_file=urls.txt

Changing Output Directory

You can change the output directory with the --sxg_dir flag:

webpackager \
    --cert_cbor=cert.cbor \
    --private_key=priv.key \
    --cert_url=https://example.com/cert.cbor \
    --sxg_dir=/tmp/sxg \
    --url=https://example.com/hello.html

Setting Expiration

The signed exchanges last one hour by default. You can change the duration with the --expiry flag. For example:

webpackager \
    --cert_cbor=cert.cbor \
    --private_key=priv.key \
    --cert_url=https://example.com/cert.cbor \
    --expiry=72h \
    --url=https://example.com/hello.html

would make the signed exchanges valid for 72 hours (3 days). The maximum is 168h (7 days), as required by the specification.

Other Flags

webpackager provides more flags for advanced usage (e.g. to set request headers). Run the tool with --help to see those flags.

Appendix: Deploying SXGs

The steps below illustrate an example of deploying Signed HTTP Exchanges on an Apache server.

  1. Upload cert.cbor to your server. Make it available at --cert_url.

  2. Upload *.sxg files to your server. Put them next to the original files (e.g. hello.html.sxg should stay in the same directory as hello.html). For example, if you are using the sftp command to upload, you can run:

    sftp> cd public_html
    sftp> put -r sxg/*
    

    assuming public_html to be the document root and sxg to be where you generated the *.sxg files.

  3. Edit or create .htaccess in public_html (or Apache's config file) to add the following settings:

    AddType application/signed-exchange;v=b3 .sxg
    
    <Files "cert.cbor">
      AddType application/cert-chain+cbor .cbor
    </Files>
    
    RewriteEngine On
    RewriteCond %{HTTP:Accept} application/signed-exchange
    RewriteCond %{REQUEST_FILENAME} !\.sxg$
    RewriteCond %{REQUEST_FILENAME}\.sxg -s
    RewriteRule .+ %{REQUEST_URI}.sxg [L]
    
    Header set X-Content-Type-Options: "nosniff"
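For readers who do not use Apache, the decision made by the three RewriteCond lines can be expressed as a small Go function. This is a sketch of the same logic, not a drop-in server: rewriteTarget is a hypothetical name, the nonEmpty callback stands in for Apache's -s file test, and the distinction between REQUEST_FILENAME and REQUEST_URI is glossed over by using a single path.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteTarget returns the path to serve for a request, mirroring
// the rewrite conditions above:
//  1. the Accept header mentions application/signed-exchange,
//  2. the requested path does not already end in .sxg,
//  3. a non-empty file exists at path + ".sxg".
//
// nonEmpty is supplied by the caller (an assumption of this sketch).
func rewriteTarget(accept, reqPath string, nonEmpty func(string) bool) string {
	if !strings.Contains(accept, "application/signed-exchange") {
		return reqPath
	}
	if strings.HasSuffix(reqPath, ".sxg") {
		return reqPath
	}
	if !nonEmpty(reqPath + ".sxg") {
		return reqPath
	}
	return reqPath + ".sxg"
}

func main() {
	// Pretend only hello.html has a signed exchange on disk.
	files := map[string]bool{"/hello.html.sxg": true}
	nonEmpty := func(p string) bool { return files[p] }

	sxgAccept := "text/html,application/signed-exchange;v=b3"
	fmt.Println(rewriteTarget(sxgAccept, "/hello.html", nonEmpty))
	fmt.Println(rewriteTarget("text/html", "/hello.html", nonEmpty))
	fmt.Println(rewriteTarget(sxgAccept, "/other.html", nonEmpty))
}
```

Whatever server you use, the .sxg response also needs the application/signed-exchange;v=b3 content type and the X-Content-Type-Options: nosniff header shown in the Apache settings.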
    

webpackager's Issues

Automate errcheck

This found some missing error handling in #37. This project needs some suppressions to run cleanly:

errcheck -ignore 'Close|Fprintf|Remove|RemoveAll|Write' ./...

Option to preload heuristically-determined hero image

This would likely significantly boost the LCP improvement from prefetching. It would provide parity with AMP Packager's preloadimage, though needn't use the same set of heuristics.

I'm not familiar enough with the architecture of webpackager. Does this make more sense as a new Processor, a new HTMLTask, or something else?

Fix webpkgserver so it doesn't error out due to dummy OCSP response

When serving self-signed certs using localhost:8080/webpkg/cert/xxxx (where xxxx = cert digest), webpkgserver reports an error while reading the dummy OCSP response in the cert cache:

  • asn1: structure error: tags don't match (16 vs {class:1 tag:4 length:117 isCompound:true}) {optional:false explicit:false application:false private:false defaultValue: tag: stringType:0 timeType:0 set:false omitEmpty:false} responseASN1 @2

Limit to 20 preloads

The Google SXG cache has a limit of 20 preloads. I don't see anything in the webpackager code that guarantees that requirement is met. (Did I miss it?) We should provide a [default enabled] option to do so.

Seems especially necessary given the optional HTMLTask to promote preload tags to headers, where the former are more prominent today, with no such limit.
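As a starting point for discussion, the cap itself could be as simple as the sketch below. The names capPreloads and maxPreloads are hypothetical and do not exist in the webpackager codebase; keeping the first entries and dropping the rest is an assumption, and a real implementation might prefer to rank preloads instead.

```go
package main

import "fmt"

// maxPreloads is the Google SXG cache limit mentioned in this issue.
const maxPreloads = 20

// capPreloads returns at most limit preload links, keeping the
// earliest ones. Hypothetical helper for illustration only.
func capPreloads(links []string, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if len(links) <= limit {
		return links
	}
	return links[:limit]
}

func main() {
	var links []string
	for i := 0; i < 25; i++ {
		links = append(links, fmt.Sprintf("https://example.com/res%d.css", i))
	}
	kept := capPreloads(links, maxPreloads)
	fmt.Println(len(kept)) // 20
}
```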

Document the CLI way to keep cert.cbor updated

The main README should:

  • Recommend regenerating the cert.cbor from cert.pem ~daily, so that it can pick up a fresh OCSP response.
  • Recommend renewing the cert.pem every ~80 days.
  • Recommend a tool for renewing with ACME. (Either verify that a popular CLI works with SXG certs, or write our own.)

Option for webpkgserver to relay custom headers

Hi!

We have a simple firewall that blocks non-browser requests to our websites by checking the User-Agent request header. Requests from webpkgserver get blocked because it uses the Go HTTP client's default User-Agent.

We use Nginx to proxy_pass to webpkgserver, e.g.:

proxy_pass http://127.0.0.1:8080/priv/doc/https://example.com/foo.html;

We're wondering if it's possible for us to add a custom request header that gets relayed to https://example.com/foo.html, so we can use that header to unblock requests from webpkgserver.

With the config below:

proxy_pass    http://127.0.0.1:8080/priv/doc/https://example.com/foo.html;
proxy_set_header    X-Is-WebPackager    1;

we expect all requests from webpkgserver to https://example.com/foo.html to carry the X-Is-WebPackager: 1 header. However, the custom header is not present, which results in the request being blocked by the firewall.

Is it by design?

Add option to remove high-entropy low-effect response headers before signing

webpkgserver will set a default lifetime of 1 day for JS resources and 7 days for others (src). However, any HTML that preloads JS is effectively 1-day, unless the publisher can refresh a JS SXG without updating its header-integrity.

GetFullHeader() should (default on, opt-out via toml config) remove any headers that are likely to change often, but don't affect the way the subresource is interpreted by the browser. The Date header comes to mind, but it's worth a cursory glance of the HTTP spec to unearth any others.

URL doesn't match the fetch targets

Hello, I am getting the following issue when sending a request to the webpkgserver:

2021/09/17 16:29:24 Listening at [::]:80
2021/09/17 16:29:24 Successfully retrieved valid OCSP.
2021/09/17 16:29:26 processing https://www.perlego.com/book/1690290/criminal-law-pdf ...
2021/09/17 16:29:26 error with processing https://www.perlego.com/book/1690290/criminal-law-pdf: fetch: URL doesn't match the fetch targets

This is the webpkgserver.toml file I'm using:

[Listen]
Port = 80

[SXG.Cert]
PEMFile = '/www_perlego_com.pem'
KeyFile = '/server.key'
AllowTestCert = false

[SXG]
CertURLBase = 'https://perlego.com/'

[[Sign]]
Domain = 'perlego.com'

This is how I'm sending the request:

wget -v -d --header="Accept: application/signed-exchange;v=b3" localhost/priv/doc/https://www.perlego.com/book/1690290/criminal-law-pdf

I'm confused: from reading the source code, I gather that the error "URL doesn't match the fetch targets" is raised when the Domain field of the server configuration does not match the actual host of the request, but in this case I believe it does.

Thanks in advance,
Juan

500 error occurs when ocsp cache does not exist

I think webpkgserver should return a 404 error if the cache does not exist.
(Looking here, that appears to be the intent.)
https://github.com/google/webpackager/blob/main/server/handler.go#L119-L122

However, in practice ioutil.ReadFile fails with "No such file or directory", and webpkgserver appears to return a 500 error.
https://github.com/google/webpackager/blob/main/certchain/certmanager/multicert_disk_cache.go#L115-L118

I don't know how certmanager.ErrNotFound is used, but what about returning certmanager.ErrNotFound here?
https://github.com/google/webpackager/blob/main/certchain/certchainutil/certchainutil.go#L45-L48

Make ResourceCache scalable

IIUC, webpkgserver always runs with an in-memory resource cache, which is unbounded and lacking expiration or eviction. (Expired entries are never deleted from memory unless they are replaced.) This cache persists for the life of the server.

This is OK only for a site with a small number of resources. We need to add some configuration parameters to make this work for larger sites. Some rough ideas:

  • Max size of the cache (including 0 to disable). For starters, we can do it by # of entries, but # of bytes would be a nice addition.
  • Option to use the file-based cache as a backend to the memory cache, so it can be shared between replicas. This would reduce # of fetches.
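To make the first idea concrete, here is a minimal sketch of an entry-count-bounded cache with least-recently-used eviction. It does not implement webpackager's ResourceCache interface; boundedCache and its methods are hypothetical, the LRU policy is one possible choice, and the type is not safe for concurrent use without a mutex.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value []byte
}

// boundedCache holds at most max entries and evicts the least
// recently used one when full. max <= 0 disables caching.
type boundedCache struct {
	max   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newBoundedCache(max int) *boundedCache {
	return &boundedCache{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached value and marks it as recently used.
func (c *boundedCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put stores value under key, evicting the oldest entry if needed.
func (c *boundedCache) Put(key string, value []byte) {
	if c.max <= 0 {
		return
	}
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newBoundedCache(2)
	c.Put("/a", []byte("A"))
	c.Put("/b", []byte("B"))
	c.Get("/a")              // touch /a so /b becomes the oldest
	c.Put("/c", []byte("C")) // evicts /b
	_, okA := c.Get("/a")
	_, okB := c.Get("/b")
	fmt.Println(okA, okB) // true false
}
```

A byte-based limit would follow the same shape, tracking the total size of stored values instead of the element count.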

[webpkgserver] ignore third party subresources preloading

Our site preloads a third-party resource (AMP v0.js).

<link rel="preload" href="https://cdn.ampproject.org/lts/v0.js" as="script">

webpkgserver will attempt to fetch that resource, which will result in the following error.

2021/06/25 18:07:55 processing https://cdn.ampproject.org/lts/v0.js ...
2021/06/25 18:07:55 error with processing https://cdn.ampproject.org/lts/v0.js: fetch: URL doesn't match the fetch targets

Is there any way to ignore this preload?
