
Comments (18)

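l0o0 commented on July 20, 2024

I'm not sure how to implement this feature, because I don't know how this kind of content should be saved. Perhaps disabling some of the page's JavaScript would be enough, but that needs testing. I'll give it a try when I have time.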

from translators_cn.

Lemmingh commented on July 20, 2024
/*!
 * Prototype of Zotero web translator for 知乎 (Zhihu)
 *
 * Language: TypeScript 4.0; ES2020
 * Date: 2020-10-15
 *
 * Lemming <https://github.com/Lemmingh>
 * Licensed under the ISC License.
 */

/// <reference path="types-zotero/index.d.ts" />
/// <reference path="types-zhihu/index.d.ts" />

'use strict';

/* Main. */

async function doWeb(_doc: Document, url: string): Promise<void> {
    const newZoteroItem = new Zotero.Item('document');

    // Remove unnecessary query and fragment string from `url`, before performing other operations.
    url = url.split(/\?|#/)[0];

    // If there is any exception, let it propagate to terminate the process.
    const item: ZhihuItem = await getZhihuItem(url);

    // Generate the snapshot of the Zhihu item.
    const snapshot: HTMLDocument = getSnapshot(item);
    newZoteroItem.attachments.push({
        title: 'Snapshot',
        document: snapshot
    });

    // TODO: Set metadata.
    newZoteroItem.title = item.title;

    // Save.
    newZoteroItem.complete();
}

/**
 * Returns a `Promise<ZhihuItem>`.
 * @param {string} url - URL of the answer or Zhuanlan article.
 */
async function getZhihuItem(url: string): Promise<ZhihuItem> {

    // ! Warning:
    // ! RegExp lookbehind assertions are not available until ECMAScript 2018.
    // ! https://github.com/tc39/proposal-regexp-lookbehind

    // Zhuanlan article URL examples:
    // https://zhuanlan.zhihu.com/p/59589298
    // https://zhuanlan.zhihu.com/p/35295235
    const ZhuanlanArticleIdRegexp = /(?<=p\/)\d+/;

    // Answer URL examples:
    // https://www.zhihu.com/answer/984072342
    // https://www.zhihu.com/question/24952084/answer/984072342
    // https://www.zhihu.com/question/392313958/answer/1198915276
    const AnswerIdRegexp = /(?<=answer\/)\d+/;

    if (url.startsWith('https://zhuanlan.zhihu.com/p/')) {
        return getAsArticle();
    } else if (url.includes('/answer/')) {
        return getAsAnswer();
    } else {
        throw new Error("Not supported");
    }

    /* Sends an HTTP GET request to Zhihu API. */
    // Important:
    // Many Zhihu services, including the API, lack cross-origin headers.
    // This does not seem to matter, since a Zotero web translator seems to run in the same context as the active document in the browser.
    // And `fetch()` can be easily converted to `Zotero.HTTP.request()`.
    // ! A rejected `Promise` surfaces as an exception at `await`; callers catch it with try-catch (or `.catch()`).
    function fetchJson(aUrl: RequestInfo): Promise<Zhihu.Answer | Zhihu.Article> {
        return fetch(aUrl).then(res => {
            // `fetch()` resolves even on HTTP errors (e.g. 404), so check the status explicitly.
            if (!res.ok) {
                throw new Error(`Request failed with status ${res.status}`);
            }
            return res.json();
        });
    }

    async function getAsAnswer(): Promise<ZhihuItem> {
        // Construct API URL.
        const idStr = url.match(AnswerIdRegexp)![0];
        const apiUrl = `https://www.zhihu.com/api/v4/answers/${idStr}?include=content,excerpt`;

        const res = await fetchJson(apiUrl) as Zhihu.Answer;

        return {
            id: res.id,
            type: res.type as ZhihuItemType,
            author: {
                id: res.author.id,
                name: res.author.name,
            },
            content: res.content!,
            excerpt: res.excerpt!,
            title: res.question.title,
        };

    }

    async function getAsArticle(): Promise<ZhihuItem> {
        // Construct API URL.
        const idStr = url.match(ZhuanlanArticleIdRegexp)![0];
        const apiUrl = `https://zhuanlan.zhihu.com/api/articles/${idStr}`;

        const res = await fetchJson(apiUrl) as Zhihu.Article;

        return {
            id: res.id,
            type: res.type as ZhihuItemType,
            author: {
                id: res.author.id,
                name: res.author.name,
            },
            content: res.content,
            excerpt: new DOMParser().parseFromString(res.excerpt, 'text/html').body.innerText,
            title: res.title,
        };
    }
}

/**
 * Generates the snapshot of a ZhihuItem.
 * @param {ZhihuItem} item
 */
function getSnapshot(item: ZhihuItem): HTMLDocument {
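    // `new HTMLDocument()` throws "Illegal constructor" in browsers,
    // so parse the content into a new document with `DOMParser` instead.
    // Assuming that response from Zhihu is safe.
    const doc = new DOMParser().parseFromString(item.content, 'text/html') as HTMLDocument;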

    doc.title = item.title;

    // TODO: Inject other elements if necessary.

    return doc;
}


/* High-level representations of entities on Zhihu, with only properties we are interested in. */

/**
 * A unified representation of items (Zhihu answer or Zhuanlan article) on Zhihu.
 */
interface ZhihuItem {
    /**
     * Item ID.
     */
    id: number;

    /**
     * Item type.
     */
    type: ZhihuItemType;

    author: ZhihuAuthor;

    /**
     * Content.
     * An HTML string.
     */
    content: string;

    /**
     * The first few sentences of the content. Can be used as abstract.
     * A **plain text** string.
     */
    excerpt: string;

    /**
     * The name.
     *
     * For an article, use its title.
     * For an answer, use question title.
     */
    title: string;
}

const enum ZhihuItemType {
    Answer = 'answer',
    Article = 'article',
    People = 'people',
    Question = 'question',
}

/**
 * A user on Zhihu.
 */
interface ZhihuAuthor {
    /**
     * User ID.
     */
    id: string;

    /**
     * Item type.
     */
    // type: ZhihuItemType.People;

    /**
     * User name.
     */
    name: string;
}


l0o0 commented on July 20, 2024

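l0o0 commented on July 20, 2024

I gave it a try and found that a page saved for offline use shows some problems when opened.
[image]

I suspect Zhihu restricts cross-origin requests.
I don't have a translator for Zhihu in this repository at the moment; the translator currently in use is presumably maintained upstream by the official Zotero project.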

DansYU commented on July 20, 2024

Could this feature be implemented?
There are some quite useful technical posts on Zhihu that I'd like to manage with Zotero.

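Lemmingh commented on July 20, 2024

"I suspect Zhihu restricts cross-origin requests."

I don't think this is Zhihu's problem.

CORS (now part of the Fetch Standard) is very strict in itself. A CORS request is an HTTP request (only the HTTP and HTTPS schemes can be used), the standard specifies which headers are required and which are allowed, and a pile of other constraints apply. On top of that, some browsers add restrictions of their own beyond the standard.

That screenshot clearly shows a local HTML file opened directly, so the scheme is file (a local resource, which leaves the page without a usable Origin), and that certainly won't work. Try setting up an HTTP server.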
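As a rough illustration of the "set up an HTTP server" suggestion, here is a minimal sketch of a static file server built on Node's standard `http`, `fs`, and `path` modules. The helper names (`contentTypeFor`, `resolveInsideRoot`, `createSnapshotServer`), the folder name, and the port are made up for this example; nothing here comes from Zotero or Zhihu.

```typescript
import * as http from 'http';
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical lookup table: file extension -> Content-Type header value.
const CONTENT_TYPES: Record<string, string> = {
    '.html': 'text/html; charset=utf-8',
    '.js': 'text/javascript; charset=utf-8',
    '.css': 'text/css; charset=utf-8',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
};

function contentTypeFor(filePath: string): string {
    return CONTENT_TYPES[path.extname(filePath).toLowerCase()] ?? 'application/octet-stream';
}

// Maps a request URL to a file path under `root`.
// Returns null if the URL is malformed or would escape `root`.
function resolveInsideRoot(root: string, requestUrl: string): string | null {
    const absoluteRoot = path.resolve(root);
    let decoded: string;
    try {
        // Drop query and fragment, then undo percent-encoding.
        decoded = decodeURIComponent(requestUrl.split(/\?|#/)[0]);
    } catch {
        return null; // Malformed percent-encoding.
    }
    const target = path.join(absoluteRoot, decoded);
    const inside = target === absoluteRoot || target.startsWith(absoluteRoot + path.sep);
    return inside ? target : null;
}

// Creates (but does not start) a server that serves files from `root`.
function createSnapshotServer(root: string): http.Server {
    return http.createServer((req, res) => {
        const filePath = resolveInsideRoot(root, req.url ?? '/');
        if (filePath === null) {
            res.writeHead(403);
            res.end('Forbidden');
            return;
        }
        fs.readFile(filePath, (err, data) => {
            if (err) {
                // Missing files and directories both end up here.
                res.writeHead(404);
                res.end('Not found');
                return;
            }
            res.writeHead(200, { 'Content-Type': contentTypeFor(filePath) });
            res.end(data);
        });
    });
}

// Usage (not executed here):
// createSnapshotServer('./saved-page').listen(8080);
```

With the saved page served this way, it is loaded from an `http://localhost:…` origin rather than from `file:`. Whether that alone resolves the errors in the screenshot has not been verified here.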


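"How to implement this feature"

A while ago, 牛岱 released Zhihu On VSCode, which implements browsing Zhihu. So scraping Zhihu pages is certainly feasible.

You might follow his approach and call Zhihu's API.

Others online have looked into this as well.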
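To make the API route concrete, the sketch below maps a page URL to an API URL. The two endpoint patterns are the ones used in the prototype earlier in this thread; the function name `toZhihuApiUrl` and the `ZhihuApiTarget` type are invented for this example, and capture groups replace the prototype's lookbehind assertions. Whether these endpoints still behave the same way is not something the thread confirms.

```typescript
// Hypothetical result type for this sketch.
interface ZhihuApiTarget {
    kind: 'article' | 'answer';
    id: string;
    apiUrl: string;
}

// Returns the API URL for a Zhuanlan article or an answer, or null if the URL is not recognized.
function toZhihuApiUrl(pageUrl: string): ZhihuApiTarget | null {
    // Drop query and fragment before matching.
    const clean = pageUrl.split(/\?|#/)[0];

    // e.g. https://zhuanlan.zhihu.com/p/59589298
    const article = /^https:\/\/zhuanlan\.zhihu\.com\/p\/(\d+)/.exec(clean);
    if (article) {
        const id = article[1];
        return { kind: 'article', id, apiUrl: `https://zhuanlan.zhihu.com/api/articles/${id}` };
    }

    // e.g. https://www.zhihu.com/answer/984072342
    //      https://www.zhihu.com/question/24952084/answer/984072342
    const answer = /^https:\/\/www\.zhihu\.com\/(?:question\/\d+\/)?answer\/(\d+)/.exec(clean);
    if (answer) {
        const id = answer[1];
        return {
            kind: 'answer',
            id,
            apiUrl: `https://www.zhihu.com/api/v4/answers/${id}?include=content,excerpt`,
        };
    }

    return null;
}
```

Returning `null` for unsupported URLs leaves it to the caller to decide whether to throw or to skip the page.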


l0o0 commented on July 20, 2024

@Lemmingh Thanks for providing such detailed material. I'll take a look when I find the time.

l0o0 commented on July 20, 2024

@DansYU What exactly do you need: saving answers for offline use?

DansYU commented on July 20, 2024

What I need is to be able to save and download Zhihu column (Zhuanlan) pages and answers. I'd be very grateful if this could be implemented!

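luohc2004 commented on July 20, 2024

"@DansYU What exactly do you need: saving answers for offline use?"

Wow, this is still under active development. Zotero's open-source ecosystem is really nice, and the owner is impressive. I hope I can contribute myself someday. The links at the end of the project's README.MD are very helpful.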

l0o0 commented on July 20, 2024

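Lemmingh commented on July 20, 2024

The <head> of a Zhuanlan article carries Open Graph metadata, and the div.ContentItem.AnswerItem element of an answer carries Microdata metadata. These can be used as well, and should be more convenient than going through the API.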
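A minimal sketch of reading such metadata from the page follows. It assumes the Open Graph values sit in `<meta property="og:…" content="…">` tags (the usual Open Graph convention) and, more speculatively, that the Microdata values inside `div.ContentItem.AnswerItem` are exposed as `<meta itemprop="…" content="…">`; the thread does not show the actual markup. The `QueryRoot` interface and all function names are invented for the example, so that a `Document` or `Element` can be passed in without depending on DOM typings.

```typescript
// Structural stand-ins so the sketch works with a real Document/Element or a test double.
interface MetaLike {
    getAttribute(name: string): string | null;
}
interface QueryRoot {
    querySelectorAll(selectors: string): ArrayLike<MetaLike>;
}

// Collects `content` values keyed by the value of `keyAttr` for every node matching `selector`.
function collectMeta(root: QueryRoot, selector: string, keyAttr: string): Record<string, string> {
    const result: Record<string, string> = {};
    const nodes = root.querySelectorAll(selector);
    for (let i = 0; i < nodes.length; i++) {
        const key = nodes[i].getAttribute(keyAttr);
        const value = nodes[i].getAttribute('content');
        if (key !== null && value !== null) {
            result[key] = value;
        }
    }
    return result;
}

// Open Graph: keys such as "og:title", taken from whatever og:* tags are present.
function readOpenGraph(doc: QueryRoot): Record<string, string> {
    return collectMeta(doc, 'meta[property^="og:"]', 'property');
}

// Microdata (assumed markup): keys are the `itemprop` names found under the answer element.
function readMicrodata(answerItem: QueryRoot): Record<string, string> {
    return collectMeta(answerItem, 'meta[itemprop]', 'itemprop');
}

// Usage in a browser context (not executed here):
// const og = readOpenGraph(document);
// const answer = document.querySelector('div.ContentItem.AnswerItem');
// const micro = answer ? readMicrodata(answer) : {};
```

Nodes without a `content` attribute are skipped rather than stored as empty strings, so a missing value stays distinguishable from an empty one.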


l0o0 commented on July 20, 2024

@Lemmingh I don't quite understand: which translator controls the item for this snapshot?

Lemmingh commented on July 20, 2024

Do you mean this:
function getSnapshot(item: ZhihuItem): HTMLDocument

That is a function I made up myself.

The data format of item is ZhihuItem, which is defined in the block of declarations below it.

The whole prototype is written in TypeScript, for convenience.

l0o0 commented on July 20, 2024

@Lemmingh I see. However, Zotero currently doesn't seem to allow custom item types. I'm not yet sure which item type Zhihu content should map to. Alternatively, we could simply use the webpage type.

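Lemmingh commented on July 20, 2024

Have you perhaps not used TypeScript before? 😅

TypeScript can be thought of as JavaScript with static type checking added; it has to be compiled to JavaScript before it can run.


For Zhihu, the Zotero item type can simply be document:
let newZoteroItem = new Zotero.Item('document');

The type of snapshot is HTMLDocument. Zotero's documentation doesn't say so explicitly, but it can be observed. The types of the other objects can be inferred in the same way.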

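liuxsdev commented on July 20, 2024

"What I need is to be able to save and download Zhihu column (Zhuanlan) pages and answers. I'd be very grateful if this could be implemented!"

I think other tools could be used to save the page, with Zotero responsible only for extracting the metadata; in the end you just link the attachment to the Zotero item.

Some Chrome extensions can help: 简悦 (SimpRead), for example, has an adapter for Zhihu and can save pages offline as Markdown, PDF, HTML, and so on;
SingleFile saves offline HTML perfectly.

That's the approach I've been using lately to manage Bilibili videos: download the video manually, then link it to the item.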

l0o0 commented on July 20, 2024

@Lemmingh I followed the API links you gave earlier. In my test, Zhihu content can be saved into a note, and the images, links, and detailed formatting all render quite well.
[image]
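The implementation behind that test is not shown in the thread. As one possible shape for it, the sketch below stores the HTML returned by the API as a note on the item. It assumes that a translator item accepts notes as `{ note: string }` objects pushed onto `item.notes`, which is how Zotero translators commonly attach notes; `NoteCapableItem`, `escapeHtml`, and `addContentAsNote` are names invented for this example.

```typescript
// Minimal stand-in for the part of a Zotero translator item used here (assumption).
interface NoteCapableItem {
    title?: string;
    notes: Array<{ note: string }>;
}

// Escapes text so it can be placed safely inside HTML markup.
function escapeHtml(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Sets the item title and stores the content HTML as a note, with the title as a heading.
// `contentHtml` is inserted as-is, so it is assumed to be trusted (as in the prototype).
function addContentAsNote(item: NoteCapableItem, title: string, contentHtml: string): void {
    item.title = title;
    item.notes.push({ note: `<h1>${escapeHtml(title)}</h1>\n${contentHtml}` });
}

// Usage inside a translator (not executed here; `Zotero` and `zhihuItem` are not defined in this sketch):
// const newItem = new Zotero.Item('document');
// addContentAsNote(newItem, zhihuItem.title, zhihuItem.content);
// newItem.complete();
```

Only the title is escaped; the content is already HTML and would be mangled by escaping.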

