
Capture Zhihu pages · translators_cn · 18 comments · CLOSED

l0o0 commented on September 20, 2024

Capture Zhihu pages

from translators_cn.

Comments (18)

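l0o0 commented on September 20, 2024

I'm not sure how to implement this feature, because I don't know how to save this kind of page. Perhaps disabling some of the page's JavaScript would be enough, but that needs testing. I'll try it when I have time.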

Lemmingh commented on September 20, 2024

/*!
 * Prototype of Zotero web translator for 知乎 (Zhihu)
 *
 * Language: TypeScript 4.0; ES2020
 * Date: 2020-10-15
 *
 * Lemming <https://github.com/Lemmingh>
 * Licensed under the ISC License.
 */

/// <reference path="types-zotero/index.d.ts" />
/// <reference path="types-zhihu/index.d.ts" />

'use strict';

/* Main. */

async function doWeb(_doc: Document, url: string): Promise<void> {
    const newZoteroItem = new Zotero.Item('document');

    // Remove unnecessary query and fragment string from `url`, before performing other operations.
    url = url.split(/\?|#/)[0];

    // If there is any exception, let it propagate to terminate the process.
    const item: ZhihuItem = await getZhihuItem(url);

    // Generate the snapshot of the Zhihu item.
    const snapshot: HTMLDocument = getSnapshot(item);
    newZoteroItem.attachments.push({
        title: 'Snapshot',
        document: snapshot
    });

    // TODO: Set metadata.
    newZoteroItem.title = item.title;

    // Save.
    newZoteroItem.complete();
}

/**
 * Returns a `Promise<ZhihuItem>`.
 * @param {string} url - URL of the answer or Zhuanlan article.
 */
async function getZhihuItem(url: string): Promise<ZhihuItem> {

    // ! Warning:
    // ! RegExp lookbehind assertions are not available until ECMAScript 2018.
    // ! https://github.com/tc39/proposal-regexp-lookbehind

    // Zhuanlan article URL examples:
    // https://zhuanlan.zhihu.com/p/59589298
    // https://zhuanlan.zhihu.com/p/35295235
    const ZhuanlanArticleIdRegexp = /(?<=p\/)\d+/;

    // Answer URL examples:
    // https://www.zhihu.com/answer/984072342
    // https://www.zhihu.com/question/24952084/answer/984072342
    // https://www.zhihu.com/question/392313958/answer/1198915276
    const AnswerIdRegexp = /(?<=answer\/)\d+/;

    if (url.startsWith('https://zhuanlan.zhihu.com/p/')) {
        return getAsArticle();
    } else if (url.includes('/answer/')) {
        return getAsAnswer();
    } else {
        throw new Error("Not supported");
    }

    /* Sends an HTTP GET request to Zhihu API. */
    // Important:
    // Many Zhihu services, including API, lack cross-origin headers.
    // Well, it doesn't seem to matter, since Zotero web translator seems to run in the same context as the active document in the browser.
    // And `fetch()` can be easily converted to `Zotero.HTTP.request()`.
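    // ! A rejected `Promise` must be handled with `.catch()` or with try-catch around `await`.
    function fetchJson(aUrl: RequestInfo): Promise<Zhihu.Answer | Zhihu.Article> {
        return fetch(aUrl).then(res => {
            // Fail early on HTTP errors instead of parsing an error page as JSON.
            if (!res.ok) {
                throw new Error(`Zhihu API request failed with HTTP status ${res.status}`);
            }
            return res.json();
        });
    }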

    async function getAsAnswer(): Promise<ZhihuItem> {
        // Construct API URL.
        const idStr = url.match(AnswerIdRegexp)![0];
        const apiUrl = `https://www.zhihu.com/api/v4/answers/${idStr}?include=content,excerpt`;

        const res = await fetchJson(apiUrl) as Zhihu.Answer;

        return {
            id: res.id,
            type: res.type as ZhihuItemType,
            author: {
                id: res.author.id,
                name: res.author.name,
            },
            content: res.content!,
            excerpt: res.excerpt!,
            title: res.question.title,
        };

    }

    async function getAsArticle(): Promise<ZhihuItem> {
        // Construct API URL.
        const idStr = url.match(ZhuanlanArticleIdRegexp)![0];
        const apiUrl = `https://zhuanlan.zhihu.com/api/articles/${idStr}`;

        const res = await fetchJson(apiUrl) as Zhihu.Article;

        return {
            id: res.id,
            type: res.type as ZhihuItemType,
            author: {
                id: res.author.id,
                name: res.author.name,
            },
            content: res.content,
            excerpt: new DOMParser().parseFromString(res.excerpt, 'text/html').body.innerText,
            title: res.title,
        };
    }
}

/**
 * Generates the snapshot of a ZhihuItem.
 * @param {ZhihuItem} item
 */
function getSnapshot(item: ZhihuItem): HTMLDocument {
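    // `HTMLDocument` has no public constructor (`new HTMLDocument()` throws a TypeError),
    // so create a detached document through the DOM implementation instead.
    const doc = document.implementation.createHTMLDocument(item.title);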

    // Assuming that response from Zhihu is safe.
    doc.body.innerHTML = item.content;

    doc.title = item.title;

    // TODO: Inject other elements if necessary.

    return doc;
}


/* High-level representations of entities on Zhihu, with only properties we are interested in. */

/**
 * A unified representation of items (Zhihu answer or Zhuanlan article) on Zhihu.
 */
interface ZhihuItem {
    /**
     * Item ID.
     */
    id: number;

    /**
     * Item type.
     */
    type: ZhihuItemType;

    author: ZhihuAuthor;

    /**
     * Content.
     * An HTML string.
     */
    content: string;

    /**
     * The first few sentences of the content. Can be used as abstract.
     * A **plain text** string.
     */
    excerpt: string;

    /**
     * The name.
     *
     * For an article, use its title.
     * For an answer, use question title.
     */
    title: string;
}

const enum ZhihuItemType {
    Answer = 'answer',
    Article = 'article',
    People = 'people',
    Question = 'question',
}

/**
 * A user on Zhihu.
 */
interface ZhihuAuthor {
    /**
     * User ID.
     */
    id: string;

    /**
     * Item type.
     */
    // type: ZhihuItemType.People;

    /**
     * User name.
     */
    name: string;
}
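
The warning in `getZhihuItem` points out that lookbehind assertions need ECMAScript 2018. If the translator ever has to run on an older engine, the same IDs can be pulled out with plain capture groups. The following is only a sketch of that idea; `parseZhihuUrl` and `ParsedZhihuUrl` are hypothetical names that do not appear in the prototype above.

```typescript
/** Hypothetical result type: which kind of Zhihu item a URL points to, and its ID. */
interface ParsedZhihuUrl {
    kind: 'answer' | 'article';
    id: string;
}

/**
 * Extracts the item ID with capture groups instead of lookbehind assertions.
 * Returns `null` for URLs that are neither an answer nor a Zhuanlan article.
 */
function parseZhihuUrl(url: string): ParsedZhihuUrl | null {
    // Drop the query and fragment first, as `doWeb` does.
    const clean = url.split(/\?|#/)[0];

    // Zhuanlan article, e.g. https://zhuanlan.zhihu.com/p/59589298
    const article = /^https:\/\/zhuanlan\.zhihu\.com\/p\/(\d+)/.exec(clean);
    if (article) {
        return { kind: 'article', id: article[1] };
    }

    // Answer, e.g. https://www.zhihu.com/question/24952084/answer/984072342
    const answer = /\/answer\/(\d+)/.exec(clean);
    if (answer) {
        return { kind: 'answer', id: answer[1] };
    }

    return null;
}
```

Returning `null` instead of throwing leaves it to the caller to decide whether an unsupported URL is an error.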


l0o0 commented on September 20, 2024

I'm learning JavaScript as I go, just enough to deal with these translators. I'll study what you've described carefully; let me try it first. (In reply to Lemmingh's message of Wednesday, July 15, 2020, 22:00.)

l0o0 commented on September 20, 2024

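I tried it and found that the page saved for offline use has some problems when opened.

My guess is that Zhihu has imposed some restrictions on cross-origin requests.
This repository currently has no translator for Zhihu; the translator in use now is presumably maintained upstream by the official Zotero project.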

DansYU commented on September 20, 2024

Is this feature feasible?
There are some useful technical posts on Zhihu that I'd like to manage with Zotero.

Lemmingh commented on September 20, 2024

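> My guess is that Zhihu has imposed some restrictions on cross-origin requests.

I don't think this is Zhihu's problem.

CORS (now part of the Fetch Standard) is strict in itself. A CORS request is an HTTP request (only the HTTP and HTTPS schemes can be used); the standard specifies which headers are required and which are allowed, along with a number of other constraints. On top of that, some browsers add restrictions of their own beyond the standard.

The screenshot clearly shows a local HTML file opened directly, so the scheme is file (a local resource, which leaves the page without an Origin). That cannot work. Try serving the page from an HTTP server.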
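
A quick way to tell which of the two cases a saved page is in is to look at the scheme of its URL. This is a minimal sketch, assuming a runtime that provides the standard `URL` class (browsers and Node.js both do); `isServedOverHttp` is a hypothetical helper and the example paths are made up.

```typescript
/**
 * Returns true when the page URL uses HTTP or HTTPS, i.e. the page comes
 * from a server rather than from a local file opened directly.
 */
function isServedOverHttp(pageUrl: string): boolean {
    const { protocol } = new URL(pageUrl);
    return protocol === 'http:' || protocol === 'https:';
}

// A snapshot opened by double-clicking the file:
console.log(isServedOverHttp('file:///home/user/snapshot.html'));     // false
// The same snapshot served by a local HTTP server:
console.log(isServedOverHttp('http://localhost:8000/snapshot.html')); // true
```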

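> How to implement this feature

A while ago, 牛岱 released Zhihu On VSCode, which implements browsing Zhihu. So capturing Zhihu pages is certainly feasible.

It may be worth following his approach and calling Zhihu's API:

Other people have also looked into this, for example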

l0o0 commented on September 20, 2024

@Lemmingh Thanks for providing such detailed material. I'll look through it when I find the time.

l0o0 commented on September 20, 2024

@DansYU What exactly do you need? To save answers for offline reading?

DansYU commented on September 20, 2024

What I need is to be able to save and download Zhihu column (Zhuanlan) pages and answers. I'd be very grateful if this could be implemented!

luohc2004 commented on September 20, 2024

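> @DansYU What exactly do you need? To save answers for offline reading?

Wow, this is still under active development. The Zotero open-source ecosystem is really good, and the owner is impressive. I hope I can contribute myself someday. The links at the end of the project's README.MD are very helpful.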

l0o0 commented on September 20, 2024

Join the development when you have time. I still don't know how to handle Zhihu 😂 (In reply to FrenzyBoy's message of Monday, July 13, 2020, 08:22.)

Lemmingh commented on September 20, 2024

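The <head> of a Zhuanlan article contains Open Graph metadata. For an answer, div.ContentItem.AnswerItem contains Microdata metadata. These can be used as well, and should be more convenient than going through the API.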
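
Reading the Open Graph tags could look roughly like the sketch below. It assumes the tags follow the usual `<meta property="og:..." content="...">` form; which properties Zhihu actually emits is not stated in this thread. `readOpenGraph`, `MetaRoot` and `AttributeSource` are hypothetical names, and the two small interfaces only describe the part of the DOM the function touches, so a real `Document` can be passed in.

```typescript
/** The part of an element this sketch relies on. */
interface AttributeSource {
    getAttribute(name: string): string | null;
}

/** The part of a document this sketch relies on; a real `Document` satisfies it. */
interface MetaRoot {
    querySelectorAll(selector: string): ArrayLike<AttributeSource>;
}

/** Collects `<meta property="og:..." content="...">` pairs into a plain object. */
function readOpenGraph(root: MetaRoot): Record<string, string> {
    const result: Record<string, string> = {};
    const metas = Array.from(root.querySelectorAll('meta[property^="og:"]'));
    for (const meta of metas) {
        const property = meta.getAttribute('property');
        const content = meta.getAttribute('content');
        // Skip tags that lack either attribute.
        if (property && content !== null) {
            result[property] = content;
        }
    }
    return result;
}
```

In a translator, something like `readOpenGraph(doc)['og:title']` could then feed `newZoteroItem.title`, provided the page really exposes `og:title`.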


l0o0 commented on September 20, 2024

@Lemmingh I don't quite understand: which translator controls the item that this snapshot uses?

Lemmingh commented on September 20, 2024

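Do you mean this:
function getSnapshot(item: ZhihuItem): HTMLDocument

That is a function I made up myself.

The data format of item is ZhihuItem, which is defined in the block of declarations below it.

The whole prototype is written in TypeScript, simply for convenience.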

l0o0 commented on September 20, 2024

@Lemmingh I see. However, Zotero currently doesn't seem to support custom item types. I'm not yet sure which item type Zhihu content should use. Perhaps just use the webpage type directly.

Lemmingh commented on September 20, 2024

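Have you perhaps not used TypeScript <https://www.typescriptlang.org/docs/handbook/typescript-in-5-minutes.html> before? 😅

TypeScript can be thought of as JavaScript with static checking added; it has to be compiled to JavaScript before it can run.

For Zhihu, the Zotero item type can simply be document:
let newZoteroItem = new Zotero.Item('document');

The type of snapshot is HTMLDocument. The Zotero documentation doesn't say so explicitly, but it can be observed. The types of the other objects can be inferred as well.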

liuxsdev commented on September 20, 2024

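> What I need is to be able to save and download Zhihu column (Zhuanlan) pages and answers. I'd be very grateful if this could be implemented!

I think other tools can be used to save the page, with Zotero responsible only for extracting the metadata; the attachment is then linked to the Zotero item at the end.

Some Chrome extensions can help: 简悦 (SimpRead) has Zhihu support and can save offline Markdown, PDF, HTML and similar formats;
SingleFile saves offline HTML perfectly.

I've recently been managing Bilibili videos this way: download the video manually, then link it to the item.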

l0o0 commented on September 20, 2024

@Lemmingh I followed the API links you gave earlier. In my test, Zhihu content can be saved into a note, and the images, links, and formatting all display well.
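
The thread does not show how the note was built, so the following is only a guess at what such a step might look like: a small function that wraps the HTML returned by the API in a note body with a title and a link back to the source. `buildNoteHtml` and `escapeHtml` are hypothetical helpers, and like the prototype above the sketch assumes the content HTML from Zhihu is safe to embed unchanged.

```typescript
/** Escapes text so it can be placed in HTML element content or a quoted attribute. */
function escapeHtml(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

/**
 * Builds the HTML of a note: a heading, a link to the original page,
 * then the content HTML exactly as it was received.
 */
function buildNoteHtml(title: string, sourceUrl: string, contentHtml: string): string {
    const safeUrl = escapeHtml(sourceUrl);
    return [
        `<h1>${escapeHtml(title)}</h1>`,
        `<p><a href="${safeUrl}">${safeUrl}</a></p>`,
        contentHtml,
    ].join('\n');
}
```

In a translator, the result would typically be attached with something like `newZoteroItem.notes.push({ note: buildNoteHtml(item.title, url, item.content) })` before calling `complete()`.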
