Git Product home page Git Product logo

inner-peace's Introduction

dom-text-matcher

Experiment about searching for text in the DOM, transcending element boundaries

What is this

This is an experiment about how to locate text patterns in a DOM, when the match might span multiple nodes, and we don't know where to look exactly.

The code traverses the (configured part of) the dom, and collects info about which string slice is where. Then it searches for the pattern in the innerText of he (configured part of the) DOM using google-diff-match-patch (see app/lib/fancymatcher/README.txt for more info about this part.)

When a match is found, it maps in back to DOM elements, using the collected information, and it returns info about where to text. (XPath expressions and string indices.)

Optionally, it can also highlight the match in the DOM.

All the DOM analyzing logic is in one CoffeeScript file ( app/lib/domsearcher/domsearcher.coffee )

How to run

  1. run scripts/web-server.js
  2. Go to http://localhost:8000/
  3. Click the buttons, and see what happens.

Or see the live demo.

Unsolved problems:

  • Pattern length

    Unfortunately, there is a limitation about the max length of pattern. The exact value varies by browser and platform, but it's typically 32 or 64. Which means you can't search for patterns longer that that. (The getMaxPatternLength method returns the current limit.)

    This limitation comes from the implementation of the Bitap matching algorithm; will need to look into this later. See thread here.

    Currently, this is worked around by searching for the first and the last 32-bit slice of the pattern, and if both is found, and the distance is about right, then it's accepted as match. What goes between the start and the end slice is not analyzed. This might lead to false matches.

  • Hidden nodes

    When the "display" property of a node is set to "none", it's not displayd, so it's content does not get into it's parent's innerText.

    However, there is no easy way to detect this when one is only looking at the DOM. (Like we do.)

    So, currently we are trying to detect whether a node (and it's children) is hidden by searching for it's innerText in the innerText of the parent. (If there is no match, it means that this is probably hidden.)

    However, this approach can yield "false negatives", if a node is indeed hidden, but the content is the same like that of an other, non-hidden node.

    This would not break the results, but would add false parts to the selection.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.