
unfurl's Introduction

Unfurl Logo

Extract and Visualize Data from URLs using Unfurl

Unfurl takes a URL and expands ("unfurls") it into a directed graph, extracting every bit of information from the URL and exposing the obscured. It does this by breaking up a URL into components, extracting as much information as it can from each piece, and presenting it all visually. This “show your work” approach (along with embedded references and documentation) makes the analysis transparent to the user and helps them learn about (and discover) semantic and syntactical URL structures.

Unfurl has parsers for URLs, search engines, chat applications, social media sites, and more. It also has more generic parsers (timestamps, UUIDs, etc) helpful for exploring new URLs or reverse engineering. It’s also easy to build new parsers, since Unfurl is open source (Python 3) and has an extensible plugin system.

No matter if you extracted a URL from a memory image, carved it from slack space, or pulled it from a browser’s history file, Unfurl can help you get the most out of it.

How to use Unfurl

Online Version

  1. There is an online version at https://dfir.blog/unfurl. Visit that page, enter the URL in the form, and click 'Unfurl!'.
  2. You can also access the online version using a bookmarklet - create a new bookmark and paste javascript:window.location.href='https://dfir.blog/unfurl/?url='+window.location.href; as the location. Then when on any page with an interesting URL, you can click the bookmarklet and see the URL "unfurled".

Local Python Install

  1. Install via pip: pip install dfir-unfurl

After Unfurl is installed, you can use it via the web app or command line:

  1. Run python unfurl_app.py
  2. Browse to localhost:5000/ (editable via config file)
  3. Enter the URL to unfurl in the form, and 'Unfurl!'

OR

  1. Run python unfurl_cli.py https://twitter.com/_RyanBenson/status/1205161015177961473
  2. Output:
[1] https://twitter.com/_RyanBenson/status/1205161015177961473
 ├─(u)─[2] Scheme: https
 ├─(u)─[3] twitter.com
 |  ├─(u)─[5] Domain Name: twitter.com
 |  └─(u)─[6] TLD: com
 └─(u)─[4] /_RyanBenson/status/1205161015177961473
    ├─(u)─[7] 1: _RyanBenson
    ├─(u)─[8] 2: status
    └─(u)─[9] 3: 1205161015177961473
       ├─(❄)─[10] Timestamp: 1576167751484
       |  └─(🕓)─[13] 2019-12-12 16:22:31.484
       ├─(❄)─[11] Machine ID: 334
       └─(❄)─[12] Sequence: 1 
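The status ID in this example is a Twitter snowflake, and the decomposition Unfurl shows can be reproduced with a few bit shifts (a sketch; the constant 1288834974657 is Twitter's custom epoch in milliseconds):

```python
import datetime

TWITTER_EPOCH_MS = 1288834974657  # Twitter's custom epoch (2010-11-04)

def decode_snowflake(snowflake: int):
    """Split a Twitter snowflake into timestamp (ms), machine ID, and sequence."""
    timestamp_ms = (snowflake >> 22) + TWITTER_EPOCH_MS
    machine_id = (snowflake >> 12) & 0x3FF  # 10 bits
    sequence = snowflake & 0xFFF            # 12 bits
    return timestamp_ms, machine_id, sequence

ts, machine, seq = decode_snowflake(1205161015177961473)
print(ts, machine, seq)  # 1576167751484 334 1
print(datetime.datetime.utcfromtimestamp(ts / 1000))  # 2019-12-12 16:22:31.484000
```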

If the URL has special characters (like "&") that your shell might interpret as a command, put the URL in quotes. Example: python unfurl_cli.py "https://www.google.com/search?&ei=yTLGXeyKN_2y0PEP2smVuAg&q=dfir.blog&oq=dfir.blog&ved=0ahUKEwisk-WjmNzlAhV9GTQIHdpkBYcQ4dUDCAg"

unfurl_cli has a number of command line options to modify its behavior:

optional arguments:
  -h, --help            show this help message and exit
  -d, --detailed        show more detailed explanations.
  -f FILTER, --filter FILTER
                        only output lines that match this filter.
  -o OUTPUT, --output OUTPUT
                        file to save output (as CSV) to. if omitted, output is sent to stdout (typically this means displayed in the console).
  -v, -V, --version     show program's version number and exit

Docker

  1. git clone https://github.com/obsidianforensics/unfurl
  2. cd unfurl
  3. docker-compose up -d

Testing

  1. All tests are run automatically on each PR by Travis CI. Tests need to pass before merging.
  2. While not required, it is strongly encouraged to add tests that cover any new features in a PR.
  3. To manually run all tests (units and integration): python -m unittest discover -s unfurl/tests

If using Docker as above, run: docker exec unfurl python -m unittest discover -s unfurl/tests

Legal Bit

This is not an officially supported Google product.

unfurl's People

Contributors

dependabot[bot], jkppr, moshekaplan, nemec, obsidianforensics, olliejc, rafiot, scottwedge, sim4n6, weslambert


unfurl's Issues

Google Search: bih and biw parameters

Great tool thanks Ryan!

Both the bih and biw parameters in Google search currently appear as generic 'URL parsing functions'.

image

However, they appear to be well documented and testable: they equate to the browser window's height and width. This can be checked using something like: https://browsersize.com/

image

Thanks again!

Firebase Push IDs

Description
Firebase's push IDs are chronological, 20-character unique IDs. A push ID contains 120 bits of information. The first 48 bits are a timestamp, which both reduces the chance of collision and allows consecutively created push IDs to sort chronologically. The timestamp is followed by 72 bits of randomness.

Examples

  • "-JhLeOlGIEjaIOFHR0xd"
  • "-JhQ76OEK_848CkIFhAq"
  • "-JhQ7APk0UtyRTFO9-TS"
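The embedded timestamp can be recovered from a push ID's first eight characters using Firebase's URL-safe, lexicographically ordered alphabet. A sketch based on the description above, decoding the first example:

```python
import datetime

# Firebase's push ID alphabet, ordered by ASCII value so IDs sort chronologically
PUSH_ALPHABET = '-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz'

def push_id_timestamp(push_id: str) -> int:
    """Decode the 48-bit millisecond timestamp from a push ID's first 8 chars."""
    timestamp_ms = 0
    for char in push_id[:8]:
        timestamp_ms = timestamp_ms * 64 + PUSH_ALPHABET.index(char)
    return timestamp_ms

ts = push_id_timestamp('-JhLeOlGIEjaIOFHR0xd')
print(ts)  # 1423088131153
print(datetime.datetime.utcfromtimestamp(ts / 1000))  # 2015-02-04 22:15:31.153000
```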

References

Parsing misclassification if the password begins with ? or / or #

These URLs cause a misclassification:

The misclassification appears if the password begins with a delimiter defined in RFC3986, section 3.2:

The authority component is preceded by a double slash ("//") and is
terminated by the next slash ("/"), question mark ("?"), or number
sign ("#") character, or by the end of the URI.

Docker Support

I've created a Dockerfile and docker-compose file for standing up unfurl in a Docker container. However, it appears that even though we can change the host/port for unfurl in config.ini, if you are running via Docker etc, the fetch request here refers to the explicit host/port as defined in config.ini.

This becomes an issue when you are trying to expose unfurl via 0.0.0.0 in the Docker container and access it in the browser via the host IP address (something like 192.168.1.199), because unfurl tries to fetch 0.0.0.0:5000 instead of the host IP.

I've got a PR ready (from here) that replaces unfurl_host:unfurl_port with:

window.location.host (host + port of current request)

and provides the necessary files for running unfurl in a Docker container

See here for more details on the host/port change

This seems to work okay from my testing.

Any thoughts on potential issues with this, or how this could be done better to support Docker?

If this all sounds okay, I will go ahead and submit a PR -- just let me know!

Parse YouTube URLs

Description
There is some good data encoded in YouTube URLs. At a minimum, some links point to a particular time in a video.

I'm happy to help/coach/answer questions if anyone wants to work on this issue!

Examples

References

  • The t param looks to be the number of seconds into the video
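If t is plain seconds, extracting and rendering it is straightforward with the stdlib (a sketch; the video ID and offset are made-up examples):

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=212'  # hypothetical example
qs = parse_qs(urlsplit(url).query)

seconds = int(qs['t'][0])
minutes, secs = divmod(seconds, 60)
print(f'Video starts at {minutes}m{secs}s')  # Video starts at 3m32s
```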

pytest or unittest for testing ?

Description
I would like to create a testing mechanism. I suggest using either pytest or unittest.

Before I dive into working on the PR, could you please choose the one you recommend?

Add logging

Right now, errors/warnings/etc are printed. Switch to using logging.

Add ability for node to have multiple parents

Some nodes may combine to give more info, and thus it would be nice to reference both "parent" nodes from a child node. One example is the ei Google param; one node has full seconds and another has fractional seconds. Combining both these nodes would then yield the complete timestamp.

parsing suggestion: recognize and decode 12-byte Mongo object ID timestamp (first four bytes)

I just read https://techkranti.com/idor-through-mongodb-object-ids-prediction/, and it brought to mind a case I was working last week. Not positive it was actually this, but I'm going to check when I get to work. In any case, it would be useful for unfurl to automatically recognize Mongo object IDs that appear in URLs, and decode the embedded timestamp. (If it already does, I apologize. I wasn't able to find an example ready-to-hand, with which to test, and I don't see any references to Mongo object IDs in the documentation.)
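For reference, the timestamp lives in the first four bytes of the 12-byte ObjectId, stored as big-endian Unix seconds. A sketch with a made-up ObjectId (only the first 8 hex characters matter for the timestamp):

```python
import datetime

def objectid_timestamp(object_id: str) -> datetime.datetime:
    """Extract the creation time from a 12-byte (24 hex chars) Mongo ObjectId."""
    seconds = int.from_bytes(bytes.fromhex(object_id[:8]), 'big')
    return datetime.datetime.utcfromtimestamp(seconds)

# Hypothetical ObjectId, not from a real database
print(objectid_timestamp('5f2459ac8e1b2c3d4e5f6a7b'))  # 2020-07-31 17:49:32
```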

Fragments in URLs not working

I noticed that after cloning the repository from master and running with python3 unfurl_app.py, fragments in URLs do not appear in the unfurled graph. This issue is also present on your demo website when deep-linking to the URL, but not when I submit a URL via the form. I'm guessing it's because fragments aren't sent to the server when GETing a URL. Maybe it would be better to base64 the URL in the deep link or something to prevent the browser from misinterpreting the query.

Doesn't work:
http://127.0.0.1:5000/https://www.facebook.com/photo.php?type=3#hello
https://dfir.blog/unfurl/?url=https://www.facebook.com/photo.php?type=3#hello

Works:
Go to: https://dfir.blog/unfurl/
Submit query: https://www.facebook.com/photo.php?type=3#hello

Add support for clicked Yahoo URLs

Description
Parse out a clicked URL from a Yahoo search:

For example, a URL like the following:

https://r.search.yahoo.com/_ylt=AwrBT8OVIB9aqsoASDZXNyoA;_ylu=X3oDMTByOHZyb21tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1512018198/RO=10/RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html/RK=2/RS=vQW48_o6zXyIDewim5cXq8Np1zo-


from urllib.parse import unquote

RU = [e for e in path.split("/") if e.startswith("RU=")]
# RU = ['RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html']
link_target = unquote(RU[0][3:])
# link_target = 'https://developer.yahoo.com/cocktails/mojito/docs/code_exs/query_params.html'

Add parsing IP addresses (in all their variants)

Description
IP addresses can be represented in different ways. Add support in Unfurl to parse all these variants of IP addresses:

From https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/evasive-urls-in-spam/:
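The stdlib ipaddress module already normalizes the plain-integer and hex forms, so a parser could start there (a sketch; dotted-octal forms like 0177.0.0.01 would need per-octet handling on top of this):

```python
import ipaddress

# Both of these spell 127.0.0.1; spam URLs use such forms to evade filters
for raw in ('2130706433', '0x7f000001'):
    as_int = int(raw, 0)  # base 0 auto-detects the 0x prefix
    print(raw, '->', ipaddress.ip_address(as_int))
# 2130706433 -> 127.0.0.1
# 0x7f000001 -> 127.0.0.1
```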

References

Expand Bing parser

Parse GitHub URLs

Description
GitHub URLs have a lot of info in them describing what is being viewed.

Examples

References

  • The above examples are just from me looking at the URLs; some actual references for this would be great!

Identify and parse (some) URLs generated by Metasploit

Description
Metasploit / CS use some generated URLs that can be decoded. Didier Stevens did a write-up & built a tool to decode them (https://isc.sans.edu/diary/27204). That research is a great base for an Unfurl parser.

Examples

  • hxxp://127.0.0.1:8080/4PGoVGYmx8l6F3sVI4Rc8g1wms758YNVXPczHlPobpJENARSuSHb57lFKNndzVSpivRDSi5VH2U-w-pEq_CroLcB--cNbYRroyFuaAgCyMCJDpWbws/
  • hxxp://12.34.56.78/WjSH
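Per Didier Stevens' write-up, the short URIs are chosen so that the sum of their ASCII byte values mod 256 ("checksum8") hits one of a few magic values. A sketch checking the second example above (92 is reportedly the value Metasploit uses for Windows payloads):

```python
def checksum8(uri_path: str) -> int:
    """Sum of the ASCII values of the URI path characters, mod 256."""
    return sum(ord(char) for char in uri_path) % 256

print(checksum8('WjSH'))  # 92
```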

References

Add more encoding types

Description
Unfurl currently supports basic url-safe b64 decoding, and only if the results are ASCII. This should be expanded to include more encoding types and chains.

Currently supported:

  • url-safe b64 -> ASCII

Encodings to add:

  • url-safe b64 -> gzip -> ASCII
  • url-safe b64 -> protobuf
  • b32 -> ASCII
  • More ...
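A sketch of the first proposed chain (url-safe b64 -> deflate -> ASCII), shown as a round trip with zlib since no real samples are listed yet:

```python
import base64
import zlib

original = b'https://example.com/some/hidden/path'  # made-up payload

# Encode: deflate, then URL-safe base64 (as a URL parameter might carry it)
encoded = base64.urlsafe_b64encode(zlib.compress(original)).decode('ascii')

# The decode chain Unfurl would need: url-safe b64 -> inflate -> ASCII
decoded = zlib.decompress(base64.urlsafe_b64decode(encoded)).decode('ascii')
print(decoded)  # https://example.com/some/hidden/path
```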

Examples

  • TBD

References

parse_url parser does not handle query parameters with no value

Some websites add keys to the URL query string that have no value, but still affect the way the page is displayed. One (trimmed down) example is the following Facebook URL:

https://www.facebook.com/photo.php?type=3&theater

In this case, "theater" sits on its own and indicates that the photo should be opened in a lightbox. Unfortunately, the parameter is missing entirely from the tree after parsing, as you can see below:
missing theater query parameter

This is a pretty easy fix. Modify line 66 of parse_url.py as below:

-        parsed_qs = urllib.parse.parse_qs(node.value)
+        parsed_qs = urllib.parse.parse_qs(node.value, keep_blank_values=True)

fixed theater query parameter

I could submit a pull request to fix this now, however, on lines 94 and 106 I see regexes for parsing similar forms (a=b|c=d|e=f and a=b&c=d&e=f). I'd like to make sure this issue is fixed in those as well, but I haven't yet figured out how to build a test case/example to cover them. Any ideas?
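The default-versus-keep_blank_values behavior is easy to confirm in isolation:

```python
from urllib.parse import parse_qs

query = 'type=3&theater'
print(parse_qs(query))                          # {'type': ['3']} -- 'theater' is dropped
print(parse_qs(query, keep_blank_values=True))  # {'type': ['3'], 'theater': ['']}
```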

Add detection/flagging for possible homoglyph/graph attacks

Description
Unfurl already decodes punycoded URLs (xn--...), but not hxxp://аbc.com (where the 'а' is Cyrillic). It does in fact handle these (it shows the domain as %-encoded, then translates it back to the original homoglyph domain), but it would be good to add some flagging to the %-encoded node.

image
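One simple flagging heuristic (a sketch, not Unfurl's implementation): highlight any domain containing non-ASCII characters, and show its punycoded form alongside:

```python
def looks_like_homoglyph_candidate(domain: str) -> bool:
    """Flag domains with non-ASCII characters (candidates for homoglyph abuse)."""
    return any(ord(char) > 127 for char in domain)

mixed = '\u0430bc.com'  # 'аbc.com' with a Cyrillic 'а' (U+0430)
print(looks_like_homoglyph_candidate(mixed))      # True
print(looks_like_homoglyph_candidate('abc.com'))  # False
print(mixed.encode('idna'))  # punycoded form; starts with b'xn--'
```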

Suggestion: Docker from local code rather than git

After raising #75 and trying to develop a new parser I realised that the code pulls from GitHub rather than building from local files.

A suggestion would be to use the local files rather than git, for example (as well as use a python base image):

FROM python:3-alpine
COPY requirements.txt /unfurl/requirements.txt

RUN cd /unfurl && \
    pip3 install -r requirements.txt

COPY . /unfurl
RUN sed -i 's/^host.*/host = 0.0.0.0/' /unfurl/unfurl.ini

WORKDIR /unfurl
ENTRYPOINT ["python3", "unfurl_app.py"]

The reason for the two COPYs is so that between changes, builds are cached up to the second COPY and then those changes built rather than all the dependencies every time.

I figured I'd raise an issue rather than another PR to get some comments. It would be good to get @weslambert's thoughts.

Expand Short URLs

Description
Consider implementing expansion of short URLs.
Pro: Lots of URL shortening services can hide interesting links.
Con: We'd need to reach out to 3rd party sites to do this resolution (the data is not embedded in the URL).

Add support for mailto URLs

Description
Unfurl should parse mailto URLs, which behave very similarly to http URLs but lack the regular-looking scheme separator (mailto: rather than http://, s3://, etc.). First, take the data before the ? (if present in the string) as the to address(es); anything after the ? can be treated like url.query.
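The stdlib's urlsplit already performs the split described above, so a parser could be a thin wrapper (a sketch; the addresses are made up):

```python
from urllib.parse import urlsplit, parse_qs

url = 'mailto:alice@example.com?subject=Hello&cc=bob@example.com'  # hypothetical
parts = urlsplit(url)

recipients = parts.path          # everything before the '?'
headers = parse_qs(parts.query)  # everything after it, parsed like url.query
print(recipients)  # alice@example.com
print(headers)     # {'subject': ['Hello'], 'cc': ['bob@example.com']}
```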

Examples

References

The explanation for the date/time parameters are wrong

In the mouseover for the ei parameter, it states that ei is believed to correspond to when the search took place. This is not correct; I have seen ei parameters hours apart from the actual search.

What I have gathered is that ei is the session start (when the tab was opened) or the last search. If there is only one search immediately after the tab was opened, it will roughly correspond to the search time, but that is a coincidence.

An example can be seen here:

https://www.google.com/search?sxsrf=ALeKk01OOv3qyJF_Sb7_1WIuBa-cNGDDNg%3A1587403446547&ei=ttqdXsP7IMKZk74Pgv-k6AY&q=third+search+at17%3A28&oq=third+search+at17%3A28&gs_lcp=CgZwc3ktYWIQA1DbihFYqeQRYPnmEWgAcAB4AIABfYgBthOSAQQyNC41mAEAoAEBqgEHZ3dzLXdpeg&sclient=psy-ab&ved=0ahUKEwjDrq_UwvfoAhXCzMQBHYI_CW0Q4dUDCAw&uact=5

I opened a tab and searched for 'First search at 17:22' (At 17:22 naturally).
I then searched for 'Second search at 17:24'
Finally waited and searched for 'third search at 17:28'

So there is no indication of the actual search time.
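For context, the first four bytes of the URL-safe-base64-decoded ei value are reported to carry a little-endian Unix timestamp. Decoding the ei from the example URL above:

```python
import base64
import datetime
import struct

ei = 'ttqdXsP7IMKZk74Pgv-k6AY'
raw = base64.urlsafe_b64decode(ei + '=' * (-len(ei) % 4))  # restore b64 padding
timestamp = struct.unpack('<I', raw[:4])[0]                # little-endian uint32
print(timestamp)  # 1587403446
print(datetime.datetime.utcfromtimestamp(timestamp))  # 2020-04-20 17:24:06
```

That decoded moment (also embedded in the URL's sxsrf parameter as 1587403446547) corresponds to the second search, not the third one at 17:28 in whose URL it appears, which is consistent with the "last search / session" interpretation above.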

Proofpoint URL decoding

Add support for Yahoo Search subdomains

Description
Yahoo uses the following subdomains for different types of search:

  • Web Search = "search.yahoo.com"
  • Image Search = "images.search.yahoo.com"
  • Video Search = "video.search.yahoo.com"
  • News Search = "news.search.yahoo.com"

What is the best way to add these in at the top-layer?

Add domain reputation checks

Description
Add a lookup "parser" that takes a domain name in and shows some basic reputation things; for example popularity (Alexa 1M?) or if it has been flagged as bad (SafeBrowsing?).

Examples

  • Show google.com as popular/common and not malicious.
  • Show goooogle.ru as not popular/rare and malicious (made up example).

References

  • TBD

Make unfurl application port configurable

Because the application port is referenced in multiple places, and many folks run multiple services on certain hosts (and the typical default Flask port is 5000 😃 ) the application port should be easily modifiable (pull value from config file, etc). I can assist with a PR when I get a chance.

Investigate Google gs_lcp parameter
