Comments (8)
Hi @noraj ,
Thanks for reporting the issue. Can you please check if this PR fixes it?
Photon should now store the redirecting URLs in redirects.txt in the following format:
https://example.com/redirect_from==>https://example.com/redirect_to
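For anyone post-processing that file, here is a minimal sketch of writing and reading lines in this format. The helper names format_redirect and parse_redirect are hypothetical and not part of Photon; only the ==> separator comes from the format above.

```python
SEPARATOR = "==>"  # separator taken from the redirects.txt format shown above


def format_redirect(source, target):
    """Build one redirects.txt line (hypothetical helper, not Photon's code)."""
    return f"{source}{SEPARATOR}{target}"


def parse_redirect(line):
    """Split one redirects.txt line back into (source, target)."""
    source, sep, target = line.strip().partition(SEPARATOR)
    if not sep:
        raise ValueError(f"not a redirect line: {line!r}")
    return source, target


line = format_redirect("https://example.com/redirect_from",
                       "https://example.com/redirect_to")
print(line)
print(parse_redirect(line))
```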
from photon.
@noraj ???
@s0md3v Yes, I'm answering. I'm just writing a long post and I need to check what I say before asserting it.
I cloned a fresh copy with git, ran git checkout redirect, then ran python photon.py --url http://x.X.x.x/ --level 1 --only-url
but I get the exact same result as before, with no https://example.com/redirect_from==>https://example.com/redirect_to entries.
I think this is because when http://x.X.x.x/ is hit the status code is 200, and with --level 1 the other links are scraped but not requested, so we never enter the if code[0] == '3': statement.
Lines 219 to 222 in 0a5de25
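To illustrate the reasoning above, here is a toy model of a level-limited crawler. It is an assumption-laden sketch, not Photon's real code: FAKE_SITE, crawl and the URLs are invented for the example. It only shows why a redirect check that runs on requested pages never fires for links that were merely collected in the last round.

```python
# Toy model only: a fake site and a simplified crawler, NOT Photon's implementation.
# Each entry maps a URL to (status code, Location header, links found in the body).
FAKE_SITE = {
    "/": (200, None, ["/a", "/b"]),                  # root page linking to two internal URLs
    "/a": (303, "https://elsewhere.example/x", []),  # internal URL that redirects
    "/b": (200, None, []),
}


def crawl(root, level):
    """Request pages for `level` rounds; links found in the last round are only collected."""
    to_request = [root]
    seen = {root}
    requested, redirects = [], {}
    for _ in range(level):
        next_round = []
        for url in to_request:
            status, location, links = FAKE_SITE[url]
            requested.append(url)
            if str(status)[0] == "3":  # only reached for URLs that were actually requested
                redirects[url] = location
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_round.append(link)
        to_request = next_round
    return requested, redirects, sorted(seen)


print(crawl("/", 1))  # "/a" is collected but never requested, so no redirect is recorded
print(crawl("/", 2))  # "/a" is requested in the second round and its redirect is seen
```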
So we are forced to use python photon.py --url http://x.X.x.x/ --level 2 --only-url
but then, instead of the 103 internal URLs from the root page, I get more than 700 URLs from all the sub-pages, and the scan takes much longer (103 remote pages requested instead of just one).
That is why I talked about a redirect switch option that would request the collected internal URLs to see whether they answer with a page or a redirection and, if it is a redirection, store its target.
So what I mean is: keep the current behavior and add a new option --whatevername
that treats scraped internal URLs as potential redirections and requests them, storing the redirection target in addition to the raw internal URL.
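A rough sketch of what such an option could look like follows. Everything here is hypothetical: the parser is not Photon's CLI, --whatevername is the placeholder name from this comment, and probe stands for any callable that requests a URL without following redirections and returns (status_code, location).

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration; the defaults are arbitrary.
    parser = argparse.ArgumentParser(description="sketch, not Photon's CLI")
    parser.add_argument("--url", required=True)
    parser.add_argument("--level", type=int, default=1)
    # Placeholder switch name from the comment above.
    parser.add_argument("--whatevername", action="store_true",
                        help="request scraped internal URLs to detect redirections")
    return parser


def check_redirections(internal_urls, probe):
    """Return {url: target} for every URL whose probe reports a 3xx status."""
    redirects = {}
    for url in internal_urls:
        status, location = probe(url)
        if 300 <= status < 400 and location:
            redirects[url] = location
    return redirects


def fake_probe(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    if url.endswith("/redirect_from"):
        return 303, "https://example.com/redirect_to"
    return 200, None


args = build_parser().parse_args(
    ["--url", "http://x.x.x.x/", "--level", "1", "--whatevername"])
scraped = ["http://x.x.x.x/page", "http://x.x.x.x/redirect_from"]
redirects = check_redirections(scraped, fake_probe) if args.whatevername else {}
print(scraped)    # the raw internal URLs are kept as before
print(redirects)  # plus the redirection targets found among them
```

The level stays at 1 in this sketch, so only the URLs scraped from the root page are probed, which is the cost trade-off described above.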
Also, I got about 30 URLs (using level 2) in failed.txt
but all of them are valid. Example:
$ curl -vvv http://x.x.x.x/\?s\=_____ba8da76e357a______
* Trying x.x.x.x...
* TCP_NODELAY set
* Connected to x.x.x.x (x.x.x.x) port 80 (#0)
> GET /?s=_____ba8da76e357a______ HTTP/1.1
> Host: x.x.x.x
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 303 See Other
< Date: Tue, 23 Oct 2018 18:47:37 GMT
< Server: localhost
< Content-Type: text/html
< Location: https://googleprojectzero.blogspot.com/xxxxxxxxxxx.html
< Content-Length: 0
<
* Connection #0 to host x.x.x.x left intact
So I don't know why they are marked as failed.
But even with level 2, no redirection values are stored. I even checked with grep -ri '==>' ./
PS: maybe check that the Python requests library handles 303 redirects.
Hi @noraj ,
This is to let you know that the issue has been acknowledged and I am working on it.
I will add a new switch, --verify, which will solve the redirection and 404 issues by verifying all the URLs added at each level before crawling further.
Thanks for the detailed explanation of the issue; it really helped.
PS: Would it be possible for you to provide the website you are testing against?
You can DM me on Twitter.
I guess adding the parameter allow_redirects=False to L239 and doing a relevant check will fix this.
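As a sketch of that idea, assuming the request is made with requests.get (the function name probe_redirect and the get parameter are hypothetical, added so the example can run with a stub instead of the network):

```python
def probe_redirect(url, get=None, timeout=10):
    """Request `url` once without following redirects; return (status, location)."""
    if get is None:
        import requests  # third-party; imported lazily so a stub can be used instead
        get = requests.get
    response = get(url, allow_redirects=False, timeout=timeout)
    return response.status_code, response.headers.get("Location")


class StubResponse:
    # Mimics the two attributes of a requests response used above.
    status_code = 303
    headers = {"Location": "https://example.com/redirect_to"}


def stub_get(url, allow_redirects, timeout):
    # Offline stand-in for requests.get, used only to demonstrate the call.
    return StubResponse()


status, location = probe_redirect("https://example.com/redirect_from", get=stub_get)
if str(status)[0] == "3":
    print(f"https://example.com/redirect_from==>{location}")
```

Recording the Location this way does not prevent following it afterwards: the target can still be requested in a second step.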
@0xInfection We want to follow redirects.
Don't worry guys, I will fix it once I have free time.