Comments (13)

pmeenan commented on June 23, 2024

I'm not seeing any duplicates on the main server the dumps are created on either:

mysql> SELECT url, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY url HAVING n > 1 ORDER BY n DESC;
Empty set (35.76 sec)

587 is the crawl ID for the 2019-04-01 desktop crawl.

Any chance you're not including the protocol (http:// or https://) part of the URL when collecting the site stats? It is expected that there will be both for sites that have traffic on both since they are different origins.
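
As a quick check, something like this would show how many tests ran under each scheme (a sketch against the same pages table, not verified against the schema):

-- Sketch: SUBSTRING_INDEX(url, ':', 1) returns everything before the
-- first ':', i.e. 'http' or 'https'.
SELECT SUBSTRING_INDEX(url, ':', 1) AS scheme, COUNT(0) AS n
FROM pages
WHERE crawlid = 587
GROUP BY scheme;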

Looking at domains that have both http:// and https:// results:

mysql> SELECT LEFT(RIGHT(`url`, LENGTH(`url`) - (POSITION('//' IN `url`) + 1)),
    ->             POSITION('/' IN RIGHT(`url`, LENGTH(`url`) - (POSITION('//' IN `url`) + 1))) - 1) AS domain,
    ->        COUNT(0) AS n
    -> FROM pages
    -> WHERE crawlid = 587
    -> GROUP BY domain
    -> HAVING n > 1
    -> ORDER BY n DESC;

113500 rows in set (40.32 sec)

That looks pretty close to half of your number, so assuming you are counting both of them, my guess is that you are calculating stats at the domain level, not the origin level; we test both the http and https versions of an origin if CrUX shows that both had traffic during the previous month.
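
For reference, the same host-level comparison reads a little more easily with SUBSTRING_INDEX (a sketch, not verified against the schema):

-- Sketch: SUBSTRING_INDEX(url, '/', 3) keeps everything up to the
-- third '/', i.e. the full origin; stripping the '//' prefix from
-- that leaves the bare host.
SELECT
  SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '//', -1) AS host,
  COUNT(0) AS n
FROM pages
WHERE crawlid = 587
GROUP BY host
HAVING n > 1
ORDER BY n DESC;
-- Grouping by SUBSTRING_INDEX(url, '/', 3) instead, i.e. by origin,
-- should return an empty set, matching the first query above.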


pmeenan commented on June 23, 2024

If both protocols are in CrUX, that means both had a meaningful amount of traffic during the month. If the site were redirecting from http to https, CrUX would only report the https origin. The content can be completely different if, for example, a site is migrating to a new system and deploying https as part of the migration, and the performance characteristics are likely to be very different.

If you are going to de-dupe them into a single entry, I'd recommend favoring the https:// variant when there are duplicates. For the HTTPArchive it makes sense to just collect both of them so we have a clean and complete dataset that matches CrUX.
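
As a sketch of that de-dupe (window functions work in Postgres, BigQuery, and MySQL 8.0+; the host column is assumed to be derived from url during import):

SELECT url
FROM (
  SELECT
    url,
    ROW_NUMBER() OVER (
      PARTITION BY host
      -- https:// rows sort first, so they win when both variants exist
      ORDER BY CASE WHEN url LIKE 'https://%' THEN 0 ELSE 1 END
    ) AS rn
  FROM pages
  WHERE crawlid = 587
) ranked
WHERE rn = 1;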


rviscomi commented on June 23, 2024

Yes, when the crawl started the mobile URL list had not been updated correctly, so we updated it on the fly and restarted the crawl. About 40k tests had already been started, and those correspond to the aborted crawl.


Themanwithoutaplan commented on June 23, 2024

Thanks for the explanation, but in the interests of data consistency those tests should either be removed from the dump or be given the relevant crawl ID.


rviscomi commented on June 23, 2024

Are you able to omit 581 from your end of the pipeline? We're not maintaining any of the legacy systems beyond recovering from data loss or breakages, and this doesn't seem to affect anything critical. I'm not aware of anyone else who depends on the raw legacy results, so if you could work around it for now we should be ok.
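
For example, something along these lines when loading the dump would do it (the staging table name is illustrative):

-- Sketch: skip the aborted crawl's rows at import time.
INSERT INTO pages
SELECT * FROM pages_staging
WHERE crawlid <> 581;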


Themanwithoutaplan commented on June 23, 2024

I can work around this one fairly easily because it raises an exception directly when I import the data. A bigger pain is duplicate tests within the same crawl, because those don't show up until I create some reports. The right constraints on the database would stop this from happening, but I can appreciate you not wanting to touch the schema at this stage, particularly as fixing the duplicate tests requires window functions, which I'm not sure MySQL supports.
It would be great if you could give a heads-up at the end of any crawl if any such anomalies are expected.
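
(For reference, MySQL has supported window functions since 8.0. The constraint in question would be along these lines; a sketch, with an illustrative name:)

-- Sketch: reject duplicate tests at insert time.
ALTER TABLE pages
  ADD CONSTRAINT uq_pages_crawl_url UNIQUE (crawlid, url);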


rviscomi commented on June 23, 2024

Will do! Thanks for your understanding.


Themanwithoutaplan commented on June 23, 2024

FWIW, I'm still seeing around 120,000 duplicates in every run. I haven't done any further analysis, but I think you should look into what's causing this.


Themanwithoutaplan commented on June 23, 2024

In the desktop crawl for 2019-04-01 there are around 227,000 sites with duplicate tests, so we're getting close to 5%. This will start to affect any derived statistics and is also a waste of resources. Do we have any idea what's causing this? Are some crawls being allocated twice?


rviscomi commented on June 23, 2024

Could you share the query you're running to get that count?

summary_pages on BigQuery is created from the MySQL-based CSV dumps and it's not showing any duplicates:

SELECT
  url,
  COUNT(0) AS n
FROM
  `httparchive.summary_pages.2019_04_01_desktop`
GROUP BY
  url
HAVING
  n > 1
ORDER BY
  n DESC

If that's true, I agree it's worth investigating, at least for resource conservation.


Themanwithoutaplan commented on June 23, 2024

My query is always slightly different due to the way I import data into Postgres, but I can provide a list of what I think are duplicate sites.


Themanwithoutaplan commented on June 23, 2024

Pat, I think you've identified the problem: I do normalise on the domain, which is why these count as duplicates. I'm not sure there is any point in keeping both entries just because both protocols are in the CrUX dataset: is it right to consider these distinct websites? But the important thing is that we've identified the source of the anomaly, and I can adjust my import script.
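
For instance, something like this in the import script would preserve the origin rather than the bare domain (Postgres; the regex and staging table name are illustrative):

-- Sketch: substring(... FROM regex) returns the matched text,
-- here the scheme plus host.
SELECT substring(url FROM '^https?://[^/]+') AS origin, url
FROM pages_import;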


Themanwithoutaplan commented on June 23, 2024

I appreciate exactly what you're saying about the protocols, but a cursory check suggests that these are duplicates and that the websites are just not configured to redirect. At some point, as we move towards http/2, the issue may resolve itself; in the meantime, I guess it's an interesting effect in itself.

For my purposes I'm doing just what you suggest and am keeping only the https variants.
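
In Postgres that comes down to something like this (a sketch; host is assumed to be derived from url during import, and the table name is illustrative):

-- Sketch: DISTINCT ON keeps the first row per host, and the ORDER BY
-- sorts https:// rows first, so they win when both variants exist.
SELECT DISTINCT ON (host) *
FROM pages_import
ORDER BY host, (url LIKE 'https://%') DESC;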

