Randomuseragent

CRAN_Status_Badge CRAN_Downloads R-CMD-check License: MIT

The goal of Randomuseragent is to provide easy access to different user-agent strings by randomly sampling from a pool of real strings.

Installation

You can install the released version of Randomuseragent from CRAN with:

install.packages("Randomuseragent")

The development version can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("fangzhou-xie/Randomuseragent")

Example

This is a basic example to get random user-agent strings:

library(Randomuseragent)

random_useragent()
> [1] "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0"

filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9"   
> [2] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.59.10 (KHTML, like Gecko) Version/5.1.9 Safari/534.59.10"
> [3] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2"  
> [4] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/E7FBAF"   
> [5] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17" 
> [6] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/6.2.8 Safari/537.85.17"  
> [7] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2 (KHTML, like Gecko) Version/5.0.6 Safari/533.22.3"

Both functions accept the same set of arguments for filtering user-agent strings. Please refer to the documentation of either function for details.

Advanced Example

Although calling random_useragent() is very convenient, it may not be the best approach if you care about performance. random_useragent() essentially wraps filter_useragent() and returns one random string from the resulting pool.

# call directly
random_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/6.2.8 Safari/537.85.17"

However, if you need to generate lots of them, i.e. call random_useragent() repeatedly, every call first filters all the strings that this package provides and then randomly draws one from the pool. Hence, the subsetting is repeated on each call, which is inefficient.

A better way is to get the string pool directly from filter_useragent() and then sample from it yourself.

# first filter
uas <- filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
# then sample manually
sample(uas, 1)
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17"
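
Once the pool is in hand, many strings can be drawn in a single vectorized call instead of one at a time. The sketch below uses only base R; the `uas` vector is a small hard-coded stand-in (three strings copied from the output above) for what filter_useragent() would return, and the batch size of 10 is an arbitrary choice.

```r
# Stand-in pool: in practice this would be the result of filter_useragent().
uas <- c(
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.59.10 (KHTML, like Gecko) Version/5.1.9 Safari/534.59.10",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2"
)

# Draw 10 strings at once. replace = TRUE is needed here because we ask
# for more draws than there are strings in the pool.
batch <- sample(uas, size = 10, replace = TRUE)
length(batch)
```

Sampling with replacement means the same string can appear more than once in `batch`, which is usually what you want when rotating user agents across many requests.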

To see the difference, we can time the following code chunks.

# first, call random_useragent() directly
system.time(lapply(1:5000, function(x){random_useragent()}))
>    user  system elapsed 
>   1.922   0.015   1.944
# second, build the character vector once and sample from it manually
system.time({
  ua <- filter_useragent(min_obs = 5000, software_type = "browser", operating_system_name = "Windows")
  lapply(1:5000, function(x) {sample(ua, 1)})
})
>    user  system elapsed 
>   0.023   0.000   0.023

We run each method 5000 times to make a fair comparison between methods. With the timings above, the second method is roughly 85 times faster than the first one. That said, the first method spends only about 0.39 ms per call on average (1.944 s / 5000), which is already pretty fast. The second method needs about 4.6 µs per call (0.023 s / 5000). This is certainly faster, but for most use cases, I don’t think it is worth going this far.
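
The per-call figures can be worked out directly from the elapsed times printed by system.time() above. This is just the arithmetic on those two numbers; the variable names are made up for illustration, and timings on your machine will differ.

```r
n_calls <- 5000
elapsed_direct <- 1.944  # seconds: random_useragent() on every call
elapsed_manual <- 0.023  # seconds: filter once, then sample() each time

per_call_direct_ms <- elapsed_direct / n_calls * 1e3  # milliseconds per call
per_call_manual_us <- elapsed_manual / n_calls * 1e6  # microseconds per call
speedup <- elapsed_direct / elapsed_manual            # ratio of elapsed times

round(c(direct_ms = per_call_direct_ms,
        manual_us = per_call_manual_us,
        speedup   = speedup), 2)
```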

Optional Parameters

You can type ?random_useragent to see the documentation for the parameters.

  1. min_obs: integer, minimum number of times a string must have been observed in the dataset. This keeps the most frequently used UAs and removes the less frequently used ones. A larger value returns fewer strings, and hence a smaller pool to sample from.
  2. software_name: character vector, name of the software. For example, you can use only software_name = "Chrome", or several names together, such as software_name = c("Safari", "Edge").
  3. software_type: character vector, one or more of "browser", "bot", "application". For web scraping, you would most likely choose software_type = "browser" to mimic real browser behavior.
  4. operating_system_name: character vector, name of the operating system. For example, use one or more of "Windows", "Linux", "Mac OS X", "macOS", etc.
  5. layout_engine_name: character vector, e.g. "Gecko", "Blink", etc.
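
The parameters above can be combined in one call. The following is a minimal sketch, not the package's recommended pattern: the argument names and values come from the list above, but the particular combination is arbitrary, and the empty-pool guard assumes that filters matching nothing return a zero-length vector rather than raising an error.

```r
library(Randomuseragent)

# Combine several filters; each argument name is taken from the list above.
pool <- filter_useragent(
  min_obs = 5000,
  software_name = c("Chrome", "Edge"),
  software_type = "browser",
  operating_system_name = "Windows"
)

# Guard against filters that are too strict (assumption: no match gives
# a zero-length vector). sample() would fail on an empty pool.
if (length(pool) > 0) {
  ua <- sample(pool, 1)
} else {
  ua <- NA_character_
}
ua
```

If `ua` comes back as NA, loosening one of the filters, for example lowering min_obs, should enlarge the pool.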
