openscraping-lib-csharp's Issues

Missing extra spaces in RemoveExtraWhitespaceTransformation example

In the documentation for Transformations, the RemoveExtraWhitespaceTransformation example doesn't display the extra spaces in the first occurrence of "hello world". I think you'll need to use non-breaking spaces there or, if that doesn't work, something else that has the same effect.

It looks like this:

Replaces consecutive spaces with a single space. For the string "hello world" it would return "hello world".

But it should look like this:

Replaces consecutive spaces with a single space. For the string "hello     world" it would return "hello world".
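
For reference, the behavior that sentence describes can be sketched in a few lines. This is only an illustration of "replace consecutive spaces with a single space"; `CollapseSpaces` is a hypothetical helper and is not taken from the library's source.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not the library's implementation):
// replaces every run of two or more spaces with a single space.
static string CollapseSpaces(string input) => Regex.Replace(input, " {2,}", " ");

Console.WriteLine(CollapseSpaces("hello     world")); // prints "hello world"
```

A browser collapses whitespace in rendered HTML in much the same way, which is presumably why the documented input loses its extra spaces unless they are written as non-breaking spaces or placed in preformatted text.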

Get meta data?

I need to get the content of a meta tag. How is this possible?
<meta content="Tim Fischer" name="author">

Thanks
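
In plain XPath terms, the value asked for is the `content` attribute of the `meta` element whose `name` is `author`. The sketch below shows that selection with the .NET base class library only. Whether the same expression works inside an OpenScraping config is not confirmed here, and the tag is self-closed in this example (an assumption) so that it parses as XML.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

// Assumption: the tag is self-closed so the snippet is well-formed XML.
// Real-world HTML would need an HTML parser instead of XDocument.
var doc = XDocument.Parse(
    "<html><head><meta content=\"Tim Fischer\" name=\"author\" /></head><body /></html>");

// Select the meta element by its name attribute, then read its content attribute.
var author = doc.XPathSelectElement("//meta[@name='author']")?.Attribute("content")?.Value;

Console.WriteLine(author); // prints "Tim Fischer"
```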

Using "_xpath" on table with one row does not create a JSON array

The _xpath feature seems to behave differently if there is only one result from the _xpath. It will generate:

"rows": { "col1": "val1", "col2": "val2" }

Instead of:

"rows": [{ "col1": "val1", "col2": "val2" }]

This makes it difficult to iterate over the array of rows, as you end up iterating over the columns. Thoughts? Is there a way to force it to be an array regardless of there being a single row? Thank you.
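
Until the library side is settled, one possible consumer-side workaround is to normalize the value before iterating. The sketch below uses `System.Text.Json.Nodes` purely for illustration; `AsRows` is a hypothetical helper, and the JSON strings are the two shapes quoted above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Hypothetical helper: always yields a sequence of rows, whether "rows"
// was emitted as a single object or as an array of objects.
static IEnumerable<JsonNode?> AsRows(JsonNode? node)
{
    if (node is null) return Array.Empty<JsonNode?>();
    if (node is JsonArray array) return array;
    return new[] { node };
}

var single = JsonNode.Parse("{\"rows\": {\"col1\": \"val1\", \"col2\": \"val2\"}}")!;
var many = JsonNode.Parse("{\"rows\": [{\"col1\": \"val1\", \"col2\": \"val2\"}]}")!;

foreach (var result in new[] { single, many })
{
    foreach (var row in AsRows(result["rows"]))
    {
        // Each row is an object in both cases, so column access is uniform.
        Console.WriteLine(row?["col1"]?.GetValue<string>()); // prints "val1"
    }
}
```

This does not change what the library emits; it only lets the calling code treat both shapes the same way.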

Regexp Transformation

While using OpenScraping, I ran into a fairly common scenario: a specific substring (for example, a single word) has to be selected from an element that contains plain text.

Sample: <div class="info">Contact information. Phone: 111-111-111, Address: str.Street 1/1, City. 2017</div>

It would be really useful to have a built-in RegexpTransformation that takes a custom regular expression as an input parameter '_regexp', similar to '_separator' in SplitTransformation.
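
As an illustration of what such a transformation would need to do, here is a minimal sketch that pulls the phone number out of the sample text with `System.Text.RegularExpressions`. The pattern is only an example chosen for this sample, and nothing here reflects an existing OpenScraping API.

```csharp
using System;
using System.Text.RegularExpressions;

// Inner text of the sample <div class="info"> element.
const string text = "Contact information. Phone: 111-111-111, Address: str.Street 1/1, City. 2017";

// Capture the run of digits and dashes that follows "Phone:".
Match match = Regex.Match(text, @"Phone:\s*([\d-]+)");
string phone = match.Success ? match.Groups[1].Value : string.Empty;

Console.WriteLine(phone); // prints "111-111-111"
```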

Incompatibility with .NET Standard 2.0

When trying to add this package to a project targeting .NET Standard 2.0, an error is thrown because OpenScraping v1.0.1 only supports netcoreapp2.0.

Here is the output when running dotnet add package OpenScraping:

info :   GET https://api.nuget.org/v3-flatcontainer/openscraping/index.json
info :   OK https://api.nuget.org/v3-flatcontainer/openscraping/index.json 380ms
error: Package OpenScraping 1.0.1 is not compatible with netstandard2.0 (.NETStandard,Version=v2.0). Package OpenScraping 1.0.1 supports: netcoreapp2.0 (.NETCoreApp,Version=v2.0)
error: Package 'OpenScraping' is incompatible with 'all' frameworks in project '<REDACTED>/Project.csproj'.

Regex Transformation not returning first match

Hello, I am using this config

{
  "id": {
    "_xpath": "//link[@rel='canonical']/@href",
    "_transformations": [
      {
        "_type": "RegexTransformation",
        "_regex": "[0-9]{7}$"
      }
    ]
  }
}

to extract from this document

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
   <meta charset="utf-8" />
   <link rel="canonical" href="http://example.com/test-xyz-1234567" />
    <title>Test</title>
</head>
<body>
    Test
</body>
</html>

My goal is to extract "1234567" from the link-tag.

The resulting output is empty.
Reason: In RegexTransformation.cs the first match is ignored (line 77).

I am not sure why the first match is ignored and how to work around this in my example.
Could someone please provide some advice? Thank you very much.
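
One possible explanation, offered as an assumption and not verified against RegexTransformation.cs: in .NET, `Groups[0]` is always the whole match and explicit capture groups start at index 1. If the transformation skips index 0, a pattern without parentheses would produce nothing. The sketch below only demonstrates that .NET behavior with the URL from the document above.

```csharp
using System;
using System.Text.RegularExpressions;

const string href = "http://example.com/test-xyz-1234567";

// Without parentheses the only group is Groups[0], the whole match.
Match plain = Regex.Match(href, "[0-9]{7}$");
Console.WriteLine(plain.Groups.Count);      // prints 1
Console.WriteLine(plain.Groups[0].Value);   // prints "1234567"

// With a capture group, Groups[1] holds the captured digits.
Match grouped = Regex.Match(href, "([0-9]{7})$");
Console.WriteLine(grouped.Groups.Count);    // prints 2
Console.WriteLine(grouped.Groups[1].Value); // prints "1234567"
```

If that assumption holds, changing "_regex" to "([0-9]{7})$" might be worth trying as a workaround.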

Enable CastToIntegerTransformation to transform from container too

My use case:

Given this URL: https://dev.test/index.php?PHPSESSID=a&action=profile;u=99 I wanted to extract the user ID 99 from the end of the string. My solution was to use a simple regex and convert the result to an integer:

"_transformations": [
    {
        "_type": "RegexTransformation",
        "_regex": "u=(\\d+)"
    },
    "CastToIntegerTransformation"
]

But after I got

Transformation chain broken at transformation type CastToIntegerTransformation

I started to debug the library and realized that CastToIntegerTransformation does not implement ITransformationFromObject, so it cannot be used at the end of the parsing pipeline.

Yes, this problem can easily be fixed by implementing the interface, but I thought I would mention it here.

My extended CastToIntegerTransformation class implementation:

/// <summary>
/// Class to cast selected XPath value to <see cref="int"/>.
/// </summary>
public class CastToIntegerTransformation : ITransformationFromHtml, ITransformationFromObject
{
    public object Transform(Dictionary<string, object> settings, HtmlNodeNavigator nodeNavigator, List<HtmlAgilityPack.HtmlNode> logicalParents)
    {
        var text = nodeNavigator?.Value ?? nodeNavigator?.CurrentNode?.InnerText;

        if (text != null)
        {
            int intVal;

            if (int.TryParse(text, out intVal))
            {
                return intVal;
            }
        }

        return null;
    }

    /// <summary>
    /// Transforms the input to a valid <see cref="int"/>.
    /// </summary>
    /// <param name="settings"><seealso cref="Config.TransformationConfig.ConfigAttributes"/>.</param>
    /// <param name="input">Parsed XPath value.</param>
    /// <returns><see cref="int"/>.</returns>
    /// <exception cref="FormatException">Occurs when the <paramref name="input" /> parameter
    /// is not a valid integer.</exception>
    public object Transform(Dictionary<string, object> settings, object input)
    {
        if (int.TryParse(input.ToString(), out int number))
        {
            return number;
        }

        throw new FormatException($"Input parameter {input} is not a valid integer!");
    }
}
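
For comparison, the two-step pipeline described in this issue (regex capture, then integer cast) can be sketched outside the library. `ExtractUserId` is a hypothetical stand-alone helper and does not use any OpenScraping types.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-alone helper, not part of OpenScraping:
// captures the digits after "u=" and converts them to an int.
static int? ExtractUserId(string url)
{
    Match match = Regex.Match(url, @"u=(\d+)");
    if (match.Success && int.TryParse(match.Groups[1].Value, out int id))
    {
        return id;
    }

    // No "u=" parameter, or the digits do not fit into an int.
    return null;
}

Console.WriteLine(ExtractUserId("https://dev.test/index.php?PHPSESSID=a&action=profile;u=99")); // prints 99
```

Unlike the object overload in the class above, which throws a FormatException, this sketch returns null on invalid input. Which behavior suits the library better is a design choice for the maintainers.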

Thank you for this great library!

How to return the href of a hyperlink?

I have tried "link":"//a[selector]/@href".
The parsed result is the text of the link instead of the href url.

What is the correct way to retrieve the link?

Oh gosh, this is a duplicate of the earlier issue about attribute selection.
