fsharp.data's Introduction

FSharp.Data: Making Data Access Simple

The FSharp.Data package (FSharp.Data.dll) implements everything you need to access data in your F# applications and scripts. It implements F# type providers for working with structured file formats (CSV, HTML, JSON and XML) and for accessing the WorldBank data. It also includes helpers for parsing CSV, HTML and JSON files and for sending HTTP requests.

We're open to contributions from anyone. If you want to help out but don't know where to start, you can take one of the Up-For-Grabs issues, or help to improve the documentation.

You can see the version history here.

Building

  • Install the .NET SDK specified in the global.json file
  • Run build.sh -t Build or build.cmd -t Build

Formatting

dotnet fake build -t Format
dotnet fake build -t CheckFormat

Documentation

This library comes with comprehensive documentation. The documentation is automatically generated from *.fsx files in the content folder and from the comments in the code. If you find a typo, please submit a pull request!

  • FSharp.Data package home page with more information about the library, contributions, etc.
  • The samples from the documentation are included as part of FSharp.Data.Tests.sln. Build the solution before trying out the samples to ensure that all needed packages are installed.

Releasing

The NuGet package is released by GitHub Actions CI from the master branch when a new version is pushed.

The docs are published by GitHub Actions CI on each push to the master branch.

Support and community

Code of Conduct

This repository is governed by the Contributor Covenant Code of Conduct.

We pledge to be overt in our openness, welcoming all people to contribute, and pledging in return to value them as whole human beings and to foster an atmosphere of kindness, cooperation, and understanding.

Library license

The library is available under Apache 2.0. For more information see the License file in the GitHub repository.

Maintainers

Current maintainers are Don Syme and Phillip Carter.

Historical maintainers of this project are Gustavo Guerra, Tomas Petricek and Colin Bull.

fsharp.data's People

Contributors

7sharp9, baronfel, bonjune, cartermp, colinbull, ctaggart, devcrafting, dsyme, enricosada, eugbaranov, eulerfx, forki, fpellet, giacomociti, kant2002, mlaily, mrange, nhirschey, nwolverson, ordinaryorange, pezipink, remkoboschker, rflechner, scrwtp, taylorwood, theimowski, thorium, tpetricek, zakaluka, zpbappi

fsharp.data's Issues

This is awesome!

I love the name FSharp.Data. In FSharpx we have the json parser in FSharpx.Core and the json typeprovider in FSharpx.TypeProviders.Documents, effectively grouping stuff by the implementation type. For newcomers having FSharp.Data on nuget is much easier to find, and it's a nice concept.
And the cherry on top is the documentation you've made. Very nice!

If @forki and the rest of the fsharpx team agree, I would suggest the following:

  • Move the Freebase type provider from fsharpx into here as well, as it makes sense to have both WorldBank and Freebase together
  • Remove FSharpx.TypeProviders.Freebase and FSharpx.TypeProviders.Documents from FSharpx
  • Move this into the github/fsharp organization
  • Add support for the portable profile

FSharpx is great as a holder of general purpose extensions to F#, but when there's a good group of functionality that belongs together, it makes sense to split it out. We're also discussing moving the DataStructures to a separate package there.

I don't have much free time this weekend, but I can try to do a pull request with the merge if you all agree.

Add FsCharts to the nuget package

Hi,

since nearly all of the samples are using FsChart, it would be cool if this could become part of the nuget package. This would allow me to write easier tutorials for my Dynamics NAV friends.

Cheers,
Steffen

file resolution

I use the example in a build.fsx file

type Stocks = CsvProvider<"data/MSFT.csv">
...

It compiles with no problem. I now run this through Fake

.\tools\FAKE\tools\Fake.exe "build.fsx"

where build.fsx is the previous file.
This leads to an error
build.fsx(12,15): error FS3033: The type provider 'ProviderImplementation.CsvProvider' reported an error: The input sequence was empty. Parameter name: source

  • There should be more information about which resolved file the TP was trying to open.
  • Even after copying data/MSFT.csv to the Fake.exe folder, I have the error so I am quite puzzled actually.

I switched to an absolute path, of course, but that is far from ideal for scripting in a shared environment. It would be bad to lose strong type safety because of an environment variable :)

(By the way, I guess staged execution would not have this problem: in that case I'd generate the metastage before running Fake.exe, so environment changes could no longer influence type generation.)
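
A common way to avoid depending on the working directory is to build the path from the script's own folder. The sketch below only illustrates the difference between the two resolutions. The name SamplePath is made up for this example, and whether a given version of the provider accepts such a literal is an assumption that should be checked against the version in use.

```fsharp
open System.IO

// Literal built from the folder that contains this script. Because it is a
// compile-time constant, it is the kind of value a type provider can take
// as a static argument.
[<Literal>]
let SamplePath = __SOURCE_DIRECTORY__ + "/data/MSFT.csv"

// The same relative path resolved against the process working directory,
// which depends on where the host executable (for example Fake.exe) runs.
let cwdPath = Path.GetFullPath "data/MSFT.csv"

printfn "script-relative: %s" SamplePath
printfn "cwd-relative:    %s" cwdPath
```

If the two printed paths differ, a provider that resolves against the working directory would look in the wrong place.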

Add async loading

When reading CSV, XML or JSON from the web, it should be possible to read the data asynchronously.

  • For XML and JSON, we read the entire file before processing, so this should be just a simple AsyncLoad method.
  • For CSV, it would be nice to use some sort of async enumerator so that we can read the data asynchronously on demand.
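
For the XML and JSON case, a minimal sketch of what such an AsyncLoad could look like is shown here. The helper name asyncLoad and its parse argument are hypothetical stand-ins, not the library's API; the only idea illustrated is "read everything asynchronously, then run the existing synchronous parser".

```fsharp
open System.IO

// Hypothetical helper: read the whole document asynchronously, then parse it.
// 'parse' stands in for whatever synchronous parser the provider already uses.
let asyncLoad (parse: string -> 'T) (reader: TextReader) : Async<'T> =
    async {
        let! text = reader.ReadToEndAsync() |> Async.AwaitTask
        return parse text
    }

// Usage with an in-memory reader and a trivial "parser" that counts characters.
let length =
    asyncLoad String.length (new StringReader("<root/>"))
    |> Async.RunSynchronously

printfn "%d" length
```

The CSV case is different because rows would be produced on demand, so it would need an asynchronous sequence rather than a single Async value.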

CsvProvider incorrect type for DNB Currency Exchange file

The Norwegian bank DNB has online CSV files for currency exchange rates, which is an excellent dataset to play around with when learning F#. The CSV file for historical rates can be found at https://www.dnb.no/portalfront/datafiles/miscellaneous/csv/historiske_kurser.csv

The first few lines look like this:

Dato,USD,EUR,SEK,DKK,GBP,CHF,JPY,CAD,ISK,AUD
31.01.2013,5.4833,7.4312,86.20,99.61,8.6784,601.40,6.0309,5.4688,4.2524,5.7013
30.01.2013,5.4897,7.4180,86.34,99.44,8.6540,595.64,6.0293,5.4828,4.2642,5.7458
29.01.2013,5.5316,7.4336,86.06,99.64,8.6924,597.72,6.1015,5.5022,4.2981,5.7847
28.01.2013,5.5368,7.4379,85.53,99.67,8.7033,596.35,6.1038,5.4904,4.3028,5.7613

When using the provider like this:

open System.Net

type CurrencyCsv = FSharp.Data.CsvProvider<"historiske_kurser.csv", ",", "en-us", 10>

// Download the live file and parse it with the provided type.
let wc = new WebClient()
let data = wc.DownloadString("https://www.dnb.no/portalfront/datafiles/miscellaneous/csv/historiske_kurser.csv")

let exchange = CurrencyCsv.Parse(data)

// Pair each date with its USD rate and sort by the rate.
exchange.Data
|> Seq.map (fun row -> (row.Dato, row.Usd))
|> Seq.sortBy (fun (_, usd) -> usd)
|> printfn "%A"

It infers the correct column names, but the Usd field is of type DateTime, and not float as expected.

I have tried with the nb-no culture, but with the same result.

I'm happy to fix it with a pull request as soon as I get the project set up and building on my machine, but will keep this issue for reference, and something to link the pull request against.
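
The issue does not say why the column ends up as DateTime. One defensive ordering, sketched here purely as an assumption and not as the provider's actual inference, is to try the numeric interpretation before the date interpretation. InferredKind and inferKind are hypothetical names, and the date format is taken from the sample rows above.

```fsharp
open System
open System.Globalization

// Hypothetical inferred kinds; not the library's actual types.
type InferredKind =
    | Number
    | Date
    | Text

// Try the numeric interpretation first, then a date in the file's format.
let inferKind (value: string) =
    let isNumber, _ =
        Decimal.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture)
    if isNumber then
        Number
    else
        let isDate, _ =
            DateTime.TryParseExact(
                value, "dd.MM.yyyy", CultureInfo.InvariantCulture, DateTimeStyles.None)
        if isDate then Date else Text

printfn "%A" (inferKind "5.4833")      // Number
printfn "%A" (inferKind "31.01.2013")  // Date
```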

Add assembly info with version number

Currently there's only a version in the nuget package; the dll always has version 0.0.0.0.
We could reuse some of the fake scripts from fsharpx to automatically increase both the assembly info version and the nuget package version based on the tag.

Improve type providers error reporting

When there's an error because the sample can't be found on disk or is invalid, the error message in Visual Studio isn't very informative.
We can change it to provide a little more information.
I will submit a pull request later today.

NameUtils improvements

  • When we find a field like "Foo%", instead of generating "Foo", generate "FooPct" or "FooPercentage"
  • When we find something like "Foo&Bar", instead of generating "FooBar" generate "FooAndBar"
  • When we find something like "Foo@Bar", instead of generating "FooBar" generate "FooAtBar"
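
The three rules above can be sketched as a small substitution table applied before the remaining non-alphanumeric characters are dropped. The names symbolWords and niceName are hypothetical and this is not the actual NameUtils code; "Pct" is picked arbitrarily from the two options mentioned.

```fsharp
open System

// Hypothetical replacements for symbols that would otherwise be dropped.
let symbolWords = [ "%", "Pct"; "&", "And"; "@", "At" ]

// Replace each known symbol with its word, then drop any remaining
// characters that are not letters or digits.
let niceName (field: string) =
    let spelledOut =
        symbolWords
        |> List.fold
            (fun (name: string) (symbol: string, word: string) -> name.Replace(symbol, word))
            field
    spelledOut |> String.filter Char.IsLetterOrDigit

printfn "%s" (niceName "Foo%")     // FooPct
printfn "%s" (niceName "Foo&Bar")  // FooAndBar
printfn "%s" (niceName "Foo@Bar")  // FooAtBar
```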

Consume the sample data directly by default

The FSharpx type providers for json and xml start with the data already loaded, which is handy for scripting scenarios and demos.
In FSharp.Data we always have to do a .Load or .Parse after defining the type. By default it could load the sample data, and we would only need to call .Load or .Parse if we wanted to override it. You could disable that behavior by passing LoadSampleData=false.

CSV Provider performance improvements for big files

The CSV provider still hangs VS when we give it very large files. Possible improvements:

  • Do the inference only when accessing the first member of a row, so it doesn't start processing before we're able to change the InferRows parameter
  • Make the default InferRows something other than the maximum int value, for example a maximum of 1000
  • Do the inference asynchronously with a timeout so we don't hang VS when it takes more than 10 seconds
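
To illustrate the second point, the sketch below caps how many rows a stand-in inference step pulls from a lazily produced sequence. defaultInferRows, rowsRead and inferColumns are hypothetical names, and the "inference" here only counts columns; it is meant to show that a row limit keeps the work bounded, not how the provider is implemented.

```fsharp
// Hypothetical default, following the suggestion of a maximum of 1000 rows.
let defaultInferRows = 1000

// Counts how many rows are actually pulled from the source.
let rowsRead = ref 0

// A large, lazily generated stand-in for the lines of a big CSV file.
let rows =
    seq {
        for i in 0 .. 999999 do
            rowsRead.Value <- rowsRead.Value + 1
            yield sprintf "%d,%d" i (i * 2)
    }

// Stand-in for inference: looks only at the first maxRows rows and
// reports the largest column count it saw.
let inferColumns (maxRows: int) (lines: seq<string>) =
    lines
    |> Seq.truncate maxRows
    |> Seq.map (fun line -> line.Split(',').Length)
    |> Seq.max

let columns = inferColumns defaultInferRows rows
printfn "columns = %d, rows read = %d" columns rowsRead.Value
```

Because the sequence is lazy, the remaining rows are never generated, so the cost of inference no longer grows with the size of the file.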

Problem with booleans in CsvProvider

With a csv file like this:

Column1,Column2,Column3
TRUE,NO,3

When compiling this code:

open FSharp.Data

type csvType = CsvProvider<"C:/temp.csv">
let csv = csvType.Load "C:/temp.csv"
for line in csv.Data do
    printfn "%b %b %i" line.Column1 line.Column2 line.Column3

We get the following errors:

Error   1   The type provider 'ProviderImplementation.CsvProvider' reported an error in the context of provided type 'FSharp.Data.CsvProvider,Sample="C:/temp.csv"+DomainTypes+Row', member 'get_Column1'. The error: Constructing call of the 'ConvertBoolean' operation failed.   i:\documents\visual studio 2012\Projects\ConsoleApplication12\ConsoleApplication12\Program.fs   6   24  ConsoleApplication12
Error   2   The type provider 'ProviderImplementation.CsvProvider' reported an error in the context of provided type 'FSharp.Data.CsvProvider,Sample="C:/temp.csv"+DomainTypes+Row', member 'get_Column2'. The error: Constructing call of the 'ConvertBoolean' operation failed.   i:\documents\visual studio 2012\Projects\ConsoleApplication12\ConsoleApplication12\Program.fs   6   37  ConsoleApplication12

This was working in previous versions and was broken recently.

Document AssemblyReplacer.fs

The code in AssemblyReplacer.fs implements essential functionality for the portable profile, but is not commented at all.

We need to add at least an overview of the big picture (what it is doing in general) and some explanatory comments on all top-level functions (similar to how this is done in the rest of the code base).

Improve handling of missing values in the CSV provider

I have some code ready to push, but I'd also like to discuss alternatives.

Currently, when there's a missing value, the inference will force that column to be of type string. The only exception is when there's an explicit #N/A in a column of type double: in that case inference will still recognize the column as a double and use double.NaN at runtime.

I propose the following:

  • When there's a missing value in a double column, also treat it as double.NaN
  • When there's a missing value in a decimal column, infer that column to be double instead, so we can use double.NaN
  • When there's a missing value in int32, int64, bool, or date column, make that column type an option
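
The proposal can be read as a mapping from the type inferred from the non-missing values to the type used when the column also has gaps. The sketch below encodes that mapping. ColumnType and withMissingValues are hypothetical names, and leaving string columns unchanged is an assumption, since the proposal does not mention them.

```fsharp
// Hypothetical model of column types; not the provider's real representation.
type ColumnType =
    | TString
    | TDouble
    | TDecimal
    | TInt32
    | TInt64
    | TBool
    | TDate
    | TOption of ColumnType

// Adjust the type inferred from the non-missing values when the column
// also contains missing values, following the proposal above.
let withMissingValues (inferred: ColumnType) =
    match inferred with
    | TDouble -> TDouble                  // missing becomes double.NaN at runtime
    | TDecimal -> TDouble                 // switch to double so NaN is available
    | TInt32 | TInt64 | TBool | TDate -> TOption inferred
    | TString | TOption _ -> inferred     // assumption: left unchanged

printfn "%A" (withMissingValues TDecimal)
printfn "%A" (withMissingValues TInt32)
```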

Other alternatives:

  • Instead of option types, use nullables for int32, int64, bool, and date columns. Both the XML and JSON providers use options, but the freebase provider uses nullables. Maybe add a parameter named PreferNullableTypes to activate nullables but use options by default? Or make nullables the default and allow switching to options? Nullables are easier to handle for numbers because of the Linq.NullableOperators module
  • Never generate options/nullables for datetimes and instead return the default datetime at runtime

Hide implementation methods

Csv

The method CsvFile.Parse: data:TextReader * sep:string option -> CsvFile is accessible, but it really shouldn't be, because it returns CsvFile and not the generated CsvType.
With the other type providers this problem doesn't exist, because we usually replace such methods with others that have the same signature in the derived generated class. In the CsvProvider case, however, the generated Parse method doesn't have the sep parameter.
The ideal fix would be to make this method protected, but F# doesn't support that. Any other ideas?

WorldBank

There's a bunch of _Get methods that get the untyped data. We can hide them by putting them in an interface. Maybe that can also fix the problem with csv
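
The interface idea relies on the fact that F# implements interfaces explicitly, so interface members are reachable only after an upcast. A minimal sketch follows. IUntypedData, WorldBankRuntime, GetRaw and the sample value are all made up for illustration and do not correspond to the library's real types.

```fsharp
// Hypothetical interface carrying the untyped accessors.
type IUntypedData =
    abstract GetRaw: key: string -> string

// Hypothetical runtime type. GetRaw is not visible on WorldBankRuntime
// itself, only through the IUntypedData interface.
type WorldBankRuntime(values: Map<string, string>) =
    interface IUntypedData with
        member this.GetRaw(key) = values.[key]

let runtime = WorldBankRuntime(Map.ofList [ "population", "42" ])

// A direct call such as runtime.GetRaw("population") would not compile,
// because the member is only available after an explicit upcast.
let raw = (runtime :> IUntypedData).GetRaw "population"
printfn "%s" raw
```

Generated code could perform the upcast internally, while the untyped members would no longer show up in completion lists for end users.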

Add more tests

The current tests cover the structural inference (and some aspects of JSON), but we need more tests for the end-user type providers and for the JSON parser.

I think these can be largely adapted from fsharpx.

Consider renaming CsvProvider

I know this suggestion is a little bold, but CsvProvider currently works not just with CSV files but also with tab-separated files and any other similar textual format. In the future it might well support more formats of tabular data (like xls/xlsx, hdf5/netCDF4, .rdata, .mat, etc.), either directly or maybe as plugins (I have some ideas about how to make that work without changing the API or creating dependencies...). The inference and generation of typed properties is the same for all the formats.
Both the R tools and the several Python libraries that work with these kinds of files usually name the function read.table or read_table (even though they have overloads called read.csv or read_csv that only set the default separator to ',').
Do you think renaming CsvProvider to TabularDataProvider would be a good idea? Or are people expecting the current name, in which case we could make the same type provider available under additional names (like we do with freebase and worldbank, which have two versions each)?

Remove pluralization service

The NameUtils.fs file uses PluralizationService from System.Data.Entity.Design.PluralizationServices. This needs to be replaced with some other library or a custom implementation of English pluralization (in order to make the library compatible with the Client profile and more importantly also for portable profile and Mono).
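
A custom implementation could start from a handful of suffix rules plus a table of irregular nouns. The sketch below is deliberately small and is not a replacement for PluralizationService. The names irregular and pluralize and the chosen rules are assumptions for illustration.

```fsharp
open System

// Hypothetical table of irregular nouns; a real one would be much longer.
let irregular = Map.ofList [ "person", "people"; "child", "children" ]

let pluralize (noun: string) =
    let isVowel (c: char) = "aeiou" |> Seq.contains c
    let endsWithAny (suffixes: string list) =
        suffixes
        |> List.exists (fun (s: string) -> noun.EndsWith(s, StringComparison.Ordinal))
    // True for words like "city", false for words like "day".
    let consonantY =
        noun.Length > 1
        && noun.EndsWith("y", StringComparison.Ordinal)
        && not (isVowel (noun.[noun.Length - 2]))
    match Map.tryFind noun irregular with
    | Some plural -> plural
    | None ->
        if consonantY then noun.Substring(0, noun.Length - 1) + "ies"
        elif endsWithAny [ "s"; "x"; "z"; "ch"; "sh" ] then noun + "es"
        else noun + "s"

[ "row"; "city"; "box"; "day"; "person" ]
|> List.iter (fun noun -> printfn "%s -> %s" noun (pluralize noun))
```

Since it uses only FSharp.Core and basic string methods, a sketch like this has no dependency on System.Data.Entity.Design.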

Allow overriding the schema in the CSV provider

Two alternatives:

  • Type provider parameter like in TryFSharp.org:

    type csvType = CsvFile<file, Schema = "date,float,float,float,float,int,float">
    
  • Allow to specify the type in the header title within braces, like we already allow for units

    Column1 (m), Column1 (float), Column2 (float<m>)
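
For the second alternative, the header titles would have to be split into a column name and an optional annotation. A minimal sketch of that split is given here. parseHeader is a hypothetical helper, and deciding whether an annotation such as "m" is a unit or a type name is left out.

```fsharp
open System.Text.RegularExpressions

// Split a header title such as "Column2 (float<m>)" into the column name
// and the optional annotation found in the trailing parentheses.
let parseHeader (title: string) =
    let trimmed = title.Trim()
    let m = Regex.Match(trimmed, @"^(?<name>.*?)\s*\((?<annotation>[^()]*)\)$")
    if m.Success then
        (m.Groups.["name"].Value, Some (m.Groups.["annotation"].Value))
    else
        (trimmed, None)

"Column1 (m), Column1 (float), Column2 (float<m>)".Split(',')
|> Array.map parseHeader
|> Array.iter (printfn "%A")
```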
    

Generate enums in csv type provider

If a column is inferred as string, and there are many repeated values, it's probably an enumeration, so we could generate an enum. If the inference gets it wrong, we could always override (#19). We could use something like (number of distinct values / number of rows) < 0.2 to trigger this.
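
The trigger described above is simple enough to sketch directly. The names enumThreshold and looksLikeEnum are hypothetical, and the sample columns are made-up data used only to exercise the ratio.

```fsharp
// Threshold taken from the ratio suggested above.
let enumThreshold = 0.2

// A string column looks like an enumeration when the ratio of distinct
// values to rows is below the threshold.
let looksLikeEnum (values: string list) =
    match values with
    | [] -> false
    | _ ->
        let distinct = values |> List.distinct |> List.length
        float distinct / float values.Length < enumThreshold

// Made-up sample columns: one with two repeated values, one with unique ids.
let status = List.init 100 (fun i -> if i % 3 = 0 then "open" else "closed")
let ids = List.init 100 (fun i -> sprintf "id-%d" i)

printfn "%b" (looksLikeEnum status)  // true: 2 distinct values in 100 rows
printfn "%b" (looksLikeEnum ids)     // false: every value is distinct
```

A ratio alone misclassifies short columns, so a minimum row count would probably be needed as well.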
