
goavro

Goavro is a library that encodes and decodes Avro data.

Description

  • Encodes to and decodes from both binary and textual JSON Avro data.
  • Codec is stateless and is safe to use by multiple goroutines.

With the exception of features not yet supported, goavro attempts to be fully compliant with the most recent version of the Avro specification.

Dependency Notice

All usage of gopkg.in has been removed in favor of Go modules. Please update your import paths to github.com/linkedin/goavro/v2. v1 users can still use old versions of goavro by adding a constraint to their go.mod or Gopkg.toml file:

go.mod:

require (
    github.com/linkedin/goavro v1.0.5
)

Gopkg.toml:

[[constraint]]
name = "github.com/linkedin/goavro"
version = "=1.0.5"

Major Improvements in v2 over v1

Avro namespaces

The original version of this library was written before I fully understood how Avro namespaces ought to work. After using Avro for a long time, and after a lot of research, I believe the library now handles Avro namespaces properly: it passes every namespace test case in the Apache Avro distribution, including referring to a previously defined data type later in the same schema.

Getting Data into and out of Records

The original version of this library required creating goavro.Record instances and using getters and setters to access a record's fields. When schemas were complex, this took a lot of work to debug and get right. The original version also required users to break schemas into chunks and maintain a different schema for each record type. This was cumbersome, annoying, and error-prone.

The new version of this library eliminates the goavro.Record type and accepts a native Go map for all records to be encoded. Keys are the field names and values are the field values. Nothing could be easier. Conversely, decoding Avro data yields a native Go map for the upstream client to pull data out of.

Furthermore, there is never a reason to break your schema down into separate record schemas. Simply feed the entire schema to the NewCodec function once when you create the Codec, then use it. The library knows how to parse the schema and ensures that data values for records and their fields are properly encoded and decoded.
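For example, here is a minimal sketch (the record and field names are illustrative, not from this repository) that feeds a schema containing a nested record to NewCodec once and encodes a plain Go map:

package main

import (
    "fmt"

    "github.com/linkedin/goavro/v2"
)

func main() {
    // The entire schema, nested record included, is given to NewCodec once.
    codec, err := goavro.NewCodec(`
        {
          "type": "record",
          "name": "Person",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "address", "type": {
              "type": "record",
              "name": "Address",
              "fields": [{"name": "city", "type": "string"}]
            }}
          ]
        }`)
    if err != nil {
        fmt.Println(err)
        return
    }

    // Records are plain Go maps; nested records are simply nested maps.
    native := map[string]interface{}{
        "name": "Ada",
        "address": map[string]interface{}{
            "city": "London",
        },
    }

    binary, err := codec.BinaryFromNative(nil, native)
    if err != nil {
        fmt.Println(err)
        return
    }

    decoded, _, err := codec.NativeFromBinary(binary)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(decoded.(map[string]interface{})["address"])
}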

3x--4x Performance Improvement

The original version of this library was written with Go's idea of io.Reader and io.Writer composition in mind. Although composition is a powerful tool, it meant the original library had to pull bytes off the io.Reader, often one byte at a time, check for read errors, decode the bytes, and repeat. This version operates on native Go byte slices; as a result, both decoding and encoding the complex Avro data we use here at LinkedIn is between three and four times faster than before.

Avro JSON Support

The original version of this library did not support JSON encoding or decoding, because it was not needed for our internal use at the time. When writing the new version of the library I decided to tackle this issue once and for all, because so many engineers needed this functionality for their work.

Better Handling of Record Field Default Values

The original version of this library did not handle default values for record fields well. This version uses a record field's default value when encoding from native Go data to Avro data and the field is not specified. Likewise, when decoding from Avro JSON data to native Go data and a field is not specified, the default value is used to populate the field.
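A small sketch of that behavior (the schema and values are illustrative): the "kind" field is omitted from the native map, so its default should appear in the encoded output.

package main

import (
    "fmt"

    "github.com/linkedin/goavro/v2"
)

func main() {
    codec, err := goavro.NewCodec(`
        {
          "type": "record",
          "name": "Event",
          "fields": [
            {"name": "id", "type": "long"},
            {"name": "kind", "type": "string", "default": "unknown"}
          ]
        }`)
    if err != nil {
        fmt.Println(err)
        return
    }

    // "kind" is not specified, so the codec should fall back to its default.
    textual, err := codec.TextualFromNative(nil, map[string]interface{}{"id": 42})
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(string(textual)) // expected to include "kind":"unknown"
}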

Contrast With Code Generation Tools

If you have the ability to rebuild and redeploy your software whenever data schemas change, code generation tools might be the best solution for your application.

There are numerous excellent tools for generating source code to translate data between native Go data and Avro binary or textual data. One such tool is linked below. If a particular application is designed to work with a rarely changing schema, programs that use code-generated functions can potentially be more performant than a program that uses goavro to create a Codec dynamically at runtime.

I recommend benchmarking the resulting programs on typical data, using both the code-generated functions and goavro, to see which performs better. Not all code-generated functions will outperform goavro for all data corpuses.
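As a rough starting point, a benchmark sketch using Go's standard testing package might look like the following (the schema and record are illustrative; a code-generated encoder would be benchmarked the same way for comparison):

package bench

import (
    "testing"

    "github.com/linkedin/goavro/v2"
)

func BenchmarkGoavroBinaryFromNative(b *testing.B) {
    codec, err := goavro.NewCodec(`
    {
      "type": "record",
      "name": "Sample",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "metric", "type": ["null", "int"]}
      ]
    }`)
    if err != nil {
        b.Fatal(err)
    }

    record := map[string]interface{}{
        "name":   "example",
        "metric": goavro.Union("int", 5),
    }

    buf := make([]byte, 0, 64)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := codec.BinaryFromNative(buf[:0], record); err != nil {
            b.Fatal(err)
        }
    }
}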

If you don't have the ability to rebuild and redeploy software whenever a data schema changes, goavro could be a great fit for your needs. With goavro, your program can be given a new schema while running, compile it into a Codec on the fly, and immediately start encoding or decoding data using that Codec. Because Avro specifies that encoded data always be accompanied by a schema, this is not usually a problem. If the schema change is backwards compatible, and the portion of your program that handles the decoded data can still reference the decoded fields, nothing needs to be done when your program detects a schema change while using goavro Codec instances to encode or decode data.

Usage

Documentation is available via GoDoc.

package main

import (
    "fmt"

    "github.com/linkedin/goavro/v2"
)

func main() {
    codec, err := goavro.NewCodec(`
        {
          "type": "record",
          "name": "LongList",
          "fields" : [
            {"name": "next", "type": ["null", "LongList"], "default": null}
          ]
        }`)
    if err != nil {
        fmt.Println(err)
    }

    // NOTE: May omit fields when using default value
    textual := []byte(`{"next":{"LongList":{}}}`)

    // Convert textual Avro data (in Avro JSON format) to native Go form
    native, _, err := codec.NativeFromTextual(textual)
    if err != nil {
        fmt.Println(err)
    }

    // Convert native Go form to binary Avro data
    binary, err := codec.BinaryFromNative(nil, native)
    if err != nil {
        fmt.Println(err)
    }

    // Convert binary Avro data back to native Go form
    native, _, err = codec.NativeFromBinary(binary)
    if err != nil {
        fmt.Println(err)
    }

    // Convert native Go form to textual Avro data
    textual, err = codec.TextualFromNative(nil, native)
    if err != nil {
        fmt.Println(err)
    }

    // NOTE: Textual encoding will show all fields, even those with values that
    // match their default values
    fmt.Println(string(textual))
    // Output: {"next":{"LongList":{"next":null}}}
}

Also please see the example programs in the examples directory for reference.

OCF file reading and writing

This library supports reading and writing data in the Object Container File (OCF) format.

package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/linkedin/goavro/v2"
)

func main() {
	avroSchema := `
	{
	  "type": "record",
	  "name": "test_schema",
	  "fields": [
		{
		  "name": "time",
		  "type": "long"
		},
		{
		  "name": "customer",
		  "type": "string"
		}
	  ]
	}`

	// Writing OCF data
	var ocfFileContents bytes.Buffer
	writer, err := goavro.NewOCFWriter(goavro.OCFConfig{
		W:      &ocfFileContents,
		Schema: avroSchema,
	})
	if err != nil {
		fmt.Println(err)
	}
	err = writer.Append([]map[string]interface{}{
		{
			"time":     1617104831727,
			"customer": "customer1",
		},
		{
			"time":     1717104831727,
			"customer": "customer2",
		},
	})
	fmt.Println("ocfFileContents", ocfFileContents.String())

	// Reading OCF data
	ocfReader, err := goavro.NewOCFReader(strings.NewReader(ocfFileContents.String()))
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println("Records in OCF File");
	for ocfReader.Scan() {
		record, err := ocfReader.Read()
		if err != nil {
			fmt.Println(err)
		}
		fmt.Println("record", record)
	}
}

The above code is also available as a Go Playground example.

ab2t

The ab2t program is similar to the reference standard avrocat program and converts Avro OCF files to Avro JSON encoding.

arw

The Avro-ReWrite program, arw, can be used to rewrite an Avro OCF file while optionally changing the block counts and the compression algorithm. arw can also upgrade the schema, provided the existing datum values can be encoded with the newly provided schema.

avroheader

The Avro Header program, avroheader, can be used to print various header information from an OCF file.

splice

The splice program can be used to splice together an OCF file from an Avro schema file and a raw Avro binary data file.

Translating Data

A Codec provides four methods for translating between a byte slice of either binary or textual Avro data and native Go data.

The following methods convert data between native Go data and byte slices of the binary Avro representation:

BinaryFromNative
NativeFromBinary

The following methods convert data between native Go data and byte slices of the textual Avro representation:

NativeFromTextual
TextualFromNative

Each Codec also exposes the Schema method to return a simplified version of the JSON schema string used to create the Codec.
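A brief sketch of the Schema method (the schema here is illustrative):

func ExampleSchema() {
    codec, err := goavro.NewCodec(`{"type": "map", "values": "double"}`)
    if err != nil {
        fmt.Println(err)
    }
    // Prints the simplified schema string the Codec was compiled from.
    fmt.Println(codec.Schema())
}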

Translating From Avro to Go Data

Goavro does not use Go's structure tags to translate data between native Go types and Avro encoded data.

When translating from either binary or textual Avro to native Go data, goavro returns primitive Go data values for corresponding Avro data values. The table below shows how goavro translates Avro types to Go types.

Avro            Go
null            nil
boolean         bool
bytes           []byte
float           float32
double          float64
long            int64
int             int32
string          string
array           []interface{}
enum            string
fixed           []byte
map and record  map[string]interface{}
union           see below

Because of encoding rules for Avro unions, when a union's value is null, a simple Go nil is returned. However, when a union's value is non-nil, a Go map[string]interface{} with a single key is returned for the union. The map's single key is the Avro type name and its value is the datum's value.
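For instance, a sketch of handling decoded union values for a ["null","string"] union (the data is illustrative):

func ExampleUnionDecode() {
    codec, err := goavro.NewCodec(`["null", "string"]`)
    if err != nil {
        fmt.Println(err)
    }

    // A null union value decodes to a plain Go nil.
    native, _, _ := codec.NativeFromTextual([]byte(`null`))
    fmt.Println(native) // <nil>

    // A non-null union value decodes to a single-key map keyed by the Avro type name.
    native, _, _ = codec.NativeFromTextual([]byte(`{"string":"hello"}`))
    if m, ok := native.(map[string]interface{}); ok {
        fmt.Println(m["string"]) // hello
    }
}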

Translating From Go to Avro Data

Goavro does not use Go's structure tags to translate data between native Go types and Avro encoded data.

When translating from native Go to either binary or textual Avro data, goavro generally requires the same native Go data types as the decoder would provide, with some exceptions for programmer convenience. Goavro will accept any numerical data type provided there is no precision lost when encoding the value. For instance, providing float64(3.0) to an encoder expecting an Avro int would succeed, while sending float64(3.5) to the same encoder would return an error.
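A short sketch of that numeric flexibility (values are illustrative):

func ExampleNumericConversion() {
    codec, err := goavro.NewCodec(`"int"`)
    if err != nil {
        fmt.Println(err)
    }

    // float64(3.0) loses no precision as an Avro int, so this should succeed.
    if _, err := codec.BinaryFromNative(nil, float64(3.0)); err != nil {
        fmt.Println("unexpected error:", err)
    }

    // float64(3.5) cannot be represented as an Avro int, so this should return an error.
    if _, err := codec.BinaryFromNative(nil, float64(3.5)); err != nil {
        fmt.Println("error as expected:", err)
    }
}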

When providing a slice of items for an encoder, the encoder will accept either []interface{}, or any slice of the required type. For instance, when the Avro schema specifies: {"type":"array","items":"string"}, the encoder will accept either []interface{}, or []string. If given []int, the encoder will return an error when it attempts to encode the first non-string array value using the string encoder.
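A short sketch of that slice handling (data is illustrative):

func ExampleArrayEncoding() {
    codec, err := goavro.NewCodec(`{"type": "array", "items": "string"}`)
    if err != nil {
        fmt.Println(err)
    }

    // Both of these forms are accepted for an array of strings.
    if _, err := codec.BinaryFromNative(nil, []interface{}{"a", "b"}); err != nil {
        fmt.Println(err)
    }
    if _, err := codec.BinaryFromNative(nil, []string{"a", "b"}); err != nil {
        fmt.Println(err)
    }

    // []int should fail once the first element reaches the string encoder.
    if _, err := codec.BinaryFromNative(nil, []int{1, 2}); err != nil {
        fmt.Println("error as expected:", err)
    }
}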

When providing a value for an Avro union, the encoder will accept nil for a null value. If the value is non-nil, it must be a map[string]interface{} with a single key-value pair, where the key is the Avro type name and the value is the datum's value. As a convenience, the Union function wraps any datum value in a map as specified above.

func ExampleUnion() {
    codec, err := goavro.NewCodec(`["null","string","int"]`)
    if err != nil {
        fmt.Println(err)
    }
    buf, err := codec.TextualFromNative(nil, goavro.Union("string", "some string"))
    if err != nil {
        fmt.Println(err)
    }
    fmt.Println(string(buf))
    // Output: {"string":"some string"}
}

Limitations

Goavro is a fully featured encoder and decoder of binary and textual JSON Avro data. It fully supports recursive data structures, unions, and namespacing. It does, however, have a few limitations, because some features have yet to be implemented.

Aliases

The Avro specification allows an implementation to optionally map a writer's schema to a reader's schema using aliases. Although goavro can compile schemas with aliases, it does not yet implement this feature.

Kafka Streams

Kafka is the reason goavro was written. Similar to Avro Object Container Files being a layer of abstraction above Avro Data Serialization format, Kafka's use of Avro is a layer of abstraction that also sits above Avro Data Serialization format, but has its own schema. Like Avro Object Container Files, this has been implemented but removed until the API can be improved.

Default Maximum Block Counts and Block Sizes

When decoding arrays, maps, and OCF files, the Avro specification states that the binary includes block counts and block sizes that specify how many items are in the next block, and how many bytes are in the next block. To prevent possible denial-of-service attacks on clients that use this library caused by attempting to decode maliciously crafted data, decoded block counts and sizes are compared against public library variables MaxBlockCount and MaxBlockSize. When the decoded values exceed these values, the decoder returns an error.

Because not every upstream client is the same, we've chosen some sane defaults for these values, but left them as mutable variables so that clients can override them if deemed necessary for their purposes. Both are initialized to math.MaxInt32 (about 2.1 billion items, or roughly 2 GiB when interpreted as a byte count).
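A sketch of overriding those limits before creating any codecs, assuming the exported variables accept direct assignment as described above (values are illustrative):

// Tighten decode limits before handling untrusted data; values are illustrative.
goavro.MaxBlockCount = 10000  // reject blocks claiming more than 10,000 items
goavro.MaxBlockSize = 1 << 20 // reject blocks claiming more than 1 MiB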

Schema Evolution

Please see my reasons why schema evolution is broken for Avro 1.x.

License

Goavro license

Copyright 2017 LinkedIn Corp. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Google Snappy license

Copyright (c) 2011 The Snappy-Go Authors. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Third Party Dependencies

Google Snappy

Goavro links with Google Snappy to provide Snappy compression and decompression support.


goavro's Issues

Field names don't allow fully qualified names with periods.

When the example code server.go in linkedin/goavro/examples/net is run and the client (built from client.go) in the same directory is run, the server program crashes. This is because the timestamp field is set using the fully-qualified path in server.go:

98         // you can fully qualify the field name
99         someRecord.Set("com.example.timestamp", int64(1082196484))

I have a simple patch:

commit a0dd9adc255b38b733f66c0e8ce3f038eae5b2b2
Author: Arun M. Krishnakumar <[email protected]>
Date:   Fri May 1 10:53:21 2015 -0700

    Allow fully qualified paths in field names.

diff --git a/name.go b/name.go
index bf32798..bd4efcb 100644
--- a/name.go
+++ b/name.go
@@ -85,7 +85,7 @@ func (e ErrInvalidName) Error() string {
 }

 func isRuneInvalidForFirstCharacter(r rune) bool {
-       if (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z') || r == '_' {
+       if (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z') || r == '_' || r == '.' {
                return false
        }
        return true
@@ -106,7 +106,7 @@ func checkName(s string) error {
                return &ErrInvalidName{"start with [A-Za-z_]"}
        }
        if strings.IndexFunc(s[1:], isRuneInvalidForOtherCharacters) != -1 {
-               return &ErrInvalidName{"have second and remaining characters contain only [A-Za-z0-9_]"}
+               return &ErrInvalidName{"have second and remaining characters contain only [A-Za-z0-9._]"}
        }
        return nil
 }

If you think the above is valid and complete, I can create a pull request that can be merged, or you could take the patch yourself.

Encoder does not encode empty maps correctly

When attempting to encode an empty map field (map[string]interface{}{}) whose schema is {"type": ["null", {"type": "map", "values": "double"}]}, it did not return any error. However, when trying to decode the binary produced by the encoder, it failed to decode and returned the following error:

cannot decode record (Rec): 
    cannot decode record (NestedRec): 
        cannot encode union (union): 
            index must be between 0 and 1; read index: -59

Decoding the same binary using Java version of Apache Avro decoder, I got the following Exception:

java.lang.ArrayIndexOutOfBoundsException: -59
    at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)

BufferFromReader breaks for schema > 4kB

Hi,

I ran into an issue while reading Avro datums from an Avro OCF file written using goavro. The issue disappears when the schema size is < 4 KB, or if I do not use BufferFromReader.
The error message is: cannot build cannot read header metadata: cannot decode bytes: buffer underrun.

When I looked into the code, the issue is in decoder.go while reading the schema bytes:
the line below reads fewer bytes than expected.
bytesRead, err := r.Read(buf)

IMO, the writer is writing the data correctly.
The issue can also be reproduced using the file example given on GitHub; we just need to make the schema longer by changing the "doc" entry to be > 4 KB.

Thanks.

Lack of schema static compiler

Hi,

Is there any requirement or development plan for a static compiler of Avro schemas? Currently, record fields are accessed as interface{}. Programmers need to remember each field's real type and cast it all over the place, which is somewhat cumbersome. If a schema compiler were available, records could be decoded into structs with schema-specified types, which would fix this problem.

avro.codec missing from header

in ocf_reader line 163: fr.CompressionCodec, err = getHeaderString("avro.codec", meta)

I'm using Hive to compile data into Avro format, but it doesn't write an avro.codec header because we aren't compressing the files. Is there any way to "not require" the header? When I fork the library and have that line return "null", nil, it works fine. Could this be an old version of the Avro standard?

Thoughts?

Cannot encode enum

package main

import (
	"bytes"
	"log"

	"github.com/linkedin/goavro"
)

func main() {

	recordSchemaJSON := `{
            "name": "Request",
            "type": "record",
            "fields": [{
                "name": "type",
                "type": "enum",
                "symbols": ["request", "pubsub"],
                "default": "request"
            }]
        }`

	codec, err := goavro.NewCodec(recordSchemaJSON)
	if err != nil {
		log.Fatal(err)
	}

	record, err := goavro.NewRecord(goavro.RecordSchema(recordSchemaJSON))
	if err != nil {
		log.Fatal(err)
	}

	record.Set("type", "request")

	bb := new(bytes.Buffer)
	if err = codec.Encode(bb, record); err != nil {
		log.Fatal(err)
	}

}
E:\Repositorys\hemera-golang>go run main.go
2017/01/27 23:35:19 cannot encode record (Request): cannot encode enum (type): expected: Enum; received: string
exit status 1

Does not implement default correctly

From the 1.7.7 spec:
default: A default value for this field, used when reading instances that lack this field (optional).

Defaults are defined in the spec to be used in schema resolution between the writer's and reader's schemas when consuming. The implementation here inserts the default value when a field is missing or has invalid data:
codec.go 651:

				if reflect.ValueOf(field.Datum).IsValid() {
					value = field.Datum
				} else if field.hasDefault {
					value = field.defval
				} else {
					return newEncoderError(friendlyName, "field has no data and no default set: %v", field.Name)
				}

One possible implementation is, when the field is missing, to insert null, provided null is first in the type's union.

longdecoder overflow / dropping most significant bit if value > int64

Hi!

I have an issue, somehow it seems like the longdecoder is dropping the most significant bit.
https://github.com/linkedin/goavro/blob/master/decoder.go#L122-L137

The data we store in avro as a binary representation from another library:
binary: 1011001101111100010000101000010111000111000011000000100000110000
as uint64: 12933285372238759984
as int64: -5513458701470791632

But somehow i get this value when reading it:
int64: 3709913335383984176
Tried casting it to uint64: 3709913335383984176
This is equivalent to dropping the first / left-most / most significant bit of the data.

if we take this value which is wrong and bit-shift in the most significant bit we get the real number:
uint64(3709913335383984176) + 1<<63 = 12933285372238759984

So somehow the long decoder seems to be dropping the left-most bit when representing the value as int64, at least when the value is > 2^63; the expected behaviour would be a negative int64 value.

Performance Expectations?

I'm working on my first project with Avro and I'm wondering what the encoding performance should be like. After trying to debug some performance issues in my application, I came up with a small script:

func clientTest2() {
	const schema = `{
	  "type": "record",
	  "name": "test_record",
	  "fields": [
	    {
	      "name": "name",
	      "type": "string"
	    },
	    {
	      "name": "metric",
	      "type": [
	        "int",
	        "null"
	      ]
	    }
	  ]
	}`

	{
		var codec goavro.Codec
		if c, err := goavro.NewCodec(schema); err == nil {
			codec = c
		} else {
			panic(err)
		}
		recordSchema := goavro.RecordSchema(schema)

		start := time.Now()
		for i := 0; i < 1000000; i++ {

			if someRecord, err := goavro.NewRecord(recordSchema); err == nil {

				i := 5
				if err := someRecord.Set("name", fmt.Sprintf("i_%d", i)); err != nil {
					log.Warnf("Set Error: %v", err)
				}
				if err := someRecord.Set("metric", int32(i)); err != nil {
					log.Warnf("Set Error: %v", err)
				}
				codec.Encode(ioutil.Discard, someRecord)
				/*
					if err := s.Write(someRecord); err != nil {
						log.Warnf("Write Error: %v", err)
						os.Exit(1)
					}
				*/
			} else {
				log.Warnf("NewRecord Error: %v", err)
				os.Exit(1)
			}
		}
		fmt.Printf("Avro Bench took %v\n", time.Now().Sub(start))
	}

	{
		var metric struct {
			Name   string `json:"string"`
			Metric int    `json:"metric"`
		}
		start := time.Now()
		for i := 0; i < 1000000; i++ {

			i := 5
			metric.Name = fmt.Sprintf("i_%d", i)
			metric.Metric = i
			json.Marshal(metric)
			/*
				if err := s.Write(someRecord); err != nil {
					log.Warnf("Write Error: %v", err)
					os.Exit(1)
				}
			*/

		}
		fmt.Printf("JSON Bench took %v\n", time.Now().Sub(start))
	}

}

On my local Machine (MacBook Pro (Retina, 13-inch, Early 2015) 2.7 GHz Intel Core i5) I get the following results:

$ go run ./cmd/main.go -tc
Avro Bench took 3.052033713s
JSON Bench took 982.483477ms

Here Avro consistently clocks in at roughly a third the speed of the standard JSON library. Is this expected? Should I be calling NewRecord for every loop iteration?

Structs copy when calling method

Hi,

There are many places in the source code where methods of some structs (record, codec, etc.) are defined on value receivers rather than pointer receivers, e.g.

func (r Record) Get(fieldName string) (interface{}, error) {

(record.go, line 44)

These structs are copied every time one of these methods is called. Please refer to this example of Go's behaviour of copying the value when a method is called on a value receiver.

http://play.golang.org/p/_UVasafGHc

I think for most "read-only, no modification" methods, passing a pointer is good enough to avoid the cost of copying the whole struct.

AVDL-based codegen

We just recently open sourced our fork of the AVRO java code generator, which generates golang avro objects from avdl files. It currently generates bindings and associated roundtrip unit tests that use the AVRO C libraries for serialization/deserialization, as at the time it was written last summer we were not aware of any golang avro implementations.

Unfortunately, we no longer use golang in our environment, so we aren't planning on updating the code generator. However, with a little work, I suspect either this project or the other golang avro project could adapt it to use golang bindings instead of the C library.

NewRecord panics when receiving an invalid schema instead of returning error

goroutine 6 [running]:
panic(0x476820, 0xc82000a0f0)
    /usr/local/Cellar/go/1.6/libexec/src/runtime/panic.go:464 +0x3e6
testing.tRunner.func1(0xc82019c870)
    /usr/local/Cellar/go/1.6/libexec/src/testing/testing.go:467 +0x192
panic(0x476820, 0xc82000a0f0)
    /usr/local/Cellar/go/1.6/libexec/src/runtime/panic.go:426 +0x4e9
github.com/linkedin/goavro.NewRecord(0xc8200519f8, 0x1, 0x1, 0xc8203a79e0, 0x0, 0x0)
    $GOPATH/src/github.com/linkedin/goavro/record.go:148 +0x11a

performance questions

I'm using the goavro package with Kafka and seeing lower performance than hoped for.

In my setup, if I pull just the Kafka messages and do not call the Avro decode function, my process can consume 200k msgs/sec. This is an average over multiple millions of messages. Each message is about 130 bytes. My schema has 21 fields (a mix of fixed, long, and string), and if I attempt to decode all 21 fields, my processing rate falls to about 35k msgs/sec (again averaged over millions of msgs). Ouch.

If I decode only the first five fields, the code can process about 115k msgs/sec.

Basically, as I increase the number of schema fields to be decoded, I see reduced performance. OK, I get this, but I'm surprised by the size of the performance hit.

Is this expected?

I'm using the code in the decodeRecord example, namely:

   encoded :=  // message from kafka
bb := bytes.NewBuffer(encoded)
decoded, err := codec.Decode(bb)

These measurements don't include any processing beyond the codec.Decode(bb) line above.

Are there ways to improve the performance I'm seeing from the native avro decode mechanism?

Thanks

Flume and Avro use netty framework

I want to use the Avro lib to push to an Avro Flume log centraliser, but I have an error on Flume.

Flume can't decode the Avro message because it's not a Netty call.

Resources:
netty.io

Refocus readme to usage of goavro rather than 2.0 justification

I think the readme would be much more useful if more key examples were added (I think the OCF reader should be on the front page). I think the 2.0 changes should be at the bottom.

As a side note, I think the library could be less confusing if the concept of "native Go" data were defined more prominently, and perhaps toned down in the actual API. For me the largest use case is to scan into a struct and vice versa. I don't see people using the native Go map to access data.

Just some constructive criticism for you to use as you'd like.

go test fails

$ go test
--- FAIL: TestCodecDecoderUnionErrorYieldsName (0.00s)
        helpers_test.go:70: Actual: "cannot encode union (union): index must be between 0 and 1; read index: 2"; Expected to contain: "union (flubber)"
--- FAIL: TestFuzz_Panics (0.00s)
        fuzz_test.go:76: Error returned during fuzzing crasher[0]: cannot decode record (com.miguno.avro.twitter_schema): cannot decode string: cannot decode long: EOF
--- FAIL: TestNameWithDots (0.00s)
        name_test.go:60: The name portion of a fullname, record field names, and enum symbols must have second and remaining characters contain only [A-Za-z0-9_]
--- FAIL: TestRecordGetFieldSchema (0.00s)
        helpers_test.go:70: Actual: "The name portion of a fullname, record field names, and enum symbols must have second and remaining characters contain only [A-Za-z0-9_]"; Expected to contain: "no such field: no-such-field"
FAIL
exit status 1
FAIL    github.com/linkedin/goavro      0.015s

json marshal example

How do I read an Avro file and then marshal it into JSON without the extra metadata?

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/linkedin/goavro"
)

func main() {
	//var i interface{}
	fmt.Println("ok")

	fh, err := os.Open("/home/user/avro/zip.avro")
	if err != nil {
		log.Fatal(err)
	}

	fr, err := goavro.NewReader(goavro.BufferFromReader(io.Reader(fh)))
	if err != nil {
		log.Fatal(err)
	}

	enc := json.NewEncoder(os.Stdout)
	for fr.Scan() {
		d, e := fr.Read()
		if e != nil {
			log.Println(e)
		}
		enc.Encode(d)
	}

}


The data looks like this

[{"Name":"com.test.avro.MyClass","Fields":[{"Name":"com.test.avro.ZIP","Datum":"00601"},{"Name":"com.test.avro.LAT","Datum":18.180555}

How can I get

[{"ZIP":"00601","LAT":18.180555,"LNG":-66.749961},{"ZIP":"00602","LAT":18.361945,"LNG":-67.175597},{"ZIP":"00603","LAT":18.455183,"LNG":-67.119887},{"ZIP":"00606","LAT":18.158345,"LNG":-66.932911}

which is the original JSON file. This works in Java.

Default value is required to be present in the enum list

The error default value ought to encode using field schema is thrown when trying to create a codec (for decoding purposes) from a schema whose default value is not present in the enum's symbol list. The Avro specification doesn't impose this restriction on the default value, yet this library can't cope with such a schema.

Example schema:

{
  "type": "record",
  "name": "person",
  "fields" : [
    {
      "name": "sex",
      "type": "enum",
      "default": "unknown",
      "symbols": [
        "male",
        "female"
      ]
    }
  ]
}

Error on goavro.NewCodec call:

Record "person" field "sex": default value ought to encode using field schema: cannot encode binary enum "sex": value ought to be member of symbols: [male female]; "unknown"

NewCodec triggers the race detector when run on different goroutines

Run the following test with -race on:

func TestRaceCodecConstruction(t *testing.T) {

    comms := make(chan []byte, 1000)
    done := make(chan error, 10)

    go func() {
        recordSchemaJSON := `{"type": "long"}`
        codec, _ := NewCodec(recordSchemaJSON)
        for i := 0; i < 10000; i++ {

            bb := new(bytes.Buffer)
            if err := codec.Encode(bb, int64(i)); err != nil {
                done <- err
                return
            }

            comms <- bb.Bytes()
        }

        close(comms)
    }()

    go func() {
        recordSchemaJSON := `{"type": "long"}`
        codec, _ := NewCodec(recordSchemaJSON)

        i := 0
        for encoded := range comms {
            bb := bytes.NewBuffer(encoded)
            decoded, err := codec.Decode(bb)
            if err != nil {
                done <- err
                return
            }
            result := decoded.(int64)
            if result != int64(i) {
                done <- fmt.Errorf("didnt match %v %v", i, result)
                return
            }

            i++
        }

        close(done)
    }()

    err := <-done
    if err != nil {
        t.Fatal(err)
    }
}

This will trigger a race detected.:

==================
WARNING: DATA RACE
Write by goroutine 51:
  github.com/linkedin/goavro.NewCodec()
      ~/go/src/github.com/linkedin/goavro/codec.go:195 +0x5f9
  github.com/linkedin/goavro.TestRaceCodecConstruction.func1()
      ~/go/src/github.com/linkedin/goavro/race_test.go:16 +0x6c

Previous write by goroutine 52:
  github.com/linkedin/goavro.NewCodec()
      ~/go/src/github.com/linkedin/goavro/codec.go:195 +0x5f9
  github.com/linkedin/goavro.TestRaceCodecConstruction.func2()
      ~/go/src/github.com/linkedin/goavro/race_test.go:33 +0x69

Goroutine 51 (running) created at:
  github.com/linkedin/goavro.TestRaceCodecConstruction()
      ~/go/src/github.com/linkedin/goavro/race_test.go:29 +0xa4
  testing.tRunner()
      ~/opt/go/src/testing/testing.go:473 +0xdc

Goroutine 52 (running) created at:
  github.com/linkedin/goavro.TestRaceCodecConstruction()
      ~/go/src/github.com/linkedin/goavro/race_test.go:53 +0xd0
  testing.tRunner()
      ~/opt/go/src/testing/testing.go:473 +0xdc
==================

Nothing jumps out at me as being particularly troublesome in that code for concurrent use, but the init section with static codecs is potentially an issue.

In practice, I've also been able to get the race detector to trigger when codecs were being created in one goroutine while another codec was being used in a second goroutine to decode records. It pointed at line 595 of codec.go.

Inconsistency in decoding

I am getting weird (mixed up) results while decoding an avro schema.

This is what my schema looks like:

{
  "type": "record",
  "name": "mail",
  "doc": "Mail Data",
  "namespace": "co.getstrike",
  "fields": [
    {
      "name": "emailId",
      "type": "string"
    },
    {
      "name": "password",
      "default": null,
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "refreshToken",
      "default": null,
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "provider",
      "default": "google",
      "type": "string"
    },
    {
      "name": "providerUrl",
      "default": "imap.gmail.com:993",
      "type": "string"
    }
  ]
}

I am producing using the kafka avro console producer

./kafka-avro-console-producer --broker-list localhost:29092 --topic imapSync --property value.schema='{"type":"record","name":"mail","doc":"Mail Data","namespace":"co.getstrike","fields":[{"name":"emailId","type":"string"},{"name":"password","default":null,"type":["null","string"]},{"name":"refreshToken","default":null,"type":["null","string"]},{"name":"provider","default":"google","type":"string"},{"name":"providerUrl","default":"imap.gmail.com:993","type":"string"}]}'

The data I am sending

{"emailId":"[email protected]","password":{"string":"xxx"},"refreshToken":null,"provider":"google","providerUrl":"imap.gmail.com:993"}

This is the output I am getting from kafka-avro-consumer

{"emailId":"[email protected]","password":{"string":"xxx"},"refreshToken":null,"provider":"google","providerUrl":"imap.gmail.com:993"}

Which is correct.

However, decoded data from goavro looks pretty weird.

My code

codec, err := goavro.NewCodec(schema.Schema)
decoded, _, err := codec.NativeFromBinary(val)

msg, ok := decoded.(map[string]interface{})
if !ok {
  log.Println("Could not cast msg to map")
} else {
  log.Println(msg, ok)
  for k, v := range msg {
    log.Println(k, v)
  }
}

Output

2017/11/23 18:19:52 map[providerUrl:&tester@get emailId: password:<nil> refreshToken:<nil> provider:] true
2017/11/23 18:19:52 emailId
2017/11/23 18:19:52 password <nil>
2017/11/23 18:19:52 refreshToken <nil>
2017/11/23 18:19:52 provider
2017/11/23 18:19:52 providerUrl &tester@get

Not sure what I am doing wrong here.

Error encoding map[string]float64 while schema is a map

I attempted to encode a field with map[string]float64 type while schema for that field is:
{"type": "map", "values": "double"}
And I got the following error:

encode record (RECORD): cannot encode record (RECORD): cannot encode union (union): datum ought match schema: expected: null, map; received: map[string]float64

It seems that goavro only supports map[string]interface{}, but that's not always convenient. On the contrary, if the map's value type is float64, we know that all values fit the schema, so there is no need to check individual map values, which might be slightly more efficient.

Use goavro with schema-registry error Unknown magic byte!

Hi, I'm curious whether this lib supports Kafka Connect + Schema Registry.
I've tried to encode and send to Kafka topics, but Kafka Connect receives the message with the error Unknown magic byte!. Please give me more info about how to use goavro with them.

Message Framing

Hi Team,

The usecase is: I want to send multiple avro objects in an HTTP call.

Typically I want to bundle, say, 10 objects, compress them using Snappy or gzip, and send them over.

Is there any feature for this directly in goavro, or would I have to implement my own ad-hoc message framing?

latest revision broken with {"type": "fixed"} field data type

Getting this error when decoding an Avro record with a "fixed" UUID field using the latest revision (e790a7b):

non-recognised non-scalar(?) type goavro.Fixed for field.Datum (c): goavro.Fixed{Name:"com.xxxxxxx.xxxx.UUID", Value:[]uint8{0x12, 0x7f, 0xe9, 0xc0, 0x3b, 0x59, 0x41, 0xf5, 0x93, 0x6d, 0x77, 0x75, 0xeb, 0x84, 0xb3, 0xc7}}

This is the relevant section of the Avro schema:

            {
            "name": "messageId",
            "type": {
              "type": "fixed",
              "name": "UUID",
              "namespace": "com.xxxxxxx.xxxx",
              "size": 16
            },
            "doc": "A unique identifier for the message"
          },

Simplify nullable JSON field encoding and decoding

I often use nullable fields in Avro. For that I have to use unions like:
"field" : ["null","string"]

The JSON input and output looks like this:
{ "field": { "string": "value" } }

I would prefer to also support this format (encode and decode):
{ "field": "value" }

This can be done for all [null, TYPE] unions. For others (e.g. [null, "string", "bytes"]) there should be no change.

You can find a discussion about this here: https://issues.apache.org/jira/browse/AVRO-1582
I currently use this patch for my Java application, but I want to migrate to golang.

No support for Type references / recursion ?

I'm trying to use the type Error in a nested field called cause:

package main

import (
	"bytes"
	"log"

	"github.com/linkedin/goavro"
)

func main() {

	recordSchemaJSON := `{
    "name": "Packet",
    "type": "record",
    "fields": [{
        "name": "error",
        "default": {},
        "type": [{
            "name": "Error",
            "type": "record",
            "fields": [{
                "name": "cause",
                "type": ["null", "Error"],
                "default": null
            }]
        }]
    }]
}]
}`

	codec, err := goavro.NewCodec(recordSchemaJSON)
	if err != nil {
		log.Fatal(err)
	}

	record, err := goavro.NewRecord(goavro.RecordSchema(recordSchemaJSON))
	if err != nil {
		log.Fatal(err)
	}

	bb := new(bytes.Buffer)
	if err = codec.Encode(bb, record); err != nil {
		log.Fatal(err)
	}

}

Error

E:\Repositorys\hemera-golang>go run main.go
2017/01/30 21:01:17 cannot build record (null namespace): record field ought to be codec: {name:map[] nullCodec:0xc042056960 booleanCodec:0xc0420569c0 intCodec:0xc042056a20 longCodec:0xc042056900 floatCodec:0xc042056a80 doubleCodec:0xc042056ae0 b
ytesCodec:0xc042056b40 stringCodec:0xc042056ba0}: cannot build union (null namespace): member ought to be decodable: %!s(MISSING): cannot build record (null namespace): record field ought to be codec: {name:map[] nullCodec:0xc042056960 booleanCod
ec:0xc0420569c0 intCodec:0xc042056a20 longCodec:0xc042056900 floatCodec:0xc042056a80 doubleCodec:0xc042056ae0 bytesCodec:0xc042056b40 stringCodec:0xc042056ba0}: cannot build union (null namespace): member ought to be decodable: %!s(MISSING): cann
ot build unknown: unknown type name: Error
exit status 1

I can also reproduce it with the example on the Avro specification page; see the Apache Avro "Complex Types" section.

Example

package main

import (
	"bytes"
	"log"

	"github.com/linkedin/goavro"
)

func main() {

	recordSchemaJSON := `{
  "type": "record",
  "name": "LongList",                  
  "fields" : [
    {"name": "next", "type": ["null", "LongList"]}
  ]
}`

	codec, err := goavro.NewCodec(recordSchemaJSON)
	if err != nil {
		log.Fatal(err)
	}

	record, err := goavro.NewRecord(goavro.RecordSchema(recordSchemaJSON))
	if err != nil {
		log.Fatal(err)
	}

	record.Set("next", "null")

	bb := new(bytes.Buffer)
	if err = codec.Encode(bb, record); err != nil {
		log.Fatal(err)
	}

}

Error

E:\Repositorys\hemera-golang>go run main.go
2017/01/30 21:13:04 cannot build record (null namespace): record field ought to be codec: {name:map[] nullCodec:0xc042090450 booleanCodec:0xc0420904b0 intCodec:0xc042090510 longCodec:0xc0420903f0 floatCodec:0xc042090570 doubleCodec:0xc0420905d0 b
ytesCodec:0xc042090630 stringCodec:0xc042090690}: cannot build union (null namespace): member ought to be decodable: %!s(MISSING): cannot build unknown: unknown type name: LongList
exit status 1

Encoding avro schema of type enum

Hi,

I know this is not the best place to ask this question, but as I couldn't find many other resources for goavro, I thought I'd ask it here.

I am trying to encode a schema where one of the fields has type enum. Following is the schema:

var schema = `{
  "type": "record",
  "name": "Event",
  "namespace": "com.avro",
  "fields": [
    {
      "name": "timestamp",
      "type": "long"
    },
    {
      "name": "category",
      "type":  {
        "namespace": "com.avro",
            "name": "EventCategory",
            "type": "enum",
            "symbols": ["OptionA", "OptionB", "OptionC", "OptionD"]
         }
    }
  ]
}
`

I was able to encode the timestamp field, but couldn't figure out how to encode the category field, as its type is enum. I followed the steps below.

Step 1:

record, err := goavro.NewRecord(goavro.RecordSchema(schema))
if err != nil {
    log.Error("NewRecord failed ->", err.Error())
    return fmt.Errorf("NewRecord failed -> %v", err.Error())
}

Step 2:

err := record.Set("timestamp", int64(123))
if err != nil {
    log.Error("timestamp Set Error ->", err.Error())
    return fmt.Errorf("timestamp Error -> %v", err)
}

Step 3: I need to set category, but I wasn't able to figure out how to set it.

Any help or suggestion is really appreciated.

Thanks

Exception while trying to deserialize in Java

Exception in thread "main" org.apache.avro.SchemaParseException: "record" is not a defined name. The type of the "impression" field must be a defined name or a {"type": ...} expression.
at org.apache.avro.Schema.parse(Schema.java:1199)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)

Is the issue related to:
http://stackoverflow.com/questions/11764287/how-to-nest-records-in-an-avro-schema

My inner Schema:
{
  "name": "impression",
  "namespace": "phoenix.adengine",
  "doc": "The data which is common to the ad-request",
  "type": "record",
  "fields": [
    {
      "name": "pubID",
      "type": "int",
      "doc": "Publisher ID",
      "default": 0
    },
    {
      "name": "ts",
      "type": "long",
      "doc": "Time Stamp",
      "default": 0
    },
    {
      "name": "pURL",
      "type": "string",
      "doc": "Page URL",
      "default": ""
    },
    {
      "name": "dynamicPURL",
      "type": "string",
      "doc": "Dynamic Page URL",
      "default": ""
    },
    {
      "name": "responseFmt",
      "type": "int",
      "doc": "Response format",
      "default": ""
    }
  ]
}

Why are tests in their own package?

By being in their own package, the tests import the goavro package explicitly by pointing at the linkedin path. When working in a fork this is very problematic.

I've not seen that pattern in other Go projects, so I'm struggling to see what benefits outweigh the downsides.

goavro cannot encode unions that include enums

Error:

cannot encode record (test_record):
    cannot encode union (union):
        datum ought match schema: expected: null, color_enum; received: string

Schema:

{
    "type": "record",
    "name": "test_record",
    "fields": [
        {
            "type": {
                "type": "enum",
                "name": "color_enum",
                "symbols": [
                    "red",
                    "blue",
                    "green"
                ]
            },
            "name": "color1"
        },
        {
            "name": "color2",
            "type": ["null", "color_enum"]
        }
    ]
}

Data (JSON representation here, but passed in goavro.Record when testing):

{
    "color1": "red",
    "color2": "blue"
}

avro files are unreadable

Hey,

Starting with commit c76fd5f, the output files are unreadable by avro-tools-1.8.1.jar (http://www-us.apache.org/dist/avro/avro-1.8.1/java/avro-tools-1.8.1.jar); the following exception is thrown:

Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.EOFException
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:222)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:76)
at org.apache.avro.tool.Main.run(Main.java:87)
at org.apache.avro.tool.Main.main(Main.java:76)
Caused by: java.io.EOFException
at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:259)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)

Also, sourcing data from the Avro file into Google's BigQuery table fails with the following error:

Errors:
gs://some_bucket/some_avro_file.avro: The Apache Avro library failed to read data with the follwing error: EOF reached (error code: invalid)

Test function we used to reproduce the issue:


type someKey struct {
        str1, str2, str3, str4 string
}

type someValue struct {
        someVal1               int64
}

func TestCreateFile(t *testing.T) {
	testData := []someKey{
		someKey{str1: `1`, str2: `1`, str3: `1`, str4: `1`},
		someKey{str1: `2`, str2: `2`, str3: `2`, str4: `2`},
		someKey{str1: `3`, str2: `3`, str3: `3`, str4: `3`},
		someKey{str1: `4`, str2: `4`, str3: `4`, str4: `4`},
	}
	codec, _ := goavro.NewCodec(schema)
	mp := make(map[someKey]*someValue)
	for r := 0; r < 100000; r++ {
		for i := range testData {
			d := testData[i]
			d.str1 = d.str1 + strconv.Itoa(r)
			mp[d] = &someValue{someVal1: 1}
		}
	}

	m := `test-` + time.Now().Format("200601020405")
	parseToAVRO(mp, m, codec, `/tmp/`+m+`.avro`)

}

thanks.

Support for optional field

Does goavro support optional fields? I am getting the error

field has no data and no default set

when I set the field to the following:
{
"name": "device",
"type":[ "null", "SomeType"]
}

Schema with union and default value of "null" fails to encode causing NewCodec to error

The Union guard deep in record.makeRecordCodec fails to transform the named "null" to a nil since the datum supplied is not truly sourced from a binary but from the codec itself (and the default value is literally the string "null").

This schema is valid per the Avro spec and is being used in production with other Avro serializers.

func TestDefaultValueOughtToEncodeUsingFieldSchemaOK(t *testing.T) {

	schema :=`
	{
	  "namespace": "universe.of.things",
	  "type": "record",
	  "name": "Thing",
	  "fields": [
		{
		  "name": "attributes",
		  "type": [
			"null",
			{
			  "type": "array",
			  "items": {
				"namespace": "universe.of.things",
				"type": "record",
				"name": "attribute",
				"fields": [
				  {
					"name": "name",
					"type": "string"
				  },
				  {
					"name": "value",
					"type": "string"
				  }
				]
			  }
			}
		  ],
		  "default": "null"
		}
	  ]
	}`
	
	_, err := goavro.NewCodec(schema)
	if err != nil {
		t.Error(err)
	}

}

CRC32 checksums missing from Snappy encoding?

The Avro spec specifies that Snappy encoded file data blocks should contain a checksum after each block:

The "snappy" codec uses Google's Snappy compression library. Each compressed block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block.

As far as I can tell, that's not being added now (snappy.Encode does not add a checksum, though the Writer interface of the library would). It works fine if you're both reading and writing using goavro, but it breaks interoperability with other tools using the Avro format that expect a checksum.

(it's possible I might just be missing the check-summing, and my difficulties with interoperability were elsewhere)

Support type nesting in schema definitions

Hi! Are there any plans to support type nesting in schema definitions, similar to this example below (taken from SO)? See references to com.company.model.AddressRecord:

{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "zip",
            "type": "string"
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "phone",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer2",
    "fields": [
        {
            "name": "x",
            "type": "string"
        },
        {
            "name": "y",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}
]

Error while ignoring field

When I set an int field with a default value and choose to omit it, I get the following error:

{
    "name": "ID",
    "type": "int",
    "doc" : "ID of the system",
    "default": 0
}

cannot encode int: expected: int32; actual: float64

Read from Reader does not move forward after an error

If a schema does not specify a default value and the record also fails to provide the value, the Read() method returns an error: cannot decode long: EOF.

Problem: fr.Scan() detects that more records exist, but fr.Read() resumes from the point of the error rather than skipping the erroneous field. As a result, the second record is unreadable.

Example:

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"

    "github.com/linkedin/goavro"
)

func main() {
    w := new(bytes.Buffer)
    makeSomeData(w)
    fmt.Println("Data written")

    dumpReader(w)
}

func makeSomeData(w io.Writer) error {
    recordSchema := `
    {
      "type": "record",
      "name": "example",
      "namespace": "test",
      "fields": [
        {
          "type": "string",
          "name": "username"
        },
        {
          "type": "string",
          "name": "comment"
        },
        {
          "type": "long",
          "name": "timestamp"
        }
      ]
    }
    `
    fw, err := goavro.NewWriter(
        goavro.BlockSize(13),                         // example; default is 10
        goavro.Compression(goavro.CompressionSnappy), // default is CompressionNull
        goavro.WriterSchema(recordSchema),
        goavro.ToWriter(w))
    if err != nil {
        log.Fatal("cannot create Writer: ", err)
    }
    defer fw.Close()

    // make a record instance using the same schema
    someRecord, err := goavro.NewRecord(goavro.RecordSchema(recordSchema))
    if err != nil {
        log.Fatal(err)
    }
    // identify field name to set datum for
    someRecord.Set("username", "Aquaman")
    someRecord.Set("comment", "The Atlantic is oddly cold this morning!")
    // you can fully qualify the field name
    // someRecord.Set("test.timestamp", int64(1082196484))
    fw.Write(someRecord)

    // make another record
    someRecord, err = goavro.NewRecord(goavro.RecordSchema(recordSchema))
    if err != nil {
        log.Fatal(err)
    }
    someRecord.Set("username", "Batman")
    someRecord.Set("comment", "Who are all of these crazies?")
    someRecord.Set("test.timestamp", int64(1427383430))
    fw.Write(someRecord)

    return nil
}

func dumpReader(r io.Reader) {
    fr, err := goavro.NewReader(goavro.BufferFromReader(r))
    if err != nil {
        log.Fatal("cannot create Reader: ", err)
    }
    defer func() {
        if err := fr.Close(); err != nil {
            log.Fatal(err)
        }
    }()

    for fr.Scan() {
        datum, err := fr.Read()
        if err != nil {
            log.Println("cannot read datum: ", err)
            continue
        }
        fmt.Println(datum)
    }
}

Execution:

$ go run encodeRecord.go
Data written
2016/10/03 23:59:08 cannot read datum:  cannot decode record (test.example): cannot decode long: EOF
2016/10/03 23:59:08 cannot read datum:  cannot decode record (test.example): cannot decode string: cannot decode long: EOF

The second error is incorrect: the long has been set on the second record.

The concern here is that a single invalid record prevents reading all subsequent ones.
