
convertor's Introduction

trans

trans is a command line tool that converts Tor network server descriptors from The Tor Project's own serialization to more popular formats like JSON and Parquet, thereby making them readily accessible to analytics tools like Tableau, MongoDB or Apache Spark without resorting to specialized libraries.

Background

Descriptors are emitted by servers in the Tor network every hour. They are collected by the CollecTor service and published from there in hourly, daily or monthly archives. They are sanitized to preserve the privacy properties of the Tor network but still provide a lot of insight into the workings of Tor that is valuable for administration, development and research. The format of these descriptors is optimized for size to allow for efficient transfer over the network. Specialized libraries like metrics-lib (Java) or stem (https://stem.torproject.org/) (Python) are needed to access them, which makes them hard to use for non-programmers. Additionally, since the data format is homegrown, it is not possible to work on the descriptors with run-of-the-mill data science and big data analytics software like Apache Spark, Tableau or the popular MongoDB NoSQL database.

How to build

On the shell, navigate to the project directory (the same level as build.xml).
To build an executable jar complete with merged dependencies and schemata, just enter ant. Look into ./dist for the result. To get everything, enter:

ant clean jar bundle

How to use

On the shell enter:

java -jar trans.jar -h

to get a rundown of available options. A call to convert some descriptors might look like this:

java -jar trans.jar -i=/collector -o=/json -l=/log -cz -g &> log.txt

Note the addition of &> log.txt, which redirects console output, including error messages, to a file.

A reasonable setup could be a working directory - let's call it work - in which you put trans.jar and two subdirectories: in with the descriptors to convert and out for the converted descriptors.
Maybe you need them as JSON, pretty printed (because you want to have a look at them yourself), so run:

java -jar trans.jar -i=in/ -o=out/ -p

If you want to start working with them in MongoDB right away, omit the pretty printing; you are also well advised to use compression, e.g. like this:

java -jar trans.jar -i=in/ -o=out/ -cz

Or you need them all as compressed Parquet files to work on them with Spark, plus logging enabled to make sure you don't miss anything:

java -jar trans.jar -i=in/ -o=out/ -f=parquet --snappy --log

Possible parameters are:

-f    --format     <arg>    default: json, optional: parquet, avro
-s    --suffix              a suffix to the file name                                     
-i    --inPath     <arg>    default: current working directory    
-o    --outPath    <arg>    default: current working directory    
-l    --logsPath   <arg>    default: current working directory    
-cs   --snappy              a compression format popular with Parquet
-cz   --zip                 compressing Avro as BZip2, Parquet & JSON as GZip 
-p    --pretty              pretty printed JSON                   
-m    --maxFiles   <arg>    max files to be opened, default: 20                           
-d    --debug               print JSON descriptors to console     
-g    --log                 log to file 'trans.log'           

Caveats

Data might contain duplicate entries; see CollecTor for more details. Pretty printed JSON is primarily intended for human consumption and debugging. Much software requires every JSON record on a single line (e.g. Spark otherwise can't ingest it, while MongoDB won't complain).
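
For illustration (the field names here are invented, not the converter's actual output), a pretty printed record spans several lines:

  {
    "descriptor_type": "server-descriptor",
    "published": "2015-12-31 19:30:53"
  }

while the same record on a single line is what Spark and most bulk loaders expect:

  {"descriptor_type":"server-descriptor","published":"2015-12-31 19:30:53"}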

TODO

The most important issues:

  • make the timespan that converted data files cover configurable
  • a flattened version, less faithful to the structure of the spec
  • tests (dependent on test descriptors from the Tor metrics team)

For a complete overview see TODO.md.

Code overview

Base is the class that contains the main method.

The main method initializes a new Base base, which in turn initializes the Args and Writers singletons. Args holds default arguments like the input directory, output format etc. and evaluates the command line arguments. Writers stores fileWriters per descriptor type and month, initializing new fileWriters on demand according to the type and date of the incoming descriptor and the configured format.

After initialization, main calls runConversion() on the base just created, which iterates over the incoming descriptors, converts each according to its type, gets the appropriate fileWriter from Writers and appends the converted descriptor to it.
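
A minimal sketch of this flow (the signatures, the helpers readDescriptors() and converterFor(), and the iteration source are assumptions; only the class and method names are taken from this overview):

  public class Base {

    public static void main(String[] args) {
      Base base = new Base();    // also initializes the Args and Writers singletons
      base.runConversion();
    }

    void runConversion() {
      for (Descriptor descriptor : readDescriptors(Args.inPath())) {  // assumed helper
        Convert converter = converterFor(descriptor);  // e.g. ConvertRelay, ConvertBridge
        Object converted = converter.convert(descriptor);
        // Writers hands out one fileWriter per descriptor type and month:
        Writer writer = Writers.get(descriptor.getType(), descriptor.getMonth());
        writer.append(converted);
      }
      closeAllWriters();   // make sure everything is flushed to disk
    }
  }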

The actual converters (ConvertRelay, ConvertBridge etc.) are all subclasses of the abstract class Convert. The actual writers (WriteAvro, WriteParquet and WriteJson) are all implementations of the interface Writer. The descriptor types, their properties and methods are all defined as enums in Types.

Encoding of the converted descriptors is managed in the WriterXXXX classes and relies on the encoder schemata and autogenerated Avro classes in the package encoders.

After all descriptors are converted, closeAllWriters() is called to perform some housekeeping, making sure that all writers flush their contents to disk before the program exits.

Avro Schemata

trans uses Apache Avro to encode the descriptor data model in its JSON serialization. Java classes autogenerated from this model power the conversion from descriptor to Avro, Parquet and JSON. The schemata stored in schema/IDL/*.avdl are the source from which the JSON schemata (see schema/*.avsc) and the Java classes (see src/trans/encoders) are generated. Modifications to the data model, like new fields in descriptors, most probably require changing the IDL schemata first and then regenerating the JSON schemata and Java classes. For more details on how to do this see docs/avro.md.
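
With the stock Avro tooling, the two generation steps would look roughly like this (a sketch; the avro-tools jar and the file names are placeholders, docs/avro.md is authoritative):

  # IDL -> JSON schemata (.avsc), one file per schema
  java -jar avro-tools.jar idl2schemata schema/IDL/descriptor.avdl schema/
  # JSON schema -> Java classes
  java -jar avro-tools.jar compile schema schema/descriptor.avsc src/trans/encoders/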

convertor's People

Contributors

rat10, tomlurge


convertor's Issues

investigate mysterious IOException

This should be investigated to make sure that it really doesn't cause harm later. In Types.java:

  } catch (IOException e) { /* (Not sure why...) an IOException is always thrown for all schemata but the converter works nonetheless. */ }
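
Until the root cause is found, a minimal defensive change (a sketch, not code from the repo) would be to log the exception instead of swallowing it, so any change in behavior at least leaves a trace:

  } catch (IOException e) {
    // Known to fire for all schemata, so far without harm; log it so a
    // future change in behavior is noticed instead of silently ignored.
    System.err.println("Schema initialization threw IOException: " + e.getMessage());
  }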

use constants for string literals

String literals that appear more than once are easier to maintain as constants.
One example is the use of "json", but there are a few more (a sketch of the refactoring follows after these examples):
src/converTor/Writers.java: case ("json") : writer = new WriterJson(type, date); break;

src/converTor/Args.java: private String format = "json";

src/converTor/Args.java: else if (formatArgument.equals("json")) {
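
A minimal sketch of that refactoring (the holder class and constant names are my own, not from the repo):

  // e.g. src/converTor/Constants.java (hypothetical)
  public final class Constants {
    public static final String FORMAT_JSON = "json";
    public static final String FORMAT_PARQUET = "parquet";
    public static final String FORMAT_AVRO = "avro";
    private Constants() {}   // no instances needed
  }

  // usage, e.g. in Args.java:
  private String format = Constants.FORMAT_JSON;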

project clean-up

I noticed the attic folder. Is it a leftover from a former CVS repo?
Some code in there is described as not working or is completely commented out.
I would suggest removing this folder if it is not used anymore.

Same with some other notes and files. It is not clear which docs one should read.
Just remove old stuff and trust git to remember everything :-)

unrecognized line [tunnelled-dir-server]

 Unrecognized lines in bridge-descriptors/server-descriptors/2016-06-10-12-14-06-server-descriptors:
 [tunnelled-dir-server]

This happens when converting bridge descriptors, for example this one (a note on the likely origin of the warning follows after the descriptor):

@type bridge-server-descriptor 1.2
router orak2 10.113.214.210 53145 0 0
or-address [fd9f:2e19:3bcf::89:2281]:53145
master-key-ed25519 z0J8Zq5PcWXvc3jVo0WMeq8scYgrxEgZJ1AfNbxR+XU
platform Tor 0.2.8.0-alpha-dev on Linux
protocols Link 1 2 Circuit 1
published 2015-12-31 19:30:53
fingerprint D0A5 101B FAA5 4488 6E3E AA14 7A53 F7C0 BC44 3015
uptime 611885
bandwidth 14971520 104857600 60761
extra-info-digest B080679F850A66E158A2980C21DBA22A5E1DA0FATQK+knF6YRJ5/vZn1Bzsa8dPTKB/bJYN/r66h22s4QM
hidden-service-dir
contact somebody
ntor-onion-key v9mbXskVFrZWzW7NmNSEwZXsjxQBpO3TMhPmXzZ3gB4=
reject *:*
tunnelled-dir-server
router-digest-sha256 +XcMyHZ0uVmD+bl3N/2KlTbitmk0YWoIGMlHtrKVB0c
router-digest C66128086F0636C47FD0691DC7888A73DADBA7E6
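
The warning itself presumably comes from metrics-lib, which collects lines it cannot parse instead of failing; assuming its Descriptor interface, a consumer can surface them like this (sketch):

  import java.util.List;
  import org.torproject.descriptor.Descriptor;

  static void reportUnrecognized(Descriptor descriptor) {
    // metrics-lib keeps the lines it couldn't parse, per descriptor
    List<String> lines = descriptor.getUnrecognizedLines();
    for (String line : lines) {
      System.err.println("Unrecognized line: [" + line + "]");
    }
  }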

explicit java version

The javac task doesn't explicitly state the source and target version. Descriptor uses Java 1.7; I didn't investigate the other dependencies.

Maybe add the source="${source-and-target-java-version}" and target="${source-and-target-java-version}" attributes to the ant task and set source-and-target-java-version to the version needed, for example:
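
In build.xml that could look roughly like this (a sketch; only the property name is taken from the suggestion above, srcdir/destdir are placeholders):

  <property name="source-and-target-java-version" value="1.7"/>
  <javac srcdir="${src}" destdir="${build}"
         source="${source-and-target-java-version}"
         target="${source-and-target-java-version}"
         includeantruntime="false"/>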

add tests

It's important to add tests immediately, or even before coding.

usage info should be omitted when correct parameters are given

This is very minor, but I cannot set the severity level of the issue.
The usage info is always printed when running the jar (a sketch of a possible fix follows after the example output):

 java -jar dist/converTor.jar -p -i in

Converter from Tor CollecTor data to JSON, Parquet or Avro.
Call with parameter '-h' for help and more options.

  Conversion with arguments: -p -i in 

  Current parameters:
  -f    --format     <arg>    default: json, optional: parquet, avro   json
  -s    --suffix                                                       
  -i    --inPath     <arg>    default: current working directory       in
  -o    --outPath    <arg>    default: current working directory       
  -l    --logsPath   <arg>    default: current working directory       
  -cs   --snappy                                                       false
  -cz   --zip                 Avro as BZip2, Parquet & JSON as GZip    false
  -p    --pretty              pretty printed JSON                      true
  -m    --maxFiles   <arg>    default: 20                              20
  -d    --debug               print JSON descriptors to console        false
  -g    --log                 log to file 'converTor.log'              false
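
A possible fix (a sketch; the flag and method names are assumed, not from the repo): print the parameter table only when help is requested or argument parsing fails:

  // in the Args evaluation (hypothetical names):
  if (helpRequested || !argumentsValid) {
    printUsage();   // the parameter table shown above
  }
  // otherwise stay quiet and just run the conversion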
