Pythia

A Java library that produces an automated statistical profile of an input dataset.

A standard dataset is a plain text file in which each line is a record and the fields of a record are separated by a delimiter (e.g. tab, comma, or pipe). After you register a dataset and declare which data analysis methods should be executed, the system automatically produces a statistical profile of the dataset and generates reports of the findings.
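
To make the expected input format concrete, here is a minimal sketch of how one record of such a file breaks down into fields. The class and method names are hypothetical and only illustrate the file format; this is not how Pythia itself reads datasets.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Pythia API.
final class DelimitedRecordSketch {

    /** Splits one line of a delimited text file into its fields. */
    static List<String> splitRecord(String line, String separator) {
        // Pattern.quote treats the separator literally (a pipe is a regex metacharacter).
        // The -1 limit keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        // A tab-separated record with three fields.
        System.out.println(splitRecord("Michael\t25\t20", "\t"));
        // A pipe-separated record whose last field is empty.
        System.out.println(splitRecord("John|21|", "|"));
    }
}
```

Quoting the separator matters for delimiters such as the pipe, which would otherwise be interpreted as a regular expression alternation.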

Setup


IntelliJ IDEA Installation Requirements

  • Install IntelliJ IDEA (the Community edition is free).
  • Import the project as a Maven project; it runs out of the box.

Eclipse Installation Requirements

  • Install Eclipse
  • Import the project as a Maven project.

Maven

The project uses a Maven wrapper, so there is no need to install Maven on your system, as long as the JAVA_HOME environment variable points to your Java 8 installation folder.

🛠️ Build with Maven


Navigate to the root folder of the repo and run:

./mvnw clean install

Wait for the build to finish.

After that, there should be a folder called target that includes two jar files:

Pythia-x.y.z-all-deps.jar and Pythia-x.y.z.jar

The difference is that the all-deps jar is an uber jar (all dependencies are embedded in it), so you can import Pythia into a project and run it out of the box.

  • If you use Pythia-x.y.z.jar instead, you will need to declare Pythia's dependencies in your pom.xml file.

To run the driver Main method, navigate to the root folder of the repo and run:

java -jar target/Pythia-x.y.z-all-deps.jar

🧪 Run tests


Navigate to the root folder of the repo and run:

./mvnw test

Code Formatter


This project follows Google's Java coding style and is formatted with the official Google Java formatter. You can follow the installation guide in the formatter's official GitHub repo to install it in your editor.

Note: Consider installing and running it so that the project keeps a consistent coding style.

To format all Java files from the command line, run the following in the root folder of the project:

java -jar google-java-format-x.y.z-all-deps.jar -i $(find . -type f -name "*.java")

Note: The formatter needs a Java 11 installation to run from the command line.

Usage


Suppose we want to generate a statistical profile of the following file:

name age money
Michael 25 20
John 21 15
Andy 30 1000
Justin 65 10000

Below is a sample Main class that showcases API usage in simple steps for the above dataset:

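import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// AnalysisException and the schema types are assumed to come from Apache Spark SQL.
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// The Pythia classes used below (IDatasetProfiler, IDatasetProfilerFactory, Rule, RuleSet, etc.)
// must also be imported from the Pythia packages.

public class Main {
    public static void main(String[] args) throws AnalysisException, IOException {
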
        // 1. Initialize a DatasetProfiler object (this is the main engine interface of Pythia).
        IDatasetProfiler datasetProfiler = new IDatasetProfilerFactory().createDatasetProfiler();

        // 2. Specify the schema, an alias and the path of the input dataset.
        StructType schema =
                new StructType(
                        new StructField[]{
                                new StructField("name", DataTypes.StringType, true, Metadata.empty()),
                                new StructField("age", DataTypes.IntegerType, true, Metadata.empty()),
                                new StructField("money", DataTypes.IntegerType, true, Metadata.empty()),
                        });
        String alias = "people";
        String path = String.format(
                "src%stest%sresources%sdatasets%speople.json",
                File.separator, File.separator, File.separator, File.separator);
        
        // 3. Register the input dataset specified in step 2 into Pythia.
        datasetProfiler.registerDataset(alias, path, schema);

        // 4. Specify labeling rules for a column and a name for the new labeled column.
        List<Rule> rules =
                new ArrayList<Rule>(
                        Arrays.asList(
                                new Rule("money", LabelingSystemConstants.LEQ, 20, "poor"),
                                new Rule("money", LabelingSystemConstants.LEQ, 1000, "mid"),
                                new Rule("money", LabelingSystemConstants.GT, 1000, "rich")));
        String newColumnName = "money_labeled";
        
        // 5. Create a RuleSet object and compute the new labeled column
        // (steps 4 & 5 can be repeated multiple times).
        RuleSet ruleSet = new RuleSet(newColumnName, rules);
        datasetProfiler.computeLabeledColumn(ruleSet);
        
        // 6. Specify the DominanceColumnSelectionMode and (optionally) a list of 
        // measurement & coordinate columns used in dominance pattern identification.
        DominanceColumnSelectionMode mode = DominanceColumnSelectionMode.USER_SPECIFIED_ONLY;
        String[] measurementColumns = new String[] { "money", "age" };
        String[] coordinateColumns =  new String[] { "name" };
        
        // 7. Declare the specified dominance parameters into Pythia
        // (steps 6 & 7 are optional, however, they are a prerequisite for highlight patterns identification).
        datasetProfiler.declareDominanceParameters(mode, measurementColumns, coordinateColumns);

        // 8. Specify the auxiliary data output directory and the desired parts of the analysis procedure
        // that should get executed for the computation of the dataset profile.
        String auxiliaryDataOutputDirectory = "results";
        boolean shouldRunDescriptiveStats = true;
        boolean shouldRunHistograms = true;
        boolean shouldRunAllPairsCorrelations = true;
        boolean shouldRunDecisionTrees = true;
        boolean shouldRunHighlightPatterns = true;
        
        // 9. Create a DatasetProfilerParameters object with the parameters specified in step 8
        // and compute the profile of the dataset (this will take a while for big datasets).
        DatasetProfilerParameters parameters = new DatasetProfilerParameters(
                auxiliaryDataOutputDirectory,
                shouldRunDescriptiveStats,
                shouldRunHistograms,
                shouldRunAllPairsCorrelations,
                shouldRunDecisionTrees,
                shouldRunHighlightPatterns);
        datasetProfiler.computeProfileOfDataset(parameters);

        // 10. (Optionally) specify an output directory path for the generated reports
        // (unspecified output directory path means that the reports will be generated under the 
        // auxiliary data output directory specified in step 8).
        String outputDirectoryPath = "";
        
        // 11. Generate a report in plain text and markdown format.
        datasetProfiler.generateReport(ReportGeneratorConstants.TXT_REPORT, outputDirectoryPath);
        datasetProfiler.generateReport(ReportGeneratorConstants.MD_REPORT, outputDirectoryPath);
    }
}
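
The labeling rules in step 4 can be read as thresholds on the money column. The following is a minimal sketch of the labels they would yield for the sample rows, assuming the rules are evaluated in the listed order and the first matching rule wins. That ordering is an assumption, since the rules for LEQ 20 and LEQ 1000 overlap and this README does not state how Pythia resolves overlaps. The class below is a hypothetical stand-in and does not use Pythia's Rule or RuleSet classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the three rules of step 4; not part of the Pythia API.
final class LabelingSketch {

    /** Labels a money value, assuming first-match-wins evaluation of the rules. */
    static String label(int money) {
        if (money <= 20) {
            return "poor"; // money LEQ 20
        }
        if (money <= 1000) {
            return "mid"; // money LEQ 1000
        }
        return "rich"; // money GT 1000
    }

    public static void main(String[] args) {
        // The name and money columns of the sample dataset, in file order.
        Map<String, Integer> moneyByName = new LinkedHashMap<>();
        moneyByName.put("Michael", 20);
        moneyByName.put("John", 15);
        moneyByName.put("Andy", 1000);
        moneyByName.put("Justin", 10000);

        // Print the value that the new money_labeled column would hold per row.
        moneyByName.forEach(
                (name, money) -> System.out.println(name + " -> " + label(money)));
    }
}
```

Under this assumption, the boundary values 20 and 1000 fall into "poor" and "mid" respectively, because both rules use less-than-or-equal comparisons.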

Contributors

  • Alexiou Alexandros
  • Charisis Alexandros (Youtube)
  • Christodoulos Antoniou
  • Dimos Gkitsakis
  • George Karathanos
  • Lampros Vlachopoulos
  • Panos Vassiliadis
