
dblp's Introduction

dblp

This library was implemented to convert the DBLP data into a structured format for experimentation. It was developed to test the SENC (Seeded Estimation of Network Communities) community detection method. If you use it for scientific experiments, please cite the following paper:

@incollection{revelle2015finding,
    title={Finding Community Topics and Membership in Graphs},
    author={Revelle, Matt and Domeniconi, Carlotta and Sweeney, Mack and Johri, Aditya},
    year={2015},
    isbn={978-3-319-23524-0},
    booktitle={Machine Learning and Knowledge Discovery in Databases},
    volume={9285},
    series={Lecture Notes in Computer Science},
    doi={10.1007/978-3-319-23525-7_38},
    url={http://dx.doi.org/10.1007/978-3-319-23525-7_38},
    publisher={Springer International Publishing},
    pages={625-640}
}

Data Source

The data comes from arnetminer.org, which is maintained by a research group at Tsinghua University in China. It exists in eight different versions: seven of them can be found here, and the 8th and newest version (as of 2015-02-11) can be found here. This package was built to parse the 8th version. However, given a suitable replacement for module 1 (see below), any of the versions can be used for the subsequent transformations.

TODO: fill in stats about dataset here

  • 4 files
  • what each contains
  • which are used
  • number of papers, authors, percent with abstract, etc.

Pipeline Design

The entire pipeline is built using the luigi package, which provides a data pipeline dependency resolution scheme based on output files. Hence, there are many output files during all phases of processing. Often these are useful; sometimes they are not. Overall, luigi turned out to be a very nice package to work with. It allows each processing step to be written out as a distinct class. Each of these inherits from luigi.Task. Before running, each task checks its dependent data files. If any are absent, the tasks responsible for building them are run first. After running, each task produces one or more output files, which can then be specified as dependencies for later tasks. Hence, the generation of the entire dataset is as simple as running a task which is dependent on all the others. This task is called BuildDataset, and is present in the pipeline module.
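The file-driven dependency model described above can be illustrated in miniature. This sketch is plain Python rather than luigi itself, and the task and file names are made up for illustration; luigi adds scheduling, parameters, and targets on top of the same core idea: a task is complete when its output file exists, and incomplete dependencies are run first.

```python
import os
import tempfile

# Minimal illustration of luigi's file-driven dependency resolution
# (plain Python, not luigi; class and file names are illustrative).
class Task:
    requires = []          # tasks whose outputs this task needs
    output_path = None     # file this task produces

    def complete(self):
        # A task is done when its output file exists on disk.
        return os.path.exists(self.output_path)

    def run_pipeline(self):
        # Run unmet dependencies first, depth-first, then this task.
        for dep in self.requires:
            if not dep.complete():
                dep.run_pipeline()
        if not self.complete():
            self.run()

workdir = tempfile.mkdtemp()

class ParseCSV(Task):
    output_path = os.path.join(workdir, "paper.csv")
    def run(self):
        with open(self.output_path, "w") as f:
            f.write("id,title\n")

class BuildDataset(Task):
    output_path = os.path.join(workdir, "dataset.txt")
    requires = [ParseCSV()]
    def run(self):
        with open(self.output_path, "w") as f:
            f.write("done\n")

# Asking for the final task transitively builds everything it depends on.
BuildDataset().run_pipeline()
```

Running the last line again would be a no-op, since both output files already exist; this is also why re-running the real pipeline skips completed stages.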

How to Run the Complete Pipeline

The rest of this documentation describes exactly what the full pipeline does and how that is accomplished through several processing stages/modules. To run all stages and produce all outputs, there are three steps.

  1. Download the data files from here. I have added make targets to download and extract this data, so you can simply run make dl && make extract. This will download the data and extract it into the directory data/original-data. Note that make extract will also install a tool to unrar the rar archive; it will be placed in the working directory.
  2. Copy pipeline/config-example.py to pipeline/config.py and modify the directories so the base directory points to the top-level directory you want your data files written to. Place the files you downloaded in step 1 in the location pointed to by originals_dir. Ensure you have the following 3 files in the location of your config.originals_dir:
    • AMiner-Author2Paper.txt
    • AMiner-Author.txt
    • AMiner-Paper.txt
  3. Run the following command, including a start and end year to specify the year range to filter down to. If you do not already have the needed dependencies, you will need to install them to run this. See below for instructions.

python pipeline.py BuildDataset --start <int> --end <int> --local-scheduler

Installing Dependencies

Dependencies include numpy, pandas, luigi, python-igraph, and gensim. To install all dependencies using pip, run:

pip install -r requirements.txt

Outputs

All outputs end up in the data directory inside the base directory, which is specified in the config module by setting base_dir.

Module 1: Relational Representation

Module: aminer Location: base-csv/ Config: base_csv_dir

The first transformation layer involves parsing the given input files into several CSV files which can subsequently be loaded into a relational database or used more easily by other transformation steps. In particular, the aminer module performs the following conversions:

AMiner-Paper.txt        ->  paper.csv  (id,title,venue,year,abstract)
                        ->  refs.csv   (paper_id,ref_id)
                        ->  venue.csv  (venue -- listing of unique venues)
                        ->  year.csv   (year -- listing of unique years)

AMiner-Author.txt       ->  person.csv (id,name)

AMiner-Author2Paper.txt ->  author.csv (author_id,paper_id)

These six CSV files contain all the information used by subsequent processing modules; the four original files from the Aminer dataset are not used again.

Module 2: Filtering

Module: filtering Location: filtered-csv/ Config: filtered_dir

Rather than examining the entire dataset at once, many experiments will likely find it useful to filter to a range of years. For this purpose, the filtering module provides an interface which takes the six relational data files and filters them based on paper publication years. All of the tasks involved take a start and end year. Running the FilterAllCSVRecordsToYearRange task like so:

python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler

will produce the following:

paper-1990-2000.csv
refs-1990-2000.csv
venue-1990-2000.csv
person-1990-2000.csv
author-1990-2000.csv

Notice that year.csv is not filtered: it is simply the listing of unique years, so a year-range filter would be pointless. These files can now be used instead of those produced from the aminer output.
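The core of the year filter can be sketched as follows. The column names follow the paper.csv layout above (id,title,venue,year,abstract); the sample rows are made up, and the real task also filters the other files down to the surviving paper ids.

```python
import csv
import io

# Keep only paper rows whose publication year falls in [start, end].
def filter_papers(rows, start, end):
    return [r for r in rows if start <= int(r["year"]) <= end]

# Toy stand-in for paper.csv.
sample = io.StringIO(
    "id,title,venue,year,abstract\n"
    "1,A,KDD,1995,x\n"
    "2,B,ICML,2005,y\n"
)
papers = list(csv.DictReader(sample))
kept = filter_papers(papers, 1990, 2000)   # only the 1995 paper survives
```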

Module 3: Network Building

Module: build_graphs Location: graphs/ Config: graph_dir

This module constructs citation networks from the relational data files. In particular, it contains tasks for building a paper citation graph and an author citation graph, as well as for finding and writing the largest connected component (LCC) of the author citation graph. All tasks take an optional start and end year. If none is passed, the entire dataset is used; otherwise the specified subset is parsed (if not already present in filtered-csv/) and used instead. All graph data can be built by running the BuildAllGraphData task like so:

python build_graphs.py BuildAllGraphData --start 2000 --end 2005 --local-scheduler

This will produce the following output files:

paper-citation-graph-2000-2005.pickle.gz
paper-citation-graph-2000-2005.graphml.gz
paper-id-to-node-id-map-2000-2005.csv
author-citation-graph-2000-2005.graphml.gz
author-id-to-node-id-map-2000-2005.csv
lcc-author-citation-graph-2000-2005.csv
lcc-author-citation-graph-2000-2005.edgelist.txt
lcc-author-citation-graph-2000-2005.pickle.gz
lcc-author-id-to-node-id-map-2000-2005.csv
lcc-venue-id-map-2000-2005.csv
lcc-ground-truth-by-venue-2000-2005.txt
lcc-author-venues-2000-2005.txt

Note that the dates will be absent when running without start and end. So for instance, the last file would be lcc-author-venues.txt instead.
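The derivation of the author citation graph from the paper citation data can be sketched like this. The data shapes mirror refs.csv (paper_id,ref_id) and author.csv (author_id,paper_id), but the values below are toy examples, not from the dataset.

```python
from itertools import product

# Toy stand-ins for refs.csv and author.csv.
refs = [(1, 2), (1, 3)]                     # paper 1 cites papers 2 and 3
authorship = {1: ["a"], 2: ["b"], 3: ["b", "c"]}

# An author cites another whenever one of their papers references
# one of the other's papers.
edges = set()
for paper, ref in refs:
    for src, dst in product(authorship[paper], authorship[ref]):
        if src != dst:                      # drop author self-citations
            edges.add((src, dst))
```

The real module builds this as an igraph graph, then extracts its largest connected component for the lcc-* outputs.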

TODO: Add info on each data file

Module 4: Representative Documents

Module: repdocs Location: repdocs/ Config: repdoc_dir

This module creates a representative document (repdoc) for each paper by concatenating the title and abstract with a space between. Subsequent processing treats these documents as a corpus to construct term frequency (tf) attributes for each paper. Note that the tf corpus is the well-known bag-of-words (BoW) representation.

Since experiments may also be concerned with authors as nodes in a network, such as in the LCC author citation graph constructed by the build_graphs module, repdocs are also created for each author. The repdoc for a person is built by concatenating the repdocs of all papers that person authored. These are then treated in the same manner as paper repdocs to build a tf corpus. Term-frequency inverse-document-frequency (tfidf) weighting is also applied to produce an additional corpus file for authors.
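The pipeline builds its corpora with gensim; the plain-Python sketch below just shows the tf and tfidf weighting applied to author repdocs, using made-up documents and gensim's default base-2 idf.

```python
import math
from collections import Counter

# Toy author repdocs (in the pipeline, each is the concatenation
# of that author's paper titles and abstracts).
repdocs = {
    "alice": "graph community detection community",
    "bob": "topic model detection",
}

# tf: raw term counts per author document (the BoW representation).
tf = {a: Counter(doc.split()) for a, doc in repdocs.items()}

# tfidf: down-weight terms that appear in many documents.
n_docs = len(repdocs)
df = Counter(term for counts in tf.values() for term in counts)
tfidf = {
    a: {t: c * math.log2(n_docs / df[t]) for t, c in counts.items()}
    for a, counts in tf.items()
}
```

Note that a term appearing in every document ("detection" here) gets tfidf weight 0, which is why the tf and tfidf corpora can differ in their effective vocabularies.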

All data can be produced by running:

python repdocs.py BuildAllRepdocData --start 2013 --end 2013 --local-scheduler

The following files are produced:

repdoc-by-paper-2013-2013.csv
repdoc-by-paper-vectors-2013-2013.csv
repdoc-by-paper-corpus-2013-2013.dict
repdoc-by-paper-corpus-2013-2013.mm
repdoc-by-paper-corpus-2013-2013.mm.index
paper-id-to-repdoc-id-map-2013-2013.csv
repdoc-by-author-vectors-2013-2013.csv
lcc-repdoc-corpus-tf-2013-2013.mm
lcc-repdoc-corpus-tf-2013-2013.mm.index
lcc-repdoc-corpus-tfidf-2013-2013.mm
lcc-repdoc-corpus-tfidf-2013-2013.mm.index

Note that the files prefixed with lcc- are dependent upon the output of the build_graphs module, since the author ids from the LCC author citation graph are used to filter down the author repdocs used to build the corpus.

TODO: add explanation of data files

Building All

To build all data files for a particular range of years, simply run:

python pipeline.py BuildDataset --start <int> --end <int> --local-scheduler

The start and end arguments can be omitted to build all data files for the whole dataset. In addition, a --workers <int> flag can be used to specify the level of multiprocessing. The dependency chain will limit this in some places throughout the processing, but it can provide a significant speedup overall.

Input Data Format

Paper Format (V8)

The papers in the dataset are represented using a custom non-tabular format which allows for all papers to be stored in the same file in sequential blocks. This is the specification:

#index ---- index id of this paper
#* ---- paper title
#@ ---- authors (separated by semicolons)
#o ---- affiliations (separated by semicolons; each affiliation corresponds to an author, in order)
#t ---- year
#c ---- publication venue
#% ---- the ids of this paper's references (multiple lines, one reference per line)
#! ---- abstract

The following is an example:

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#o Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;IBM, Beijing, China;IBM, Beijing, China
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
#% 722904
#% 760866
#% 766409
#% 769881
#% 769906
#% 788094
#% 805885
#% 809459
#% 817555
#% 874510
#% 879570
#% 879587
#% 939393
#% 956501
#% 989621
#% 1117023
#% 1250184
#! This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper, we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.
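Parsing this block format can be sketched as below. This is an illustrative reimplementation, not the aminer module's actual code; the truncated sample record reuses fields from the example above.

```python
# Each '#'-prefixed tag starts a field; '#%' repeats once per reference.
def parse_paper(block):
    paper = {"refs": []}
    fields = {"#index": "id", "#*": "title", "#@": "authors",
              "#o": "affiliations", "#t": "year", "#c": "venue",
              "#!": "abstract"}
    for line in block.strip().splitlines():
        tag, _, value = line.partition(" ")
        if tag == "#%":
            paper["refs"].append(value)
        elif tag in fields:
            paper[fields[tag]] = value
    return paper

record = """#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#t 2008
#% 197394
#% 220708"""
paper = parse_paper(record)
```

A full parser would additionally split records on blank lines and stream the file, since AMiner-Paper.txt is too large to hold comfortably in memory.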

Name Disambiguation

From the data, it appears the AMiner group did not perform any name disambiguation. This has led to a dataset with quite a few duplicate author records. This package currently does not address these issues.

The most obvious examples are those where the first or second name is abbreviated with a single letter in one place and spelled out fully in another. Use of dots and/or hyphens in some places also leads to different entity mappings. Another case that is quite common is when hyphenated names are spelled in some places with the hyphen and in some without.

There are also simple common misspellings, although these are harder to detect, since an edit distance of 1 or 2 could just as easily be a completely different name. One case that might be distinguished is when the edit deletes a letter within a run of one or more of that same letter. For instance, "Acharya" vs. "Acharyya": here it is likely that the second spelling simply has an extraneous y.
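A simple heuristic for the abbreviation and hyphenation cases described above might normalize names to a key, as sketched below. This is a hypothetical helper, not part of the package, and it is deliberately aggressive: it would also collapse genuinely distinct people who share initials and surname.

```python
import re

# Strip dots and hyphens, lowercase, and reduce given names to initials,
# so variants of the same name map to the same key.
def name_key(name):
    parts = re.sub(r"[.\-]", " ", name).lower().split()
    if len(parts) < 2:
        return tuple(parts)
    # Keep the surname whole; abbreviate all given names to one letter.
    return tuple(p[0] for p in parts[:-1]) + (parts[-1],)
```

Under this key, "Jean-Paul Smith" and "J.-P. Smith" collide, as do "John Smith" and "J. Smith"; a real disambiguation pass would need extra evidence (co-authors, venues) before merging records.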

dblp's People

Contributors

macks22


dblp's Issues

Improve filtering to use smaller dependencies if available

Currently the filtering module always grabs the files from base-csv to filter from. However, if a time range is given that is subsumed by another filtered file in filtered-csv, that file could be used instead. For instance, if we need to filter to 2011-2011, we can do that with the dataset for 2010-2012, since 2011 is subsumed by it.

This should be implemented for all YearFilterableTask subclasses (so probably something generic on the base class).
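The subsumption check itself is small; a sketch (illustrative function, not repo code) might pick the tightest already-filtered range containing the requested one:

```python
# Among already-filtered year ranges, choose the smallest one that
# contains the requested range; None means fall back to base-csv.
def best_source(requested, available):
    start, end = requested
    candidates = [(s, e) for s, e in available if s <= start and end <= e]
    return min(candidates, key=lambda r: r[1] - r[0], default=None)

# e.g. filtered-csv already holds paper-2005-2015.csv and paper-2010-2012.csv
ranges = [(2005, 2015), (2010, 2012)]
src = best_source((2011, 2011), ranges)
```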

Failed scheduling due to utils.py 'basestring' is not defined?

Hi,
I am trying to use your parser, and while running the command python pipeline.py BuildDataset --start 2000 --end 2001 --local-scheduler I get an error connected to "NameError: name 'basestring' is not defined" in utils.py. I looked at the code, but to be honest I struggle to understand what the variable basestring is supposed to be. Any indication of what I can check or how I can solve this would be much appreciated!

Error:

DEBUG: Checking if BuildDataset(start=2000, end=2001) is complete
/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py:328: UserWarning: Task BuildDataset(start=2000, end=2001) without outputs has no custom complete() method
  is_complete = task.complete()
DEBUG: Checking if BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) is complete
INFO: Informed scheduler that task   BuildDataset_2001_2000_429339e3d6   has status   PENDING
WARNING: Will not run BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001) or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/worker.py", line 328, in check_complete
    is_complete = task.complete()
  File "/Users/admin/anaconda/lib/python3.5/site-packages/luigi/task.py", line 533, in complete
    outputs = flatten(self.output())
  File "/Users/admin/Desktop/DBLP_parser/dblp-master/pipeline/util.py", line 39, in output
    if isinstance(self.base_paths, basestring):
NameError: name 'basestring' is not defined

INFO: Informed scheduler that task   BuildLCCAuthorRepdocCorpusTfidf_2001_2000_429339e3d6   has status   UNKNOWN
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 1 pending tasks possibly being run by other workers
DEBUG: There are 1 pending tasks unique to this worker
DEBUG: There are 1 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=038425231, workers=1, host=Monikas-MacBook-Pro.local, username=admin, pid=75675) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 failed scheduling:
    - 1 BuildLCCAuthorRepdocCorpusTfidf(start=2000, end=2001)
* 1 were left pending, among these:
    * 1 had dependencies whose scheduling failed:
        - 1 BuildDataset(start=2000, end=2001)

Did not run any tasks
This progress looks :( because there were tasks whose scheduling failed

===== Luigi Execution Summary =====
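The traceback above is a Python 2 vs. 3 issue: `basestring` is the Python 2 common ancestor of `str` and `unicode`, and Python 3 removed it. One possible fix (a common compatibility shim, not taken from the repo) is:

```python
# Define a name usable in isinstance checks on both Python 2 and 3.
try:
    string_types = basestring          # Python 2
except NameError:
    string_types = str                 # Python 3

# Mirrors the kind of check util.py's output() performs: is base_paths
# a single path string, or a collection of paths?
def is_single_path(base_paths):
    return isinstance(base_paths, string_types)
```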

How to run this on the Aminer dataset?

Hi,

I want to parse the dblp dataset. I went through the documentation but I am still struggling to get started. Can anyone please elaborate on the steps to get this running?

Build co-authorship network

While the AMiner group already provides a co-authorship network, it unfortunately does not allow filtering by year ranges, which is a key feature of this library. It would therefore be useful to implement tasks, inheriting from YearFilterableTask, which construct such a network. These could then be combined with the term attributes from the repdoc corpus files and the ground truth from the venues to produce a more useful network.

Perform author name disambiguation to produce new mapping

From the data, it appears the AMiner group did not perform any name disambiguation. This has led to a dataset with quite a few duplicate author records. This package currently does not address these issues.

The most obvious examples are those where the first or second name is abbreviated with a single letter in one place and spelled out fully in another. Use of dots and/or hyphens in some places also leads to different entity mappings. Another case that is quite common is when hyphenated names are spelled in some places with the hyphen and in some without.

There are also simple common misspellings, although these are harder to detect, since an edit distance of 1 or 2 could just as easily be a completely different name. One case that might be distinguished is when the edit deletes a letter within a run of one or more of that same letter. For instance, "Acharya" vs. "Acharyya": here it is likely that the second spelling simply has an extraneous y.

Repdocs Module Documentation

Could you please add descriptions for each file in the repdocs module. I'm trying to use this parser for my projects and am unclear what all the files contain and how they relate to each other.
For example, the dictionary created using gensim.corpora has a different number of documents than the tfidf matrix created.

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly ERROR....

hi,
after installing all dependencies, I am still unable to tell what went wrong: it did generate the CSV files but is not generating the GML file, and the program gives the error below.
Could you help resolve this issue? I would also appreciate it if you uploaded the GML files to the repo.

INFO: Task RemoveUniqueVenues__99914b932b died unexpectedly with exit code -9
/usr/local/lib/python2.7/dist-packages/luigi/parameter.py:259: UserWarning: Parameter None is not of type string.
warnings.warn("Parameter {0} is not of type string.".format(str(x)))

===== Luigi Execution Summary =====

Scheduled 25 tasks of which:

  • 2 present dependencies were encountered:
    • 1 AminerNetworkAuthorships()
    • 1 AminerNetworkPapers()
  • 5 ran successfully:
    • 1 CSVPaperRecords()
    • 1 CSVRefsRecords()
    • 1 ParseAuthorshipsToCSV()
    • 1 ParsePapersToCSV()
    • 1 RemovePapersNoVenueOrYear()
  • 1 failed:
    • 1 RemoveUniqueVenues()
  • 17 were left pending, among these:
    • 17 had failed dependencies:
      • 1 AuthorCitationGraphLCCIdmap(start=2003, end=2004)
      • 1 BuildAuthorCitationGraph(start=2003, end=2004)
      • 1 BuildAuthorRepdocVectors(start=2003, end=2004)
      • 1 BuildDataset(start=2003, end=2004)
      • 1 BuildLCCAuthorRepdocCorpusTf(start=2003, end=2004)
        ...

This progress looks :( because there were failed tasks

python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler does not work as guided

Hi Mack,

Thanks for closing the BuildAllGraph issue so promptly.

I also noticed that module 2 does not work as expected.

If I ran,

$ python filtering.py FilterAllCSVRecordsToYearRange --start 1990 --end 2000 --local-scheduler

I would receive the following errors.

ERROR: [pid 31790] Worker Worker(salt=969812214, workers=1, host=ubuntu, username=hello, pid=31790) failed FilterVenuesToYearRange(start=1990, end=1990)
Traceback (most recent call last):
  File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 192, in run
    new_deps = self._run_get_new_deps()
  File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
    task_gen = self.task.run()
  File "filtering.py", line 158, in run
    with self.output().open() as afile:
  File "/Users/hello/anaconda2/lib/python2.7/site-packages/luigi/local_target.py", line 152, in open
    fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
IOError: [Errno 2] No such file or directory: '/Users/hello/dblp/data/processed/filtered-csv/venue-1990-1990.csv'

Many thanks in advance.

Thoroughly document each output file.

It would be good to find some sort of programmatic way to do this, such that the final output is a polished data dictionary which can be exported to Excel or converted to a PDF. Crawling the dependency tree sounds like the way to go in order to grab all possible output files. Then this list can be compared with a documentation file (perhaps doc.md or doc.csv) in order to determine coverage.

Of course, such an approach would need to ignore the optional year parameters. There's no sense in producing all possible files for all possible year ranges.

Tasks to summarize data

For a complete dataset, generate a summary of salient characteristics, such as:

  • number of nodes and edges for each graph, diameter, avg. degree
  • number of documents, terms, and nonzeros in each corpus, quantiles on term count
  • proportion of papers with abstracts
  • ground truth stats: # venues, quantiles on comm. size
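The graph-side summary in the list above can be sketched from an undirected edge list; the data here is a toy example, and the real task would read the pickled igraph graphs instead.

```python
# Node/edge counts and average degree from an undirected edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

nodes = {n for e in edges for n in e}
n_nodes, n_edges = len(nodes), len(edges)
avg_degree = 2 * n_edges / n_nodes   # each edge contributes two endpoints
```

Diameter and the corpus statistics need more machinery (BFS over the graph, a pass over the .mm corpus files), but follow the same pattern of one summary task per artifact.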

paper.csv is too large to save in my computer

When I tried to run the pipeline, paper.csv was generated from AMiner-Paper.txt (about 2.2 GB). The paper.csv file grew far too large (it exceeded 1.7 TB), but my computer has only about 2 TB of storage space, so it failed each time I ran the project. Do you know how to fix this?

Add AMiner data retrieval script

Write a script that downloads the data from the AMiner site and places it in the proper location. This could be a luigi Task or a bash script. The task option might be better, because then it can be a programmatic dependency which can be resolved (run) by luigi, rather than an external one.
