- Introduction
- Implementation details
- Input files
- Output file
- Python Module
- Command Line Script
- Tests
- Dependencies
This is Sean Wahl's submission for the Insight Data Engineering coding challenge. As described in the prompt, the code is designed to parse log files provided by the Securities and Exchange Commission's Electronic Data Gathering, Analysis and Retrieval (EDGAR) system.
The goal of the code is to build a pipeline that ingests a stream of SEC EDGAR logs, under the assumption that the data is provided to the program as a real-time stream, and calculates how long a particular user spends on EDGAR during a visit and how many documents that user requests during the session.
The program expects two input files, both located in the input
directory:
log.csv
: EDGAR weblog data
inactivity_period.txt
: Holds a single value denoting the period of inactivity that should be used to identify when a user session is over
The source code for the EDGAR weblog parser is a Python module located in:
src/sessionization.py
While the Python module can be imported into Python code in the usual fashion,
sessionization.py
also contains a script implementation that can be called from the
command-line via:
python src/sessionization.py ./input/log.csv ./input/inactivity_period.txt ./output/sessionization.txt
In the current example this script is called by a shell-script:
./run.sh
The SEC provides weblogs stretching back years. They are regularly updated, although with a six-month delay.
For the purposes of this challenge, I assume that the data is streamed into the program in the same order that it appears in the file, and that the file is in chronological order: the first line (after the header) is the earliest request and the last line is the latest.
The code makes use of the following variables:
ip
: identifies the IP address of the device requesting the data. While the SEC anonymizes the last three digits, it uses a consistent formula that allows you to assume that any two ip fields with duplicate values refer to the same IP address
date
: date of the request (yyyy-mm-dd)
time
: time of the request (hh:mm:ss)
cik
: SEC Central Index Key
accession
: SEC document accession number
extention
: Value that helps determine the document being requested
The logfile contains a header line, which the code uses to know which column of the log corresponds to which variable.
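The header lookup described above can be sketched as follows. This is only an illustration: the helper name map_fields and the shortened sample header are assumptions, not code or data taken from src/sessionization.py.

```python
# Hypothetical helper, not the actual code in src/sessionization.py.
REQUIRED_FIELDS = ("ip", "date", "time", "cik", "accession", "extention")


def map_fields(header_line, required=REQUIRED_FIELDS):
    """Return a dict mapping each required field name to its column index."""
    columns = [name.strip() for name in header_line.strip().split(",")]
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError("header is missing fields: " + ", ".join(missing))
    return dict((name, columns.index(name)) for name in required)


# Shortened sample header, assumed for illustration only.
header = "ip,date,time,zone,cik,accession,extention,code,size\n"
fields = map_fields(header)
```

Looking the columns up by name means the parser keeps working if the column order in a log file changes, as long as the header names stay the same.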
The inactivity_period.txt file holds a single integer value denoting the period of inactivity (in seconds) that the program should use to identify a user session. The value will range from 1 to 86,400 (i.e., one second to 24 hours).
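A minimal sketch of reading and validating this value is shown below. The function name parse_inactivity_period is hypothetical, and the range check is an assumption based only on the limits stated above; the actual module may handle this differently.

```python
def parse_inactivity_period(text):
    """Parse the inactivity period (in seconds) from the file contents.

    Hypothetical helper: rejects values outside the documented
    range of 1 to 86400 seconds.
    """
    value = int(text.strip())
    if not 1 <= value <= 86400:
        raise ValueError("inactivity period must be between 1 and 86400 seconds")
    return value


# Example with the contents of a file that holds the value 2.
period = parse_inactivity_period("2\n")
```

In a script, the text would come from something like open("./input/inactivity_period.txt").read().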
Once the program identifies the start and end of a session, it gathers the following fields and writes them out to a line in the output file, sessionization.txt. The fields on each line must be separated by a comma (,):
- IP address of the user exactly as found in
log.csv
- date and time of the first webpage request in the session (yyyy-mm-dd hh:mm:ss)
- date and time of the last webpage request in the session (yyyy-mm-dd hh:mm:ss)
- duration of the session in seconds
- count of webpage requests during the session
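The output line described by this list can be assembled as in the sketch below. The function name format_session and the sample values are made up for illustration. The duration is passed in by the caller because the list above does not say how the first and last seconds are counted.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_session(ip, first_request, last_request, duration, count):
    """Join the five session fields into one comma-separated line."""
    return ",".join([
        ip,
        first_request.strftime(TIME_FORMAT),
        last_request.strftime(TIME_FORMAT),
        str(duration),
        str(count),
    ])


# Made-up sample values; the duration here is a placeholder chosen by the caller.
start = datetime(2017, 6, 30, 0, 0, 0)
end = datetime(2017, 6, 30, 0, 0, 3)
line = format_session("101.81.133.jja", start, end, 3, 4)
# line == "101.81.133.jja,2017-06-30 00:00:00,2017-06-30 00:00:03,3,4"
```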
The submitted code is implemented in Python 2.7.12.
The Python module contained in src/sessionization.py
defines two classes:
User
: Represents a user in a SEC EDGAR data log and tracks requests during a session
SessionParser
: Defines a parser object for SEC EDGAR log data
A User
is initialized with a unique integer identifier lid
, the user's ip address
ip
and start_time
, a datetime object representing the time of the user's first
request in a session.
The function User.update()
is called for each subsequent request in a session, during
which the User
tracks the session length and number of documents requested.
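The sketch below shows one way such a class could look. It is an illustrative stand-in and not the code in src/sessionization.py: the attribute names last_time and n_requests, the argument to update(), and the elapsed() helper are all assumptions.

```python
from datetime import datetime


class User(object):
    """Illustrative stand-in for the User class; attribute names are assumed."""

    def __init__(self, lid, ip, start_time):
        self.lid = lid                # unique integer identifier
        self.ip = ip                  # IP address as found in the log
        self.start_time = start_time  # datetime of the first request
        self.last_time = start_time   # datetime of the most recent request
        self.n_requests = 1           # the first request counts as one

    def update(self, request_time):
        """Record one more request in the current session."""
        self.last_time = request_time
        self.n_requests += 1

    def elapsed(self):
        """Seconds between the first and the most recent request."""
        return int((self.last_time - self.start_time).total_seconds())


user = User(0, "101.81.133.jja", datetime(2017, 6, 30, 0, 0, 0))
user.update(datetime(2017, 6, 30, 0, 0, 2))
```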
A SessionParser is initialized with output, a file object for the parsed session information; inactivity_period, the integer number of seconds beyond which a session is terminated; and fields, a dictionary storing the column numbers for the necessary fields.
Once the parser is initialized, EDGAR log data is passed to it via SessionParser.parse_requests(). This can be called an arbitrary number of times to parse data from different log files in sequence, and the parser object is able to track a User session that continues across those log files.
The parser includes checks so that it gracefully skips over any invalid entries in a log file, prints a warning when this occurs, and continues parsing the sessions under the assumption that these entries can be ignored.
Finally, SessionParser.terminate_remaining_sessions() must be run after all desired log data has been passed to the parser, to terminate the sessions that were still active at the end of the last data source.
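The call sequence described above (initialize, parse one or more batches, then terminate the remaining sessions) can be illustrated with a simplified stand-in. SketchParser is hypothetical and is not the SessionParser in src/sessionization.py. It assumes a session expires when more than inactivity_period seconds pass without a request, and it reports the duration as plain elapsed seconds, which may differ from the real code. The example uses io.StringIO and runs on Python 3.

```python
import io
import warnings
from collections import OrderedDict
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


class SketchParser(object):
    """Simplified, hypothetical stand-in for SessionParser."""

    def __init__(self, output, inactivity_period, fields):
        self.output = output
        self.timeout = timedelta(seconds=inactivity_period)
        self.fields = fields
        self.active = OrderedDict()  # ip -> [first_time, last_time, count]

    def _close(self, ip):
        first, last, count = self.active.pop(ip)
        # Plain elapsed seconds; the real code may count the duration differently.
        duration = int((last - first).total_seconds())
        self.output.write(",".join([
            ip, first.strftime(TIME_FORMAT), last.strftime(TIME_FORMAT),
            str(duration), str(count)]) + "\n")

    def parse_requests(self, lines):
        for line in lines:
            row = line.strip().split(",")
            try:
                ip = row[self.fields["ip"]]
                stamp = row[self.fields["date"]] + " " + row[self.fields["time"]]
                now = datetime.strptime(stamp, TIME_FORMAT)
            except (IndexError, ValueError):
                warnings.warn("skipping invalid log entry: %r" % line)
                continue
            # Close every session that has been idle for longer than the timeout.
            expired = [key for key, (_, last, _) in self.active.items()
                       if now - last > self.timeout]
            for key in expired:
                self._close(key)
            if ip in self.active:
                self.active[ip][1] = now
                self.active[ip][2] += 1
            else:
                self.active[ip] = [now, now, 1]

    def terminate_remaining_sessions(self):
        for ip in list(self.active):
            self._close(ip)


out = io.StringIO()
parser = SketchParser(out, 2, {"ip": 0, "date": 1, "time": 2})
parser.parse_requests(["1.1.1.aaa,2017-06-30,00:00:00",
                       "2.2.2.bbb,2017-06-30,00:00:01"])
parser.parse_requests(["not a valid row",
                       "1.1.1.aaa,2017-06-30,00:00:02",
                       "1.1.1.aaa,2017-06-30,00:00:05"])
parser.terminate_remaining_sessions()
sessions = out.getvalue().splitlines()
```

In this run the first session for 1.1.1.aaa spans both calls to parse_requests(), the invalid row only triggers a warning, and the session that is still open at the end is written out by terminate_remaining_sessions().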
The command-line script contained in src/sessionization.py implements an example of the code in which the EDGAR weblog data is read from a single log file. The script passes the log data to the parser a fixed number of lines at a time. This is intended so that the output can be viewed while a very large data source is being parsed, and to function in a similar manner to a real-time stream of data. The size of this line buffer can be specified with the command-line option '-n'.
usage: sessionization.py [-h] [-n LINE_BUFFER] [-d HEADER] log*.csv inactivity_period output

Parse SEC EDGAR log into Sessions.

positional arguments:
  log*.csv              EDGAR weblog data file (csv)
  inactivity_period     text file storing integer inactivity period
  output                output text file with session information

optional arguments:
  -h, --help            show this help message and exit
  -n LINE_BUFFER, --line_buffer LINE_BUFFER
                        number of lines to parse at a time
  -d HEADER, --header HEADER
                        number of header lines to skip at start of log file
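As a rough sketch of how a script with this interface could be put together, the example below builds an argparse parser with the same options and reads lines in fixed-size batches with itertools.islice. The function names build_arg_parser and read_in_batches and the default values for -n and -d are assumptions; the actual script may differ.

```python
import argparse
from itertools import islice


def build_arg_parser():
    """Build a parser with the options listed in the usage message."""
    parser = argparse.ArgumentParser(
        prog="sessionization.py",
        description="Parse SEC EDGAR log into Sessions.")
    parser.add_argument("log", metavar="log*.csv",
                        help="EDGAR weblog data file (csv)")
    parser.add_argument("inactivity_period",
                        help="text file storing integer inactivity period")
    parser.add_argument("output",
                        help="output text file with session information")
    # The defaults below are assumed for illustration.
    parser.add_argument("-n", "--line_buffer", type=int, default=1000,
                        help="number of lines to parse at a time")
    parser.add_argument("-d", "--header", type=int, default=1,
                        help="number of header lines to skip at start of log file")
    return parser


def read_in_batches(lines, batch_size):
    """Yield lists of at most batch_size items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


args = build_arg_parser().parse_args(
    ["-n", "2", "log.csv", "inactivity_period.txt", "sessionization.txt"])
batches = list(read_in_batches(["a", "b", "c", "d", "e"], args.line_buffer))
# batches == [["a", "b"], ["c", "d"], ["e"]]
```

Because read_in_batches() accepts any iterable, an open file object can be passed directly, and each batch can then be handed to the parser in turn.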
To make sure that your code has the correct directory structure and that the format of the output files is correct, we have included a test script called run_tests.sh
in the insight_testsuite
folder.
The tests are stored simply as text files under the insight_testsuite/tests
folder. Each test should have a separate folder with an input
folder for inactivity_period.txt
and log.csv
and an output
folder for sessionization.txt
.
You can run the test with the following command from within the insight_testsuite
folder:
insight_testsuite~$ ./run_tests.sh
On a failed test, the output of run_tests.sh
should look like:
[FAIL]: test_1
[Thu Mar 30 16:28:01 PDT 2017] 0 of 1 tests passed
On success:
[PASS]: test_1
[Thu Mar 30 16:25:57 PDT 2017] 1 of 1 tests passed
The other tests were designed to check that the parser works as intended for different choices of inactivity_period, for beginning at an arbitrary time, and for skipping invalid input data.
The Python module for the parser depends only on the following modules from the Python Standard Library:
sys
argparse
datetime
itertools
warnings