MAPO:
Mining API Usages from Open Source Repositories
Related Publication:
Tao Xie and Jian Pei. MAPO:
Mining API Usages from Open Source Repositories. In Proceedings
of the 3rd
International Workshop on Mining Software Repositories (MSR 2006),
Shanghai, China, pp. 54-57, May 2006. [PDF][BibTeX][Slides]
This project develops a tool for mining API usage from partial source
code (that is, source files for which the complete code needed for
compilation is not available). For example, these source files can come
from the results returned by a search engine for open source projects,
such as http://www.koders.com/. The tool can also mine complete source
code stored locally. The tool consists of three components:
-
Source code collector:
this component, which is still to be developed, will collect the top N
source code files returned by koders.com for given search keywords. The
keywords can be feature names such as "logging", package names, class
names, or method names. So far we download the source code files from
koders.com manually. Source code can also be collected from downloaded
open source projects.
-
Source code analyzer:
given the collected Java source files, this component analyzes each
Java source file and produces a file containing method-call sequences
invoked by each method in the Java source file. It also exports a
single sequence database file that can be analyzed by the BIDE sequence
mining tool.
-
Usage pattern miner (BIDE):
given the sequence database, this component mines frequent usage
patterns from it. The BIDE tool is not publicly available; it is
provided by the BIDE authors upon request. An alternative frequent
sequence miner is SPAM, which can be downloaded from
http://himalaya-tools.sourceforge.net/.
Note that if you use SPAM, you need to adapt the miner input format
described below, which is specific to the BIDE miner.
Source code collector:
- So far we manually download the source code files from
www.koders.com.
Source code analyzer:
- Installation: The source code of this component can be downloaded
from this web page. It is built on PMD, a Java source code scanner. To
run it, download the modified pmd jar file (the source files are
included in the jar file) and add it to your classpath. In addition,
download the following jar files and add them to your classpath: ant,
jaxen, xercesImpl, and xmlParser. You also need to download the ruleset
zip and extract it to your local hard drive. The examples below assume
that the extracted directory is C:\xtwork\pmd\pmd-3.4\rulesets.
- Usage: java net.sourceforge.pmd.PMD /path/to/source text c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
e.g., java net.sourceforge.pmd.PMD c:\xtwork\pmd\pmd-3.4\examples\BCELClassAnalyzer.java text c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
You can also analyze all the files in a directory by specifying the
path to the directory rather than a specific source file name:
e.g., java net.sourceforge.pmd.PMD c:\xtwork\pmd\pmd-3.4\examples text c:\xtwork\pmd\pmd-3.4\rulesets\methodcalls.xml
If you want to run the tool over a jar or zip file containing all the
source files, please refer to PMD's usage documentation, which still
applies to our tool.
- Outputs: For each Java source file, four files are generated in the
same directory. Assuming your Java source file is BCELAnalyzer.java,
you will see:
BCELAnalyzer.java.woce: method sequences with inlined local method calls.
BCELAnalyzer.java.woc: method sequences without inlined local method
calls, so local method calls such as this.XXX appear in the
sequences. This file is for debugging use.
BCELAnalyzer.java.full: not used; eventually we will add control flow
information among method call sequences.
BCELAnalyzer.java.debug: debug information, including control flow
information.
At the moment, only the BCELAnalyzer.java.woce file is useful for debugging.
In addition, the analyzer outputs the following files to be used for
mining (for the subject described below):
mcseq.txt: input to BIDE
mcseq.spec: input to BIDE
mcseq.map: mapping from method names to method IDs, which are used in
mcseq.txt
mcseq.txt.debug: readable form of mcseq.txt, for debugging
For example, the following is one line in mcseq.txt.debug (for the
representation of method call names, see "*.woce file format" below). A
line lists the method calls invoked by a caller, separated by spaces,
and ends with "-1". After the "-1", we also put the caller name, which
is not present in the mcseq.txt file. In the mcseq.txt file, method
calls are represented by their method IDs, whose mappings are described
in mcseq.map.
org.xml.sax.helpers.AttributesImpl,<init> org.xml.sax.ContentHandler,startDocument org.xml.sax.ContentHandler,startElement(4) org.xml.sax.ContentHandler,characters(3) org.xml.sax.ContentHandler,endElement(3) org.xml.sax.ContentHandler,endDocument -1 generateLargeSAX(1) @AbstractXMLTestCase.java.woc
The corresponding line in mcseq.txt:
0 1 2 3 4 5 -1
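The relationship between mcseq.txt.debug, mcseq.map, and mcseq.txt can be illustrated with a minimal Java sketch. It is not the analyzer's actual code: the class SequenceEncoder is hypothetical, and it assumes that IDs start at 0 and are assigned in order of first appearance, which is consistent with the example line above but is not confirmed by the tool's documentation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the MAPO source code.
// Assumption: method IDs start at 0 and follow first-appearance order.
class SequenceEncoder {
    // Method-call name -> ID, in insertion order (plays the role of mcseq.map).
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the ID of a call name, allocating the next free ID if unseen.
    int idOf(String callName) {
        Integer id = ids.get(callName);
        if (id == null) {
            id = ids.size();
            ids.put(callName, id);
        }
        return id;
    }

    // Encodes one caller's call sequence as space-separated IDs ending in -1.
    String encode(List<String> calls) {
        StringBuilder line = new StringBuilder();
        for (String call : calls) {
            line.append(idOf(call)).append(' ');
        }
        return line.append("-1").toString();
    }

    // Exposes the name-to-ID mapping built so far.
    Map<String, Integer> mapping() {
        return ids;
    }

    public static void main(String[] args) {
        SequenceEncoder encoder = new SequenceEncoder();
        String line = encoder.encode(Arrays.asList(
                "org.xml.sax.helpers.AttributesImpl,<init>",
                "org.xml.sax.ContentHandler,startDocument",
                "org.xml.sax.ContentHandler,startElement(4)",
                "org.xml.sax.ContentHandler,characters(3)",
                "org.xml.sax.ContentHandler,endElement(3)",
                "org.xml.sax.ContentHandler,endDocument"));
        System.out.println(line); // prints "0 1 2 3 4 5 -1"
    }
}
```

Encoding a second sequence with the same encoder reuses the IDs of calls it has already seen, so identical calls in different callers map to the same item in the sequence database.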
- *.woce file format: Sequences are separated by an empty line. Each
sequence starts with a line that begins with "caller:". What follows
"caller:" is the name of a method defined in the Java source code. Note
that when you import the sequence into a sequence database, this first
line should be ignored. If a method has one or more parameters, the
method name is followed by "(PARAM_NUM)", where PARAM_NUM is the number
of parameters. The subsequent lines list the method calls that are
invoked within the method. The method calls in the sequence are named
in the same way, but each method name is also preceded by its
package-qualified class name, separated from the method name by ",".
caller: prepareMethodMap(1)
Class,getDeclaredMethods
org.apache.bcel.Repository,lookupClass(1)
org.apache.bcel.classfile.JavaClass,getMethods
org.apache.bcel.classfile.Method,getName
org.apache.bcel.generic.ArrayType,getDimensions
Class,equals(1)
org.apache.bcel.classfile.Method,getArgumentTypes
Class,equals(2)
caller: findAndAddBCELMethod(2)
org.apache.bcel.classfile.Method,getName
org.apache.bcel.generic.BasicType,equals(1)
...
You can download the source file BCELAnalyzer.java and its four
generated output files to get a concrete idea of what they look like.
You can also run the tool over it or over any other Java source files.
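To make the *.woce layout described above concrete, here is a minimal Java sketch of a reader for it. The class names WoceParser and CallSequence are hypothetical and are not part of the analyzer. The sketch assumes one method call per line and starts a new sequence at each caller header, whether or not a blank line precedes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one sequence: the caller name and its ordered calls.
class CallSequence {
    final String caller;
    final List<String> calls = new ArrayList<>();

    CallSequence(String caller) {
        this.caller = caller;
    }
}

// Hypothetical reader for *.woce content, not part of the MAPO source code.
class WoceParser {
    // Parses the lines of a *.woce file into one CallSequence per caller.
    static List<CallSequence> parse(List<String> lines) {
        List<CallSequence> sequences = new ArrayList<>();
        CallSequence current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                current = null; // an empty line ends the current sequence
                continue;
            }
            int colon = line.indexOf(':');
            // The header prefix is matched loosely, so both "caller:" and
            // "callers:" are accepted.
            if (colon > 0 && line.startsWith("caller")) {
                current = new CallSequence(line.substring(colon + 1).trim());
                sequences.add(current);
            } else if (current != null) {
                current.calls.add(line); // e.g. "Class,equals(1)"
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        List<CallSequence> parsed = parse(Arrays.asList(
                "caller: findAndAddBCELMethod(2)",
                "org.apache.bcel.classfile.Method,getName",
                "org.apache.bcel.generic.BasicType,equals(1)"));
        for (CallSequence sequence : parsed) {
            System.out.println(sequence.caller + " -> " + sequence.calls);
        }
    }
}
```

Keeping the caller name separate from the call list mirrors the note above that the header line is ignored when the sequence is imported into a sequence database.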
Some development notes for the source code analyzer can be found here.
Usage pattern miner (BIDE):
Prepared by Jianyong Wang, Email: jianyong@tsinghua.edu.cn
(BIDE is related to, but not included in, UIUC illimine; it is
described in this ICDE 04 paper)
-
Installation: put the BIDE executable in a directory that is listed in
the system PATH environment variable.
-
Inputs: 1st argument: the specification file of the dataset
2nd argument: the relative support, as a decimal
Usage example: bide_with_output.exe mcseq.spec 0.5
where bide_with_output.exe is the executable file name, mcseq.spec is
the specification file of the sequence dataset being mined, and 0.5 is
the relative support.
Specification file format: The
first line is the dataset file name, the second line is the number of
unique items, the third line is the number of sequences, the fourth
line is the maximal length of a sequence, and the fifth line is the
average length of a sequence.
Dataset file format: Usually a sequence
database consists of a series of sequences (strictly speaking, here a
sequence is a string in the current implementation). Each line
represents a sequence and ends with -1, and
the entire dataset ends with -2. Here is a sample sequence:
38 81 256 399 756 841 962 1009 -1
Example datasets: mcseq.spec and mcseq.txt
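The five values in the specification file can be derived from a dataset file in the format described above. The following Java sketch shows one way to do this. The class DatasetSpec is hypothetical and is not part of the analyzer or of BIDE. It assumes that "number of unique items" means the count of distinct item IDs, and it writes the average length using Java's default decimal formatting; the exact number format expected by BIDE is not specified here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of MAPO or BIDE.
class DatasetSpec {
    final int uniqueItems;      // assumption: count of distinct item IDs
    final int sequenceCount;
    final int maxLength;
    final double averageLength;

    private DatasetSpec(int uniqueItems, int sequenceCount, int maxLength, double averageLength) {
        this.uniqueItems = uniqueItems;
        this.sequenceCount = sequenceCount;
        this.maxLength = maxLength;
        this.averageLength = averageLength;
    }

    // Scans dataset lines: -1 closes a sequence, -2 closes the dataset.
    static DatasetSpec fromLines(List<String> lines) {
        Set<Integer> items = new HashSet<>();
        int count = 0;
        int max = 0;
        int length = 0;
        long total = 0;
        scan:
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            for (String token : trimmed.split("\\s+")) {
                int value = Integer.parseInt(token);
                if (value == -2) {
                    break scan; // end of the entire dataset
                }
                if (value == -1) { // end of the current sequence
                    count++;
                    total += length;
                    max = Math.max(max, length);
                    length = 0;
                } else {
                    items.add(value);
                    length++;
                }
            }
        }
        double average = count == 0 ? 0.0 : (double) total / count;
        return new DatasetSpec(items.size(), count, max, average);
    }

    // Renders the five specification lines in the order described above.
    String toSpec(String datasetFileName) {
        return datasetFileName + "\n" + uniqueItems + "\n" + sequenceCount + "\n"
                + maxLength + "\n" + averageLength + "\n";
    }

    public static void main(String[] args) {
        DatasetSpec spec = fromLines(Arrays.asList(
                "38 81 256 399 756 841 962 1009 -1",
                "38 81 -1",
                "-2"));
        System.out.print(spec.toSpec("mcseq.txt"));
    }
}
```

The scanner treats the file as a token stream, so it works whether the closing -2 sits on its own line or follows the last -1 on the same line.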
-
Output:
The discovered frequent sequences are written to a file called
“frequent.dat”. Each line in this result file contains a frequent
sequence in the form:
event1 event2 … eventn : absolute support
Here is an example:
6 24 748 : 66
Example output: frequent.dat
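A line in the format above can be split into its events and its absolute support with a few lines of Java. The sketch below is illustrative only. The class FrequentPattern is hypothetical, and the relative support it computes is simply the absolute support divided by a sequence count supplied by the caller (for example, the third line of the specification file).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MAPO or BIDE.
class FrequentPattern {
    final List<Integer> events;
    final int support; // absolute support

    FrequentPattern(List<Integer> events, int support) {
        this.events = events;
        this.support = support;
    }

    // Parses one result line of the form "event1 event2 ... eventn : support".
    static FrequentPattern parse(String line) {
        int colon = line.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in line: " + line);
        }
        List<Integer> events = new ArrayList<>();
        for (String token : line.substring(0, colon).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                events.add(Integer.parseInt(token));
            }
        }
        int support = Integer.parseInt(line.substring(colon + 1).trim());
        return new FrequentPattern(events, support);
    }

    // Converts the absolute support into a fraction of all sequences.
    double relativeSupport(int sequenceCount) {
        return (double) support / sequenceCount;
    }

    public static void main(String[] args) {
        FrequentPattern pattern = parse("6 24 748 : 66");
        System.out.println(pattern.events + " : " + pattern.support); // prints "[6, 24, 748] : 66"
    }
}
```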
- Frequent sequence postprocessor (a class included in the source code
analyzer):
java net.sourceforge.pmd.rules.MethodCallsPostprocessor DirectoryOfFrequentDat\frequent.dat
This produces a human-readable form of frequent.dat: frequent.data.txt
It also produces a file, exampletrace.txt, that contains the frequent
patterns that start with the same method call (note that so far we
output only the first set of frequent patterns that share the same
method call). This file will be fed to kBehavior.
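The two postprocessing steps described above, decoding method IDs back to names and grouping patterns by their first method call, can be sketched as follows. This is not the code of MethodCallsPostprocessor. The class PatternGrouper and the in-memory ID-to-name map are hypothetical, and the on-disk layout of mcseq.map is not assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the actual MethodCallsPostprocessor.
class PatternGrouper {
    // Groups patterns by the ID of their first method call, keeping input order.
    static Map<Integer, List<List<Integer>>> groupByFirstCall(List<List<Integer>> patterns) {
        Map<Integer, List<List<Integer>>> groups = new LinkedHashMap<>();
        for (List<Integer> pattern : patterns) {
            if (pattern.isEmpty()) {
                continue;
            }
            Integer first = pattern.get(0);
            List<List<Integer>> group = groups.get(first);
            if (group == null) {
                group = new ArrayList<>();
                groups.put(first, group);
            }
            group.add(pattern);
        }
        return groups;
    }

    // Renders a pattern with method names; unknown IDs are shown as "?<id>".
    static String decode(List<Integer> pattern, Map<Integer, String> names) {
        StringBuilder out = new StringBuilder();
        for (Integer id : pattern) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String name = names.get(id);
            out.append(name != null ? name : "?" + id);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<Integer>> patterns = Arrays.asList(
                Arrays.asList(0, 1), Arrays.asList(2, 3), Arrays.asList(0, 4));
        // The first entry of the returned map is the first set of patterns
        // that share the same starting method call.
        System.out.println(groupByFirstCall(patterns));
    }
}
```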
Subjects: