Overview¶
High Performance Analytics Toolkit (HPAT) is a big data analytics and machine learning framework that provides Python’s ease of use but is extremely fast.
HPAT scales analytics programs in python to cluster/cloud environments automatically, requiring only minimal code changes. Here is a logistic regression program using HPAT:
@hpat.jit
def logistic_regression(iterations):
f = h5py.File("lr.hdf5", "r")
X = f['points'][:]
Y = f['responses'][:]
D = X.shape[1]
w = np.random.ranf(D)
t1 = time.time()
for i in range(iterations):
z = ((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y)
w -= np.dot(z, X)
return w
This code runs on cluster and cloud environments using a simple command like mpiexec -n 1024 python logistic_regression.py.
HPAT compiles a subset of Python to efficient native parallel code (with MPI). This is in contrast to other frameworks such as Apache Spark which are master-executor libraries. Hence, HPAT is typically 100x or more faster. HPAT is built on top of Numba and LLVM compilers.