Supported Pandas Operations¶
Below is the list of Pandas operators that HPAT supports. Optional arguments are not supported unless specified. Since Numba doesn’t support Pandas, only these operations can be used for both large and small datasets.
In addition:
- Accessing columns using both getitem (e.g. df['A']) and attribute (e.g. df.A) is supported.
- Using columns similar to Numpy arrays and performing data-parallel operations listed previously is supported.
- Filtering data frames using boolean arrays is supported (e.g. df[df.A > .5]).
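The three access patterns above can be sketched as follows. This is a minimal illustration in plain pandas; the function name select_and_scale and the sample data are hypothetical, and under HPAT the function would be the one compiled with the HPAT decorator.

```python
import pandas as pd

def select_and_scale(df):
    # Column access by getitem and by attribute yields the same column.
    a_item = df['A']
    a_attr = df.A
    # Columns behave like Numpy arrays for element-wise arithmetic.
    scaled = a_item * 2 + a_attr
    # Filter rows with a boolean array built from a column.
    filtered = df[df.A > .5]
    return scaled, filtered

df = pd.DataFrame({'A': [0.25, 0.75, 1.0], 'B': [1, 2, 3]})
scaled, filtered = select_and_scale(df)
```

Here scaled holds 3 * A for every row, and filtered keeps only the rows where A exceeds 0.5.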
Integer NaN Issue¶
DataFrame columns with integer data need special care. Pandas dynamically converts integer columns to floating point when NaN values are needed. This is because Numpy does not support NaN values for integers. HPAT does not perform this conversion unless enough information is available at compilation time. Hence, the user is responsible for manual conversion of integer data to floating point data if needed.
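A minimal sketch of the manual conversion, shown in plain pandas. Series.shift() is used here only as a convenient way to introduce a NaN; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])             # integer column

# Plain pandas silently converts to float64 because a NaN is needed.
shifted = s.shift(1)

# Explicit conversion up front, so the column is already floating point
# before any operation that may produce NaN values.
s_float = s.astype(np.float64)
shifted_float = s_float.shift(1)
```

Converting before the NaN-producing operation keeps the column type fixed and known ahead of time, which is what a compiler needs when it cannot infer the conversion itself.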
Input/Output¶
pandas.read_csv()
- Arguments filepath_or_buffer, sep, delimiter, names, usecols, dtype, and parse_dates are supported.
- filepath_or_buffer, names and dtype arguments are required.
- names, usecols, parse_dates should be constant lists.
- dtype should be a constant dictionary of strings and types.
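A call that follows the read_csv() constraints above might look like the sketch below. It runs in plain pandas against a temporary file; the column names, the file contents, and the function name load_prices are hypothetical. The names, usecols and dtype arguments are written as literals so that they are constants.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_prices(path):
    return pd.read_csv(
        path,
        names=['id', 'price', 'label'],               # constant list
        usecols=['id', 'price'],                      # constant list
        dtype={'id': np.int64, 'price': np.float64},  # constant dict
    )

# Write a small header-less CSV file to read back.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write("1,2.5,x\n2,3.5,y\n")
    csv_path = f.name

prices = load_prices(csv_path)
os.remove(csv_path)
```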
pandas.read_parquet()
- If filename is constant, HPAT finds the schema from file at compilation time. Otherwise, schema should be provided.
General functions¶
pandas.merge()
- Arguments left, right, as_of, how, on, left_on and right_on are supported.
- on, left_on and right_on should be constant strings or constant lists of strings.
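As an illustration of the constant-key requirement for merge(), the sketch below passes on as a string literal. It uses plain pandas; the data frames and the function name join_on_key are hypothetical.

```python
import pandas as pd

def join_on_key(left, right):
    # The join key is a constant string, not a value computed at run time.
    return pd.merge(left, right, how='inner', on='key')

left = pd.DataFrame({'key': [1, 2, 3], 'x': [10, 20, 30]})
right = pd.DataFrame({'key': [2, 3, 4], 'y': [0.5, 1.5, 2.5]})
joined = join_on_key(left, right)
```

Only the keys present in both inputs (2 and 3) survive the inner join.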
pandas.concat()
- Input list or tuple of dataframes or series is supported.
Series¶
pandas.Series()
- Argument data can be a list or array.
Attributes:
Series.values
Series.shape
Series.ndim
Series.size
Methods:
Series.copy()
Indexing, iteration:
Series.iat
Series.iloc
Binary operator functions:
Series.add()
Series.sub()
Series.mul()
Series.div()
Series.truediv()
Series.floordiv()
Series.mod()
Series.pow()
Series.combine()
Series.lt()
Series.gt()
Series.le()
Series.ge()
Series.ne()
Function application, GroupBy & Window:
Series.apply()
Series.map()
Series.rolling()
Computations / Descriptive Stats:
Series.abs()
Series.corr()
Series.count()
Series.cov()
Series.cumsum()
Series.describe()
- Currently returns a string instead of a Series object.
Series.max()
Series.mean()
Series.median()
Series.min()
Series.nlargest()
Series.nsmallest()
Series.pct_change()
Series.prod()
Series.quantile()
Series.std()
Series.sum()
Series.var()
Series.unique()
Series.nunique()
Reindexing / Selection / Label manipulation:
Series.head()
Series.idxmax()
Series.idxmin()
Series.take()
Missing data handling:
Series.isna()
Series.notna()
Series.dropna()
Series.fillna()
Reshaping, sorting:
Series.argsort()
Series.sort_values()
Series.append()
Time series-related:
Series.shift()
String handling:
Series.str.contains()
Series.str.len()
DataFrame¶
pandas.DataFrame()
- Only the data argument with a dictionary input is supported.
Attributes and underlying data:
DataFrame.values
Indexing, iteration:
DataFrame.head()
DataFrame.iat
DataFrame.iloc
DataFrame.isin()
Function application, GroupBy & Window:
DataFrame.apply()
DataFrame.groupby()
DataFrame.rolling()
Computations / Descriptive Stats:
DataFrame.describe()
Missing data handling:
DataFrame.dropna()
DataFrame.fillna()
Reshaping, sorting, transposing:
DataFrame.pivot_table()
- Arguments values, index, columns and aggfunc are supported.
- Annotation of pivot values is required. For example, @hpat.jit(pivots={'pt': ['small', 'large']}) declares that the output pivot table pt will have columns called small and large.
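To make the pivot annotation concrete, the sketch below builds a pivot table whose output columns are exactly small and large. It runs in plain pandas; the data and the function name make_pivot are hypothetical. Under HPAT, the function would carry the pivots annotation described above, since the set of output columns depends on the data and cannot be inferred at compilation time.

```python
import pandas as pd

def make_pivot(df):
    # The distinct values of column 'C' become the output columns of pt.
    pt = df.pivot_table(values='D', index='A', columns='C', aggfunc='sum')
    return pt

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'C': ['small', 'large', 'small', 'large'],
    'D': [1, 2, 3, 4],
})
pt = make_pivot(df)
```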
DataFrame.sort_values()
- by argument should be a constant string or constant list of strings.
DataFrame.append()
DatetimeIndex¶
DatetimeIndex.year
DatetimeIndex.month
DatetimeIndex.day
DatetimeIndex.hour
DatetimeIndex.minute
DatetimeIndex.second
DatetimeIndex.microsecond
DatetimeIndex.nanosecond
DatetimeIndex.date
DatetimeIndex.min()
DatetimeIndex.max()
TimedeltaIndex¶
TimedeltaIndex.days
TimedeltaIndex.seconds
TimedeltaIndex.microseconds
TimedeltaIndex.nanoseconds
Timestamp¶
Timestamp.day
Timestamp.hour
Timestamp.microsecond
Timestamp.month
Timestamp.nanosecond
Timestamp.second
Timestamp.year
Timestamp.date()
Window¶
Rolling.count()
Rolling.sum()
Rolling.mean()
Rolling.median()
Rolling.var()
Rolling.std()
Rolling.min()
Rolling.max()
Rolling.corr()
Rolling.cov()
Rolling.apply()
GroupBy¶
GroupBy.apply()
GroupBy.count()
GroupBy.max()
GroupBy.mean()
GroupBy.median()
GroupBy.min()
GroupBy.prod()
GroupBy.std()
GroupBy.sum()
GroupBy.var()