TLDR;
I'm trying to understand why list-like function arguments in DataFrame methods behave the same as they do in Series methods.
Basically, in a Series, if you pass a list-like of functions to apply or transform, pandas first tries to pass each value of the Series individually into the function. If that doesn't work, it passes the entire Series into the function. By the same logic, for a DataFrame I would expect a row/column to be passed to the function first and then, if that doesn't work, the entire DataFrame.
But this is not the case. The DataFrame's behaviour for list-like functions is exactly the same as the Series's: it first tries individual values and then entire columns. The level of granularity doesn't increase. I want to understand why it was written this way. I have seen the source code behind it.
In Detail
I'm exploring the apply and transform methods in pandas. I understand how they behave when a single function is supplied.

For Series:
- apply - The function is applied to each value.
- transform - The function is passed to apply. If that doesn't work, the entire Series is passed to the function.

For DataFrame, the function is applied to each column or row:
- apply - The function is applied to each column/row.
- transform - The function is passed to apply. If that doesn't work, the entire DataFrame is passed to the function.

The confusion arises when using list-like function arguments. The docs for all of the above methods say the same thing via the by_row parameter, as follows:
by_row : False or “compat”, default “compat”
... If func is a list or dict of callables, will first try to translate each func into pandas methods. If that doesn’t work, will try call to apply again with by_row="compat" and if that fails, will call apply again with by_row=False (backward compatible) ...
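Before moving on to the list-like case, the single-function behaviour summarized earlier can be checked with a small recorder. This is only a sketch: `type_recorder` is a hypothetical helper (not part of pandas) that notes the type of every argument it is called with, and it never raises, so only the first attempt of each method is visible.

```python
import pandas as pd

def type_recorder(seen):
    """Hypothetical helper: build a function that logs the type name of each argument."""
    def func(val):
        seen.append(type(val).__name__)
        return val  # return the input unchanged so transform keeps the shape
    return func

sr = pd.Series([1, 2, 3])
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Series.apply with one function: called once per value.
series_apply_seen = []
sr.apply(type_recorder(series_apply_seen))

# DataFrame.apply with one function: called once per column (axis=0 default).
frame_apply_seen = []
df.apply(type_recorder(frame_apply_seen))

# DataFrame.transform with one function: goes through apply first,
# so the function again receives columns.
frame_transform_seen = []
df.transform(type_recorder(frame_transform_seen))

print(series_apply_seen)
print(frame_apply_seen)
print(frame_transform_seen)
```

The Series case records scalar types only, while both DataFrame cases record Series, which matches the "one level down" granularity described above.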
Now, let's start with the apply function. For both Series and DataFrames, it all comes down to this piece of source code for list-like functions.
    def apply_compat(self):
        obj = self.obj
        func = self.func

        if callable(func):
            f = com.get_cython_func(func)
            if f and not self.args and not self.kwargs:
                return obj.apply(func, by_row=False)

        try:
            result = obj.apply(func, by_row="compat")
        except (ValueError, AttributeError, TypeError):
            result = obj.apply(func, by_row=False)
        return result

So, here is some sample code to see that, when using list-like functions, apply first passes a single value and then the entire Series.
    import pandas as pd

    def func(val):
        print("Val - ", val, "\nVal Type - ", type(val), "\n")
        if not isinstance(val, pd.Series):
            raise TypeError("Not Series")
        return val.iloc[0:3]

    sr = pd.Series([1,2,3,4,5])
    sr.apply([func]) # passed the function as a list

    --- Output
    Val -  1
    Val Type -  <class 'int'>

    Val -  0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
    Val Type -  <class 'pandas.core.series.Series'>

       func
    0     1
    1     2
    2     3

Now, let's go to transform. The following source code helps us understand the behaviour of list-like functions in Series.
    def transform_str_or_callable(self, func) -> DataFrame | Series:
        obj = self.obj
        args = self.args
        kwargs = self.kwargs

        if isinstance(func, str):
            return self._apply_str(obj, func, *args, **kwargs)

        # Two possible ways to use a UDF - apply or call directly
        try:
            return obj.apply(func, args=args, **kwargs)
        except Exception:
            return func(obj, *args, **kwargs)

Basically, it catches any type of exception raised while using apply. So, if I just change the exception type in my custom function, I can see its behaviour (I also have to return the entire Series, because the result must have the same axis shape as the input).
    def func(val):
        print("Val - ", val, "\nVal Type - ", type(val), "\n")
        if not isinstance(val, pd.Series):
            raise Exception("Not Series")
        return val

    sr = pd.Series([1,2,3,4,5])
    sr.transform([func])

    --- Output
    Val -  1
    Val Type -  <class 'int'>

    Val -  0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
    Val Type -  <class 'pandas.core.series.Series'>

       func
    0     1
    1     2
    2     3
    3     4
    4     5

Now, my doubt is with DataFrames. With DataFrames, I was hoping things would go this way: first try passing each Series and, if that doesn't work, pass the entire DataFrame to the function. But that is not the case.
    def func(val):
        print("Val - ", val, "\nVal Type - ", type(val), "\n")
        if not isinstance(val, pd.DataFrame):
            raise TypeError("Not DataFrame")
        return val.iloc[0:3]

    df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['A', 'B'])
    df.apply([func])

    --- Output
    Val -  1
    Val Type -  <class 'int'>

    Val -  0    1
    1    3
    2    5
    Name: A, dtype: int64
    Val Type -  <class 'pandas.core.series.Series'>

    TypeError: Not DataFrame

The same happens with transform.
    def func(val):
        print("Val - ", val, "\nVal Type - ", type(val), "\n")
        if not isinstance(val, pd.DataFrame):
            raise Exception("Not DataFrame")
        return val

    df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['A', 'B'])
    df.transform([func])

    --- Output
    Val -  1
    Val Type -  <class 'int'>

    Val -  0    1
    1    3
    2    5
    Name: A, dtype: int64
    Val Type -  <class 'pandas.core.series.Series'>

    ValueError: Transform function failed

Why does a list-like function argument in DataFrame behave the same as in Series? I have seen the source code behind it. It's here and here.
    if is_list_like(func) and not is_dict_like(func):
        func = cast(list[AggFuncTypeBase], func)
        # Convert func equivalent dict
        if is_series:
            func = {com.get_callable_name(v) or v: v for v in func}
        else:
            func = dict.fromkeys(obj, func)

    ----

    for name, how in func.items():
        colg = obj._gotitem(name, ndim=1)
        results[name] = colg.transform(how, 0, *args, **kwargs)

Basically, it makes a dictionary mapping each column name to the list of functions and then calls transform on each column. The same is true for apply.
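The conversion step in that excerpt can be reproduced with public API only. This is a rough sketch, not the pandas implementation: `df[name]` stands in for the private `obj._gotitem(name, ndim=1)` call, and the identity `func` is a placeholder.

```python
import pandas as pd

def func(val):
    # Placeholder function; returns its input unchanged.
    return val

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])

# Iterating over a DataFrame yields its column labels, so dict.fromkeys
# maps every column label to the same list of functions.
per_column = dict.fromkeys(df, [func])
print(list(per_column))

# Each column is then transformed on its own as a Series, which is why the
# function is only ever offered scalars or a single column.
results = {}
for name, how in per_column.items():
    column = df[name]  # assumption: public stand-in for obj._gotitem(name, ndim=1)
    results[name] = column.transform(how)

print(list(results))
```

Seen this way, the DataFrame path is a loop over Series.transform calls, so the whole DataFrame is never a candidate argument.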
I want to know why it wasn't designed like Series: first try a part of the object and then pass the entire object. So, for a DataFrame, first try a row/column and, if that doesn't work, pass the entire DataFrame. I am trying to remember how these methods behave, but for DataFrames the behaviour with list-like function arguments differs completely from the behaviour with a single (non-list-like) function.
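The difference between the two call styles can be put side by side. This is a minimal sketch based on the source excerpts quoted above, so the exact behaviour may vary between pandas versions; `needs_frame` is a hypothetical function that only accepts a whole DataFrame.

```python
import pandas as pd

def needs_frame(val):
    # Hypothetical function: rejects anything that is not a DataFrame.
    if not isinstance(val, pd.DataFrame):
        raise Exception("Not DataFrame")
    return val

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])

# Bare callable: the per-column attempt via apply fails, then the
# whole DataFrame is passed to the function.
bare_result = df.transform(needs_frame)

# Same callable wrapped in a list: handled column by column, so the
# function is never given the DataFrame and the call fails.
try:
    df.transform([needs_frame])
    list_error = None
except Exception as err:
    list_error = err

print(bare_result.equals(df))
print(type(list_error).__name__)
```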
