Haziqa Sajid
Data Scientist

Common Error Messages in Pandas

Errors and Efficiency Improvements for Pandas Code
January 18, 2024
20 min read

TL;DR

Pandas is a powerful Python library for data analysis, but users often run into common errors. This blog post addresses 10 such errors and their solutions, and also provides efficiency tips for Pandas code, such as using built-in vectorized functions, choosing better storage formats, and filtering and plotting effectively.

Introduction

Pandas is a popular Python library that allows developers to work with tabular data from various sources, including CSV, XLSX, SQL, and JSON. It is widely used by the data science and machine learning (ML) communities for data analysis, exploration, and visualization. The framework is built on top of NumPy and integrates tightly with Matplotlib, acting as a concise wrapper that streamlines access to their functionality with minimal code.

Pandas loads all data files as a `DataFrame` object, which has access to all relevant statistical and visualization functions required for exploratory data analysis (EDA). Moreover, Pandas is open-source, user-friendly, and has an active community of contributors with extensive documentation.  

Although Pandas has transformed Python completely with its user-friendly features and powerful capabilities for data analysis, like any tool, challenges may arise for users. 

This article will dive into some of the most common Pandas error messages developers encounter and offer solutions. A link to the accompanying notebook can be found here.

10 Common Error Messages in Pandas and How to Avoid Them

1. Pandas Module Not Found

Programmers new to Python often encounter the Pandas Not Found error. It arises from trying to import Pandas when it is not installed on the system.

The code snippet is as follows:

import pandas 
df = pandas.DataFrame({'name':['Maurice','Alice','Bob'], 'weight':[56,76,50]})

The error looks like this:

> ModuleNotFoundError: No module named 'pandas'

Solution: Install the library from the official distribution using the pip package manager. Here’s how to do it.

python -m pip install pandas
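To confirm the installation worked, you can import the library and print its version; any version string printed indicates success:

```shell
# Verify the install by importing pandas and printing its version
python -c "import pandas; print(pandas.__version__)"
```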

2. Calling the DataFrame

DataFrames and Series are the core data structures used in Pandas for data analysis. A DataFrame has a tabular format, organized into rows and columns, while a Series is a list-like structure comprising a single column. Both are objects, not functions.

This error occurs when users try to call a DataFrame as if it were a function, which raises a TypeError. Here's how it looks:

import pandas as pd 
df = pd.DataFrame(
{'cities':['NYC','Delhi','Tokyo'], 
'population':[856,656,765]}) 

df()

The error looks like this:

> TypeError: 'DataFrame' object is not callable

Solution: Remove the parentheses after the DataFrame name and call an appropriate method on the object.

import pandas as pd 
df = pd.DataFrame( {'cities':['NYC','Delhi','Tokyo'], 'population':[856,656,765]}) 

df.head() # Fix (head displays first 5 rows of dataframe by default)

3. Columns Name Not Found

This error message can come in different forms, but knowing the difference between an attribute and a key will help solve this problem. Attributes are properties or characteristics that can be assigned to classes, while keys are unique identifiers for data. Here's how Pandas makes use of them:

import pandas as pd 
df = pd.DataFrame({'name':['Maurice','Alice','Bob'], 'weight':[56,76,50]}) 

df.weight # df’s weight as an attribute 
df["weight"] # df’s weight as a key

The error arises when the named column does not exist. In the following example, the first letter of the attribute name is typed in uppercase, which throws an error since attributes and keys are case-sensitive.

df.Weight # Attribute Error

The error is as follows

> AttributeError: 'DataFrame' object has no attribute 'Weight'

The key error looks like this

df["Weight"] # Key Error 

The error is as follows

> KeyError: 'Weight'

Solution: Recheck the names of the columns. A typo is likely the reason behind the error. It is also possible that the column does not exist, in which case you might want to recheck your data source.

Pro-tip: When naming columns, avoid inserting spaces between names such as "column name." The column will not be accessible as an attribute. Use underscores instead, e.g., “column_name”.
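As a quick illustration of the pro-tip, existing column names containing spaces can be normalized in one line (the column names here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'first name': ['Maurice', 'Alice'],
                   'body weight': [56, 76]})

# Replace spaces with underscores so columns also work as attributes
df.columns = df.columns.str.replace(' ', '_')

print(df.first_name)  # now accessible as an attribute
```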

4. Duplicated Index

Indexes are ideally unique; however, Pandas allows users to insert duplicate entries as the index. A common error arises when users assume that indexes in Pandas are inherently unique.

Here’s an example:

df = pd.DataFrame({'name':['Maurice','Alice','Bob'],
                   'weight':[56,76,50]},
                    index=[1,1,2]) 

print(df) 

The result for the dataframe is as follows:

>	     name  weight
1  Maurice      56
1    Alice      76
2      Bob      50

As can be seen, the indexes are repeated. This can lead to many subtle errors, and if you later try to reindex, Pandas raises an error:

new_index = [0, 1, 2]
df.reindex(new_index)
> ValueError: cannot reindex on an axis with duplicate labels

Solution: To reindex, remove the duplicate labels. Here’s how we can do it.

df = df[~df.index.duplicated(keep='first')]

This keeps the first occurrence of each duplicated label and removes the rest. Here's how it looks:

	     name  weight
1  Maurice      56
2      Bob      50

Now, we can easily reindex the dataframe.
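Putting the whole fix together, deduplicating the index and then reindexing runs without error (the new index values below are arbitrary for the example; labels absent from the data become NaN rows):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Maurice', 'Alice', 'Bob'],
                   'weight': [56, 76, 50]},
                  index=[1, 1, 2])

# Drop duplicate index labels, keeping the first occurrence
df = df[~df.index.duplicated(keep='first')]

# Reindexing now works; label 3 has no data, so that row is NaN
df = df.reindex([1, 2, 3])
print(df)
```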

5. When Using all Scalar Values, Pass an Index

In Pandas, a scalar value refers to a single atomic data point. It is a singular element, such as an integer, float, string, or other primitive data type. When creating a DataFrame, Pandas throws a ValueError if only scalar values are passed for the columns.

Here's an example:

import pandas as pd 
df = pd.DataFrame({'name':"Bob", 'age' : 12})

Here, the column name and age have scalar values, which will result in the following error:

> ValueError: If using all scalar values, you must pass an index

The reason is that the DataFrame constructor expects each column's values to be iterable, not a single scalar.

Solution: To resolve the error, you can choose between two approaches. The first way is to specify the index. Here’s how:

df = pd.DataFrame({'name':"Bob", 'age':12},index=[1])

The second way is to pass the values as a list. Let’s take a look:

df = pd.DataFrame({'name':["Bob"], 'age':[12]})
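A third option, useful when building rows programmatically, is to pass a list of dictionaries; each dictionary becomes one row, so scalar values are fine and no index is required:

```python
import pandas as pd

# Each dictionary is one row, so scalar values need no wrapping
df = pd.DataFrame([{'name': 'Bob', 'age': 12},
                   {'name': 'Alice', 'age': 15}])
print(df)
```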

6. `loc` and `iloc`

The `loc` and `iloc` indexers are used to traverse the DataFrame by label and by integer position, respectively. Both help filter data down to specific rows and columns.

`loc` is label-based: it takes the name of the row or column, includes the last element of a slice, and accepts boolean Series for conditional selection. In contrast, `iloc` is position-based: it takes integer indices, excludes the last element of a slice, and accepts boolean arrays but not label-aligned boolean Series.

The primary distinction shows up in the errors each one raises. Let's take an example:

data = pd.DataFrame(
                 {'Brand': ['Honda', 'Buggati', 'Ferrari'], 
                  'Year': [2014, 2018, 2000], 
                  'City': ['TX', 'LA',  'NY'], 
                  'Mileage':  [40, 20, 30]}) 


data.iloc[(data.Brand == 'Honda')]

This will result in an error message:

> NotImplementedError: iLocation based boolean indexing on an integer type is not available

Solution: `iloc` is not label-based; replacing it with `loc` will do the trick.

data.loc[(data.Brand == "Honda")]

The selection now works as intended:

   Brand  Year City  Mileage
0  Honda  2014   TX       40
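The inclusive-versus-exclusive slicing difference mentioned above is easy to verify on the same DataFrame:

```python
import pandas as pd

data = pd.DataFrame({'Brand': ['Honda', 'Buggati', 'Ferrari'],
                     'Year': [2014, 2018, 2000]})

# loc slices by label and INCLUDES the end label: rows 0 and 1
print(data.loc[0:1, 'Brand'])

# iloc slices by position and EXCLUDES the end position: row 0 only
print(data.iloc[0:1]['Brand'])
```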

7. Series Length Mismatch

Pandas provides functions and operator overloads for comparing Series or DataFrames. Two Series or DataFrames are comparable only if they have the same length; otherwise, an error is thrown. For example:

import pandas as pd 
s1 = pd.Series([1,2,3]) 
s2 = pd.Series([4,5])

s1 == s2 

The equality operator performs an element-wise comparison of the two Series. Since their lengths do not match, the following error is thrown:

> ValueError: Can only compare identically-labeled Series objects

Solution: To resolve the issue, a simple fix is to make the two Series the same length:

s2 = pd.Series([4,5,6]) # Fix (Make the same length) 

s1 == s2
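If the lengths genuinely differ, another option is to align the two Series first so their labels match; this is one possible approach, not the only fix:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])

# align() pads the shorter Series with NaN so both share the same labels
a, b = s1.align(s2)
print(a == b)  # NaN positions compare as False
```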

8. SettingWithCopyWarning

Manipulating a Pandas DataFrame results in either a view or a copy. While a view and a copy of a DataFrame may appear identical in values, they have distinct characteristics. A view refers to a portion of an existing DataFrame, whereas a copy is an entirely separate DataFrame, identical to the original one.

Modifying a view impacts the original DataFrame, whereas changes to a copy do not affect the original. It's crucial to correctly identify whether you are modifying a view or a copy to avoid unintended alterations to your DataFrame.

Let’s take a look:

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]}) 

print("1)", df[df['A'] > 2]['B'])


print("2)",df.loc[df['A'] > 2, 'B'])

When we output them, they look no different than each other. For example,

1) 2    6
Name: B, dtype: int64
2) 2    6
Name: B, dtype: int64

The problem with chained assignment lies in the uncertainty of whether a view or a copy is returned, making it difficult to predict the outcome. This becomes a significant concern when assigning values back to the DataFrame. When values are assigned to the dataframe with chained assignment, it usually throws this warning.

> SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

Solution: Use the `loc` indexer for assignment, as it always operates on the original DataFrame. For example, to change the value, this is what we do:

df.loc[df['A'] > 2, 'B'] = 2
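If you actually want an independent slice, make the copy explicit with `.copy()`; then modifying it cannot touch the original and no warning is raised:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Explicit copy: modifications stay local, no SettingWithCopyWarning
subset = df[df['A'] > 2].copy()
subset['B'] = 0

print(df['B'].tolist())  # original is untouched: [4, 5, 6]
```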

9. Misinterpreted Datatypes

It's a common practice among some programmers to skip specifying columns and datatypes when importing data into a DataFrame. In such instances, Pandas reads the entire dataset into memory to infer the data types, leading to potential memory pressure and increased processing time. A column with inconsistent datatypes can also raise a warning that masks many unseen errors. This warning arises when handling larger files, because `dtype` checking occurs per chunk read. Despite the warning, the CSV file is still read, and the column with mixed types ends up with the generic `object` dtype.

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})

df.to_csv('test.csv', index=False) 

df2 = pd.read_csv('test.csv')

The warning looks like this:

> DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

The fix for this is straightforward. When reading the CSV file, specify the data type.

For example,

df2 = pd.read_csv('test.csv', sep=',', dtype={'a': str})

The `dtype` parameter allows you to explicitly define the data type for individual columns. This will not only prevent potential errors like data mismatch while doing operations but also save processing time.  
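If the column is supposed to be numeric and the stray strings are noise, `pd.to_numeric` with `errors='coerce'` turns the bad values into NaN after loading; this is one possible cleanup, depending on what the mixed values mean in your data:

```python
import pandas as pd

s = pd.Series(['1', 'X', '3'])

# Non-numeric entries become NaN instead of raising an error
cleaned = pd.to_numeric(s, errors='coerce')
print(cleaned)
```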

10. Empty Data Sources

When scraping data from the internet, information is sometimes retrieved unsuccessfully. During subsequent analysis, a common error encountered is the `EmptyDataError`. This error occurs when working with empty datasets.

Here’s what the error looks like.

import pandas as pd 

pd.read_csv("test.csv")

Let’s assume `test.csv` is empty. It will throw the following error:

> EmptyDataError: No columns to parse from file

When many files are processed with Pandas, this error can break the whole pipeline. We can handle it by catching the exception as follows:

import pandas as pd

for path in file_paths:
    try:
        pd.read_csv(path)
    except pd.errors.EmptyDataError:
        print(path, "is empty")

Here, `file_paths` is a list of filenames that can be collected with the `os` library, and a try-except clause around `pd.errors.EmptyDataError` handles the empty files gracefully.
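Gathering the file paths themselves can be done with the standard library; here is one way using `glob` (the `data/` directory name is made up for the example):

```python
import glob
import pandas as pd

frames = []
for path in glob.glob('data/*.csv'):  # 'data/' is a hypothetical folder
    try:
        frames.append(pd.read_csv(path))
    except pd.errors.EmptyDataError:
        print(path, 'is empty, skipping')
```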

Tips for Improving Efficiency for Pandas Code

While addressing common errors in Pandas, it's also essential to consider practical tips for optimizing efficiency. Here are some tips to improve the code for Pandas:

  • Use built-in functions: Functions implemented within the Pandas DataFrame are highly optimized and use vectorized computation. Preferring them over explicit loops can improve performance significantly.

data = {'numbers': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Vectorized operation to calculate the square
df['squared'] = df['numbers'] ** 2

This approach leverages NumPy arrays internally and accelerates computation by avoiding Python code in the inner loop. Arithmetic operations are delegated to the underlying arrays and execute in machine code, without the overhead of slow interpreted Python.
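To see the difference concretely, here is the same squaring done with an explicit Python loop and with the vectorized operator; both produce identical results, but the vectorized form runs in NumPy's compiled code:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5]})

# Slow: a Python-level loop over each element
squared_loop = [n ** 2 for n in df['numbers']]

# Fast: one vectorized operation delegated to NumPy
df['squared'] = df['numbers'] ** 2

print(df['squared'].tolist())  # [1, 4, 9, 16, 25]
```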

  • `query()`: A common Pandas use case is filtering a dataset, and there are many ways to achieve it. The `query()` method can handle almost all of the filtering, whether comparisons, chained comparisons, string matches, and much more.
df = pd.DataFrame({'name':['Maurice','Alice','Bob'],
                   'weight':[56,76,50]}) 

df.query('weight > 70') 
  • Better formats for storing datasets: CSV, a row-based format, is suitable for smaller datasets but inefficient for larger ones due to processing one full row at a time. In contrast, columnar formats like Parquet and Feather organize data by column, enabling more efficient access by reading only the required columns.
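Regardless of format, loading only the columns you need already helps; with CSV this is the `usecols` parameter (columnar formats like Parquet make the same selective read even cheaper). The file name below is made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Maurice', 'Alice'],
                   'weight': [56, 76],
                   'city': ['NYC', 'LA']})
df.to_csv('people.csv', index=False)  # hypothetical file name

# Load only the columns needed for the analysis
slim = pd.read_csv('people.csv', usecols=['name', 'weight'])
print(slim.columns.tolist())  # ['name', 'weight']
```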
  • Plotting: Leveraging Matplotlib underneath, Pandas enables users to generate insightful plots directly from DataFrames. Here's how we can do it:

df = pd.DataFrame({'name':['Maurice','Alice','Bob'],
                 'weight':[56,76,50]}) 

df.plot()

There are many other ways to improve code efficiency. With Pandas 2.0 features like the PyArrow backend for faster, more memory-efficient operations, nullable data types for handling missing values, and copy-on-write optimization, developers can manage resources and enhance performance for data manipulation tasks.

To further enhance performance for data analysis tasks, read the article Pandas2 and Polars for Feature Engineering.

Summary

In this article, we have seen many commonly occurring errors and their solutions, including a missing Pandas installation, calling a DataFrame as a function, column access errors, duplicate indexes, scalar value handling, and the correct use of `loc` and `iloc`.

Additionally, we covered warnings related to data type inconsistencies and addressed the SettingWithCopyWarning to ensure more predictable results. Toward the end, we also introduced tips like vectorization, querying, and plotting for keeping Pandas code efficient.
