In this article, I am going to cover the various ways we can work with JSON data in Python. JSON stands for JavaScript Object Notation and has become one of the most important formats for storing and transferring data across systems. This is due to its easy-to-understand structure and because it is very lightweight. You can easily write both simple and nested data structures in JSON, and programs can read them just as easily. In my opinion, JSON is much more human-readable than XML, although both are used to store and transfer data. In modern web applications, JSON is the default choice for transferring information.
Understanding the JSON data structure
First, let’s begin by understanding how JSON looks and how to deal with it.
Figure 1 – A sample JSON structure
In the figure above you can see a sample data structure represented in JSON. The sample is a representation of this article. The top-level node of the sample is data, under which a list is created using square brackets ([]). Inside the square brackets, you can have multiple JSON objects or strings as required. To keep things simple, I have used only one item in the list. That item contains the keys type, id, attributes, and author, which describe the submitted article. The attributes and author keys hold nested objects: attributes expands to title, description, created, and updated, while author expands to id and name.
A quick glance at the overall data structure is enough to determine the relationship between the article and the author, which makes it easy for both humans and machines to understand.
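Since the figure itself may not be visible, here is a rough sketch of what such a structure could look like, written as a JSON string inside Python. Only the key names come from the description above; every value (the title, dates, ids, and author name) is a made-up placeholder.

```python
import json

# Key names follow the description above; all values are hypothetical.
sample = """
{
  "data": [
    {
      "type": "articles",
      "id": "1",
      "attributes": {
        "title": "A sample article",
        "description": "A short description of the article",
        "created": "2021-08-01",
        "updated": "2021-08-02"
      },
      "author": {
        "id": "42",
        "name": "Jane Doe"
      }
    }
  ]
}
"""

# "data" is a list, so the single article is its first element.
article = json.loads(sample)["data"][0]
print(article["attributes"]["title"])
print(article["author"]["name"])
```

Nested objects are reached by chaining keys, and list items by their position.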
Concept of serialization and deserialization of JSON
So far, we have seen what JSON looks like and how to interpret a JSON data structure. Now, we should understand how to use this data in Python and operate on it as required. While dealing with JSON, we often come across two terms: serialization and deserialization. JSON itself is just text, that is, a string that holds data, typically as key-value pairs. For a program to work with this string, it needs to be converted into an object that the interpreter can consume. The process of converting a JSON string into a Python object is called deserialization, and the process of converting a Python object back to JSON is called serialization.
Let’s now understand and try to do this using python.
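The snippet below is a minimal sketch of both directions using the standard json module. The variable names and the sample string are illustrative assumptions, not the article's original snippet.

```python
import json

# A JSON document held in an ordinary Python string (sample data is made up).
json_string = '{"title": "Sample article", "tags": ["json", "python"], "published": true}'

# Deserialization: JSON text -> Python object (here a dict).
article = json.loads(json_string)
print(type(article))
print(article["tags"])

# Serialization: Python object -> JSON text (a str).
round_trip = json.dumps(article)
print(type(round_trip))
print(round_trip)
```

Note how the JSON literal true becomes the Python value True after deserialization and is written back as true on serialization.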
Figure 2 – Console output from the above snippet
If you look at the code above, you will notice that I have imported the json module into the script. This is the module that ships with Python's standard library for working with JSON data. You can read more about it in the official documentation. It provides four basic functions:
- json.dump – This function serializes a Python object from memory into a JSON-formatted stream that is written to a file or file-like object
- json.dumps – This serializes a Python object in memory to a string in the JSON format. The difference between the two is that json.dump writes the JSON to a stream, while json.dumps returns a string
- json.load – You can use this function to load data from a JSON file that exists on the file system. It reads from an open file object and deserializes the data into a Python object
- json.loads – This is similar to json.load. The only difference is that it reads from a string that contains data in the JSON format
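The four functions can be sketched side by side as follows. The record and the file name record.json are hypothetical, and the file is written to a temporary directory so the example does not touch your working folder.

```python
import json
import os
import tempfile

record = {"id": 1, "name": "sample", "scores": [9.5, 7.0]}  # made-up data

# String-based pair: dumps returns a str, loads parses a str.
as_text = json.dumps(record)
same_record = json.loads(as_text)

# File-based pair: dump writes to an open file, load reads from one.
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2)  # indent only affects readability

with open(path, encoding="utf-8") as fh:
    from_file = json.load(fh)

print(same_record == record, from_file == record)
```

A simple way to remember the difference: the functions ending in "s" work with strings, and the others work with file objects.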
In my experience, you will use json.loads and json.dumps far more often than their file-based counterparts. An important point worth mentioning is that the json module only works with built-in Python data types such as strings, numbers, booleans, None, lists, and dictionaries. If you want to work with a custom data type, you first need to convert it to a Python dictionary (or another supported type) and then serialize that to JSON.
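As a rough illustration of the custom-type case, the sketch below uses a dataclass named Author, which is a hypothetical type invented for this example. Passing it straight to json.dumps raises a TypeError, while converting it to a dictionary first works.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Author:  # hypothetical custom type, for illustration only
    id: int
    name: str


author = Author(id=7, name="Jane Doe")

# The json module does not know how to serialize the custom object directly.
try:
    json.dumps(author)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Converting to a plain dictionary first makes it serializable.
author_json = json.dumps(asdict(author))
print(author_json)
```

For classes that are not dataclasses, you would write the conversion to a dictionary yourself; asdict is just a convenient shortcut here.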
Using Pandas to read JSON data
So far, we have learned how to handle JSON data with the json module in Python. Now let us also take a look at the Pandas library and how to read and write JSON data with it. As you might be aware, Pandas is used extensively in data science to analyze existing data and discover insights from it.
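A minimal sketch of reading JSON into a dataframe is shown below. The sample records are made up, and the string is wrapped in io.StringIO because recent Pandas versions deprecate passing a literal JSON string directly to read_json().

```python
from io import StringIO

import pandas as pd

# A JSON array of objects; each object becomes one row (sample data is made up).
json_text = (
    '[{"id": 1, "title": "First post", "views": 120},'
    ' {"id": 2, "title": "Second post", "views": 85}]'
)

# StringIO makes the string look like a file to read_json().
df = pd.read_json(StringIO(json_text))
print(df)
```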
If you run the code above, you will get the data loaded into a Pandas dataframe.
Figure 3 – JSON Data loaded as Pandas Dataframe
As you can see in the figure above, the read_json() function in Pandas reads JSON from a string or a file and converts it into a Pandas dataframe. It also accepts several other parameters; I will discuss the most important ones below.
- path_or_buf – The first parameter accepted by this function is the source of the JSON: a file path, a URL, or a file-like object. Older Pandas versions also accept the JSON string itself, but recent versions deprecate this, so it is safer to wrap a string in io.StringIO
- orient – This parameter defines the layout in which the JSON is expected. The accepted values include split, records, index, columns, values, and table
- typ – This defines the type of object that the function returns. By default (frame), it returns a dataframe, but it can be set to series to return a series instead
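To make the orient and typ parameters more concrete, here is a small sketch under the same assumption as before (strings wrapped in io.StringIO). All of the sample data and variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# orient="records": a list of row objects.
records_text = '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'
by_records = pd.read_json(StringIO(records_text), orient="records")

# orient="index": an object keyed by row label, each value being one row.
index_text = '{"row1": {"name": "a", "value": 1}, "row2": {"name": "b", "value": 2}}'
by_index = pd.read_json(StringIO(index_text), orient="index")

# typ="series": return a Series instead of a DataFrame.
series = pd.read_json(StringIO('{"x": 10, "y": 20}'), typ="series")

print(by_records)
print(by_index)
print(series)
```

The same rows are loaded in the first two calls; only the shape of the incoming JSON differs, which is exactly what orient describes.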
So far, we have seen how to read JSON-formatted data using Pandas. Now, let us also understand how to export data from a Pandas dataframe back to JSON. In other words, we are going to serialize a Pandas dataframe to a JSON string.
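A minimal sketch of this export is shown below, using a small made-up dataframe. The result is parsed back with the json module only to show that the output is ordinary JSON text.

```python
import json

import pandas as pd

# A small dataframe with made-up values.
df = pd.DataFrame({"id": [1, 2], "title": ["First post", "Second post"]})

# With no arguments, to_json() returns the JSON as a string.
json_text = df.to_json()
print(json_text)

# The string is ordinary JSON, so the json module can parse it back.
parsed = json.loads(json_text)
print(parsed["title"]["0"])
```

With the default settings, the output is keyed by column name first and by row label second, and the row labels are written as strings.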
Figure 4 – Converting Pandas DataFrame to JSON
As you can see in the figure above, when we execute the snippet, the Pandas dataframe is converted into a JSON string, which is then printed to the console. This is done with the to_json() method available in Pandas, which helps us convert existing data to a JSON string. The important parameters accepted by this method are as follows.
- path_or_buf – Unlike the corresponding parameter of read_json(), this one is an output target. It is an optional file path or file-like object to which the serialized JSON is written. If it is omitted, to_json() returns the JSON as a string
- orient – This defines the layout in which the data is exported. The accepted values for a dataframe are split, records, index, columns, values, and table. When the method is called on a dataframe, the default is columns
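The sketch below shows how these two parameters change the result. The dataframe and the file name articles.json are hypothetical, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "title": ["First post", "Second post"]})

# orient="records": one JSON object per row, without row labels.
as_records = df.to_json(orient="records")

# orient="split": columns, index, and data stored as separate keys.
as_split = df.to_json(orient="split")

# Passing a path writes the JSON to that file and returns None.
path = os.path.join(tempfile.mkdtemp(), "articles.json")
result = df.to_json(path, orient="records")

with open(path, encoding="utf-8") as fh:
    on_disk = json.load(fh)

print(as_records)
print(as_split)
print(result, on_disk)
```

The records layout is usually the easiest to hand to other systems, while split keeps the row labels and column order explicit.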
You can follow the official documentation from Pandas to learn more about handling JSON data with Pandas.
In this article, we have seen what JSON is and how to work with JSON data in Python using various libraries. JSON is a flexible data format that can be used in almost every modern application. It is easily read and understood by humans as well as machines and, as a result, has gained a lot of popularity with developers. JSON data can be structured, semi-structured, or completely unstructured. It is also used in the responses generated by REST APIs, and it represents objects as key-value pairs, just like the Python dictionary object.