Connect to the data in object storage, use REST API GET & POST requests in the pipeline
Every analytics project starts with getting the data. We at Datrics aim to provide all the necessary tools to retrieve data efficiently, whether it sits in your own data storage or comes from an external source. With this update, we bring connectors to AWS S3 and GCS object storage and to the Amazon Athena query service, and we enhance the REST API brick.
Let’s see what is new in Datrics.
Object storage connectors
New data connectors for AWS S3 and Google Cloud Storage retrieve CSV and Parquet files. To retrieve the data, create a data source, create a dataset, and add it to your pipeline.
There are two options to define the file path: static and dynamic.
In Static mode, you may define the direct path to the file or folder in the bucket.
In Dynamic mode, you may define the set of files or folders using Python code. For example, when you need to retrieve a file from a folder with today’s date in its name, you may do that with a single line of code. In this case, the pipeline retrieves a new file each day.
You may also load all the files from a folder by setting the path to the folder instead of the direct path to a file. If all the files have the same structure and type, the dataset will include the data from all the files in the folder. Please note that files from subfolders will not be loaded.
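As an illustration, a dated path like the one in the example could be built as follows. This is a minimal sketch: the `reports` prefix, the `sales.csv` file name, and the date format are hypothetical, and the exact form in which Datrics expects the dynamic code to return the path is not covered here.

```python
from datetime import date


def dated_path(today: date, prefix: str = "reports") -> str:
    """Build a path like reports/2023-02-15/sales.csv for the given day."""
    # The prefix and file name are placeholders, not real bucket contents
    return f"{prefix}/{today.isoformat()}/sales.csv"


# Evaluated on each run, so every day resolves to a different folder
path = dated_path(date.today())
```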
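The subfolder rule can be pictured with a small filter over object keys. This is only a sketch of the rule as described, not the way Datrics implements it; the folder name and keys below are made up for the example.

```python
def files_directly_in(folder: str, keys: list[str]) -> list[str]:
    """Keep only the object keys that sit directly in folder, not in subfolders."""
    prefix = folder.rstrip("/") + "/"
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # A remaining "/" means the object lives in a subfolder
        if rest and "/" not in rest:
            selected.append(key)
    return selected


# Hypothetical bucket listing
keys = [
    "sales/jan.csv",
    "sales/feb.csv",
    "sales/archive/2022.csv",
    "other/mar.csv",
]
top_level = files_directly_in("sales", keys)
```

Here `sales/archive/2022.csv` is skipped because it sits one level below the selected folder.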
Connect to Athena
To work more efficiently with the Amazon S3 data lake, we have added a connector to Amazon Athena.
“Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.”
The new data connector works similarly to the rest. First, create a new data source for Athena and then add a new dataset for this data source.
REST API request: new design, dynamic configuration, and POST request
An API is the most common way to access external data. We wanted to let analysts use data from API requests in their analytics pipelines without having to write code.
In the previous product update, we introduced the REST API Request brick, which allows you to perform GET calls to open APIs. With the February update, we added the ability to use headers in the request, dynamic configuration, and POST requests. Let’s go through the updates in more detail.
Dynamic configuration of the REST API Request brick allows you to use the input dataset to perform API calls. The feature comes in handy when the request parameters are stored in a database or when you have a long list of URLs to call. It is easier to pass the list of URLs to the REST API Request brick in a data frame and then map the columns of the data frame to the call configuration.
Let’s go through an example. I have a list of tickers that I need to get prices for from Binance. I need to update the list of tickers regularly, and reconfiguring the API request brick with very similar calls seems cumbersome.
With the dynamic configuration, I can create a short pipeline to get the data I need.
3/ Pass the output dataset to the REST API Request brick
I am all set!
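Conceptually, the dynamic configuration turns each row of the input dataset into one call. The sketch below shows that mapping in plain Python. It assumes Binance's public ticker price endpoint and two example symbols, neither of which is named in this post, and it only builds the URLs without sending anything.

```python
from urllib.parse import urlencode

# Assumed endpoint: Binance public ticker price API
BASE_URL = "https://api.binance.com/api/v3/ticker/price"

# Stand-in for the input dataset: one row per ticker (example symbols)
rows = [{"symbol": "BTCUSDT"}, {"symbol": "ETHUSDT"}]


def build_urls(rows, base_url=BASE_URL):
    """Map the 'symbol' column of each row to the query string of a GET call."""
    return [f"{base_url}?{urlencode({'symbol': row['symbol']})}" for row in rows]


urls = build_urls(rows)
```

Updating the list of tickers then means changing the rows only, while the call configuration stays the same.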
The updated REST API Request brick now also supports POST requests. To accommodate POST requests as well as GET requests, we have added the ability to define Auth, headers, and body. The rest of the functionality works the same as for GET.
POST requests also support dynamic mode. To define the body of a POST request with data from the data frame, configure the REST API Request brick in dynamic mode.
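The idea of a dynamic POST body can be sketched with the Python standard library: one request per row, with the row serialized as JSON. The endpoint, column names, and header below are hypothetical, and the requests are only prepared, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and rows, used only for illustration
ENDPOINT = "https://example.com/api/orders"
rows = [{"ticker": "BTCUSDT", "amount": 1}, {"ticker": "ETHUSDT", "amount": 2}]


def build_post_requests(rows, url=ENDPOINT):
    """Create one POST request per row, with the row serialized as the JSON body."""
    requests = []
    for row in rows:
        body = json.dumps(row).encode("utf-8")
        headers = {"Content-Type": "application/json"}
        requests.append(Request(url, data=body, headers=headers, method="POST"))
    return requests


prepared = build_post_requests(rows)
# Nothing is sent here; urllib.request.urlopen(prepared[0]) would perform the call
```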
The updated date range brick in Datrics supports two options for the DateTime list generation: Start / End and Range.
In the Start / End mode, you define the start date, end date, and step. As a result, a new dataset with a datetime list is created. For example, I may create a dataset starting on 1 January 2023 and ending on 31 December 2023 with a step of 7 days. Since 1 January 2023 is a Sunday, this gives me a data frame with all the Sundays of 2023.
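The Start / End logic can be reproduced in a few lines of Python to check what such a list contains. This is a sketch, not the brick's implementation, and it assumes the end date is inclusive. Note that 1 January 2023 falls on a Sunday, so a 7-day step from that date lands on Sundays.

```python
from datetime import date, timedelta


def date_range(start: date, end: date, step_days: int) -> list[date]:
    """Return dates from start to end (inclusive), step_days apart."""
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(days=step_days)
    return dates


weekly = date_range(date(2023, 1, 1), date(2023, 12, 31), 7)
```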
The date may be defined in several ways: a fixed date, a date from the input dataset, or a date offset from today.
In the Range mode, you define the start date, step, and number of periods. In this mode, the step can be positive or negative. This way, you may create a dataset of dates going into the future or the past.
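A sketch of the Range logic, assuming the step is expressed in days and the start date counts as the first period (both are assumptions for the example):

```python
from datetime import date, timedelta


def date_periods(start: date, step_days: int, periods: int) -> list[date]:
    """Return `periods` dates beginning at start, each step_days apart."""
    return [start + timedelta(days=step_days * i) for i in range(periods)]


# A positive step moves into the future, a negative step into the past
future = date_periods(date(2023, 3, 1), 1, 3)
past = date_periods(date(2023, 3, 1), -1, 3)
```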
Create New Column
The new version of the Create New Column brick works in two modes: Add column and New dataset.
Add column adds the column to the input dataset. Therefore, the number of values added equals the dataset size. To create a new data frame of a defined size, select the New dataset option.
The new column may be filled with a flat value, an empty value, a range of values, random values, or a list of values.
Let’s go through a couple of examples.
1/ I want to create a data column with integer values from 0 to 1000 with a step of 5.
2/ I want to get a data column with 10 random float values from 0 to 10.
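For reference, the two columns above correspond to the following plain Python. This is a sketch of the expected values, not of the brick itself; it assumes the upper bound of the range is inclusive, and the fixed seed is there only to make the example reproducible.

```python
import random

# Example 1: integers from 0 to 1000 inclusive, with a step of 5
int_column = list(range(0, 1001, 5))

# Example 2: 10 random floats between 0 and 10
rng = random.Random(42)  # fixed seed, an assumption for reproducibility
float_column = [rng.uniform(0, 10) for _ in range(10)]
```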