5 minutes to export data from MongoDB with Apache Hop

Written by Adalennis Buchillón Soris | May 17, 2022 9:00:00 AM

MongoDB

MongoDB is a document-oriented database that stores data in JSON-like documents with a dynamic schema. This means you can store records without defining their structure up front, such as the number of fields or the type of each field.

Apache Hop is an open-source data engineering and data orchestration platform from the Apache Software Foundation. Hop allows data engineers and data developers to visually design workflows and data pipelines to build powerful solutions. No other data engineering platform currently has the integration with Neo4j that Apache Hop offers.

With the following example, you will learn how to extract data from a MongoDB database using Apache Hop.

As always, the examples here use a Hop project with environment variables to separate code from configuration.

Step 1: Create a MongoDB connection

The MongoDB connection is defined at the project level, so it can be reused by multiple transforms and other plugin types.

To create a MongoDB connection, click on the New -> MongoDB Connection option or on the Metadata -> MongoDB Connection option. The system displays the New MongoDB Connection view with the fields to be configured.

The connection can be configured as in the following example:

MongoDB Connection name: the name of the metadata object (mongodb-connection).
Hostname: the name of the host (${MONGODB_SERVER} = localhost).
Port: the port number (${MONGODB_PORT} = 27017).
Database name: the name of the database (${MONGODB_DATABASE} = how-to).
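
To illustrate how the ${...} variables keep configuration out of the connection definition, here is a minimal Python sketch. It is not how Hop resolves variables internally; it only mimics the substitution with the standard library. The dictionary names and the resulting URI layout are assumptions made for this illustration.

```python
from string import Template

# Values an environment file might provide (taken from the example above).
variables = {
    "MONGODB_SERVER": "localhost",
    "MONGODB_PORT": "27017",
    "MONGODB_DATABASE": "how-to",
}

# Connection fields as they are typed in the dialog (hypothetical key names).
connection_fields = {
    "hostname": "${MONGODB_SERVER}",
    "port": "${MONGODB_PORT}",
    "database": "${MONGODB_DATABASE}",
}


def resolve(fields, values):
    """Replace ${NAME} placeholders; raises KeyError if a variable is missing."""
    return {key: Template(text).substitute(values) for key, text in fields.items()}


resolved = resolve(connection_fields, variables)
uri = "mongodb://{hostname}:{port}/{database}".format(**resolved)
print(uri)  # mongodb://localhost:27017/how-to
```

Pointing the same connection at another server then only requires different variable values, not a change to the connection itself.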

Test the connection by clicking on the Test button.

Step 2: Add and configure a MongoDB input transform

The MongoDB input transform retrieves documents or records from a collection in MongoDB. After creating your pipeline (read-from-mongodb), add a MongoDB input transform. Click anywhere in the pipeline canvas, then Search 'mongodb' -> MongoDB input.

Now it’s time to configure the MongoDB input transform. Open the transform and set your values as in the following example:

Tab: Input options

Transform name: choose a name for your transform. The name must be unique in your pipeline (read addresses from mongodb).
MongoDB Connection: select the connection created in Step 1 (mongodb-connection).
Collection: click on the Get collection button to see the available collections, or type the collection name (addresses-source).

Tab: Query

Query expression (JSON): specify the query to be used. In this case, we keep this as basic as possible and read all data in the collection ({}). Check the MongoDB query docs for more information on how to write real-life queries on your data.
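
As a rough idea of what a less basic query expression could look like, the Python sketch below prints a few JSON query documents. The operators ($gte, $lt, $in) are standard MongoDB query operators, but the field names (city, zip_code) and values are hypothetical and may not match the fields in the sample collection.

```python
import json

# Field names and values are assumptions made for illustration only.
queries = {
    "all documents": {},
    "exact match": {"city": "Antwerp"},
    "range": {"zip_code": {"$gte": 2000, "$lt": 3000}},
    "one of several values": {"city": {"$in": ["Antwerp", "Ghent"]}},
}

# json.dumps produces the text you would paste into the query field.
for label, query in queries.items():
    print(f"{label}: {json.dumps(query)}")
```

Building the expression as a dictionary and serializing it is a simple way to make sure the text pasted into the transform is valid JSON.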

Tab: Fields

Output single JSON field: this option is selected by default and returns each document as a single field containing JSON.

In this example, the data is extracted as separate columns: uncheck the Output single JSON field option and click on the Get fields button to retrieve the fields of the collection.
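
The difference between the two output modes can be sketched in Python as follows. This is only an illustration of the resulting row shapes, not Hop's implementation, and the sample documents and field names are made up.

```python
import json

# Hypothetical documents, standing in for the contents of a collection.
documents = [
    {"_id": 1, "street": "Main Street 1", "city": "Antwerp"},
    {"_id": 2, "street": "Station Road 5", "city": "Ghent", "zip_code": "9000"},
]

# Mode 1: one row per document, with the whole document in a single JSON field.
json_rows = [{"json": json.dumps(doc)} for doc in documents]

# Mode 2: one column per field. Collect field names in order of first
# appearance, since documents in a collection may not share the same fields.
field_names = []
for doc in documents:
    for name in doc:
        if name not in field_names:
            field_names.append(name)

# Fields missing from a document become None (an empty value).
column_rows = [[doc.get(name) for name in field_names] for doc in documents]

print(json_rows[0]["json"])
print(field_names)
print(column_rows)
```

The column layout is the one a CSV export needs, which is why the example unchecks the single JSON field option.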

To preview the data that will be read, click on the Preview button.

Click on the Close and OK options to save the configuration.

Step 3: Add and configure a Text File output transform

The Text File output transform exports data to a text file. It is commonly used to generate comma-separated values (CSV) files that can be read by spreadsheet applications.

Add a Text File output transform by clicking anywhere in the pipeline canvas, then Search 'text' -> Text File output.

Now it’s time to configure the Text File output transform. Open the transform and set your values as in the following example:

Tab: File

Transform name: choose a name for your transform. The name must be unique in your pipeline (write addresses to csv).
Filename: specify the filename and location of the output text file. You can use the PROJECT_HOME variable and add the folder and file name (${PROJECT_HOME}/files/addresses).
Extension: specify the extension of the filename (csv).

Tab: Fields

Click on the Get Fields button to get the fields from the previous transform, then click on the OK button to save.
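
For comparison, the kind of file this transform produces can be sketched with Python's csv module. The rows, the field names, and the semicolon separator are assumptions for this illustration; the actual separator depends on how the transform is configured.

```python
import csv
import io

# Hypothetical fields and rows, standing in for the output of the MongoDB input.
field_names = ["street", "city"]
rows = [
    {"street": "Main Street 1", "city": "Antwerp"},
    {"street": "Station Road 5", "city": "Ghent"},
]


def rows_to_csv(field_names, rows, delimiter=";"):
    """Return the rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=field_names, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(field_names, rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; replacing the buffer with an opened file would write the same content to disk.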

Step 4: Run your pipeline

Finally, run your pipeline by clicking on the Run -> Launch option.

The 'local' run configuration should have been created with your Hop project. If it wasn't, check the Hop documentation to create a pipeline run configuration.

Open the CSV file to see the exported data.

You can find the samples in the 5-minutes-to GitHub repository.

