MongoDB - basic features

Goal:



master the basic features of mongodb



  • fill MongoDB with data;
  • write multiple queries to fetch and update data
  • create indexes and compare performance.






Filling the Data Warehouse



In search of enough data to study the basic capabilities of MongoDB, I settled on the NASA Earth Meteorite Landings dataset (1000 lines with information about meteorites that fell to Earth, from the repository https://github.com/jdorfman/awesome- json-datasets .



Note : found more complete (45.7K) information https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh , but exporting to JSON via their API only gives 1000 records (incomprehensible) , it is necessary to shovel the complete data of exported CSV file https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD ?



Note: oops, it is possible to get full data in JSON, but this is a hack. I sincerely hope that this is not SQL-injection



https:? //Data.nasa.gov/api/id/gh4g-9sfh.json $ select = ` name`,` id`, `nametype`,` recclass`, ` mass`, `fall`,` year`, `reclat`,` reclong`, `geolocation` & $ order =`: id` + ASC & $ limit = 46000 & $ offset = 0



wc ./gh4g-9sfh.json
     45716   128491 10441343 ./gh4g-9sfh.json




To interact with the server on the network, I installed only the client and the tools on the local machine:



wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list

sudo apt-get update
sudo apt-get install -y mongodb-org-shell
sudo apt-get install -y mongodb-org-tools




Connection check:



mongo nosql-2020.otus --port 32789

    MongoDB shell version v4.4.1
    connecting to: mongodb://nosql-2020.otus:32789/test?compressors=disabled&gssapiServiceName=mongodb
    Implicit session: session { "id" : UUID("5ff24788-0710-4a1a-821f-7acb2eddfb4f") }
    MongoDB server version: 4.4.1
    Welcome to the MongoDB shell.
    For interactive help, type "help".
    For more comprehensive documentation, see
            https://docs.mongodb.com/
    Questions? Try the MongoDB Developer Community Forums
            https://community.mongodb.com




Importing data from a local machine to a remote one:



mongoimport --host nosql-2020.otus  --db "otus_003" --port 32789 --collection eml45k --jsonArray --file ./003_MONGODB.files/gh4g-9sfh.json
    
    2020-10-12T01:01:13.826+0100    connected to: mongodb://nosql-2020.otus:32789/
    2020-10-12T01:01:16.827+0100    [#######.................] otus_003.eml45k      2.99MB/9.96MB (30.0%)
    2020-10-12T01:01:19.827+0100    [###############.........] otus_003.eml45k      6.44MB/9.96MB (64.6%)
    2020-10-12T01:01:22.827+0100    [#######################.] otus_003.eml45k      9.81MB/9.96MB (98.5%)
    2020-10-12T01:01:23.035+0100    [########################] otus_003.eml45k      9.96MB/9.96MB (100.0%)
    2020-10-12T01:01:23.035+0100    45716 document(s) imported successfully. 0 document(s) failed to import.




Note : 10 seconds for everything, funny data fetch







> show databases
admin     0.000GB
config    0.000GB
local     0.000GB
otus_003  0.000GB
test      0.000GB
> use otus_003
switched to db otus_003
> show collections
em
l




We are looking for a meteorite with a known name:



> db.eml45k.find({name:"Bjelaja Zerkov"})

    { "_id" : ObjectId("5f8380a91c0ab84b54bfe394"), "name" : "Bjelaja Zerkov", "id" : "5063", "nametype" : "Valid", "recclass" : "H6", "mass" : "1850", "fall" : "Fell", "year" : "1796-01-01T00:00:00.000", "reclat" : "49.783330", "reclong" : "30.166670", "geolocation" : { "latitude" : "49.78333", "longitude" : "30.16667" } }




We are looking for a meteorite using known coordinates:



> db.eml45k.find({ "geolocation" : { "latitude" : "44.83333" , "longitude" : "95.16667" } })

{ "_id" : ObjectId("5f8380a91c0ab84b54bfe322"), "name" : "Adzhi-Bogdo (stone)", "id" : "390", "nametype" : "Valid", "recclass" : "LL3-6", "mass" : "910", "fall" : "Fell", "year" : "1949-01-01T00:00:00.000", "reclat" : "44.833330", "reclong" : "95.166670", "geolocation" : { "latitude" : "44.83333", "longitude" : "95.16667" } }




Fetching a list of fallen meteorites sorted by year of fall (I wonder why NASA does not have a specific fall time according to Greenwich Mean Time) and limiting the list of selectable fields:

> db.eml45k.find( { }, {year: 1, id: 1, name: 1, _id: 0 }).sort( { year: -1 } )

{ "name" : "Northwest Africa 7701", "id" : "57150", "year" : "2101-01-01T00:00:00.000" }
{ "name" : "Chelyabinsk", "id" : "57165", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7755", "id" : "57166", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7812", "id" : "57258", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7822", "id" : "57268", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7856", "id" : "57421", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7855", "id" : "57420", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7857", "id" : "57422", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7858", "id" : "57423", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7861", "id" : "57425", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7862", "id" : "57426", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Northwest Africa 7863", "id" : "57427", "year" : "2013-01-01T00:00:00.000" }
{ "name" : "Battle Mountain", "id" : "56133", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Sutter's Mill", "id" : "55529", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Antelope", "id" : "57455", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Catalina 009", "id" : "57173", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Jiddat al Harasis 799", "id" : "57428", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Johannesburg", "id" : "55765", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Ksar Ghilane 011", "id" : "55606", "year" : "2012-01-01T00:00:00.000" }
{ "name" : "Ksar Ghilane 010", "id" : "55605", "year" : "2012-01-01T00:00:00.000" }
Type "it" for more
>




Something I haven't found? how to combine two fields directly on selection:



"geolocation" : { "latitude" : "49.78333", "longitude" : "30.16667" } 




to a point (it should be noted that the order was changed in accordance with the documentation ( https://docs.mongodb.com/manual/geospatial-queries/#geospatial-legacy , that is



< field >: [< longitude >, < latitude >]):




"geolocation" : { "type" : "Point", "coordinates" : [  30.16667 , 49.78333 ] } }




directly when asked to do (who knows?) something like this:



> db.eml45k.find({ 
    [ 
        {$toDouble: "$geolocation.longitude"} ,
        {$toDouble: "$geolocation.latitude"} 
    ] : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    ... ,
                ]]
            }
        }
    }
}) 




Therefore, I created an artificial field in the collection:



db.eml45k.updateMany( 
    {},
    [{
        $set: {
            "pointed_geolocation.type" : "Point",
            "pointed_geolocation.coordinates" : [ 
                { $toDouble : "$geolocation.longitude" } , 
                { $toDouble: "$geolocation.latitude" } 
            ]
        }
    }]
);

{ "acknowledged" : true, "matchedCount" : 45716, "modifiedCount" : 45716 }




and we can finally go in search of undetected meteorites that fell in a certain area:



> db.eml45k.find({ 
    "pointed_geolocation.coordinates" : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    [ 47.0 , 33.0  ], 
                    [ 47.0 , 65.0 ], 
                    [ 169.0 , 65.0 ],
                    [ 169.0 ,  33.0 ],
                    [ 47.0 , 33.0 ]
                ]]
            }
        }
    },
    'fall': 'Fell'
},
{ 
    year: {$year: { "$toDate": "$year"}}, 
    "pointed_geolocation.coordinates": 1, 
    name: 1, 
    _id: 0 
}).sort( { year: -1 } )




Sample
{ «name»: «Chelyabinsk», «pointed_geolocation»: { «coordinates»: [ 61.11667, 54.81667 ] }, «year»: 2013 }

{ «name»: «Dashoguz», «pointed_geolocation»: { «coordinates»: [ 59.685, 41.98444 ] }, «year»: 1998 }

{ «name»: «Kunya-Urgench», «pointed_geolocation»: { «coordinates»: [ 59.2, 42.25 ] }, «year»: 1998 }

{ «name»: «Sterlitamak», «pointed_geolocation»: { «coordinates»: [ 55.98333, 53.66667 ] }, «year»: 1990 }

{ «name»: «Undulung», «pointed_geolocation»: { «coordinates»: [ 124.76667, 66.13889 ] }, «year»: 1986 }

{ «name»: «Omolon», «pointed_geolocation»: { «coordinates»: [ 161.80833, 64.02 ] }, «year»: 1981 }

{ «name»: «Yardymly», «pointed_geolocation»: { «coordinates»: [ 48.25, 38.93333 ] }, «year»: 1959 }

{ «name»: «Vengerovo», «pointed_geolocation»: { «coordinates»: [ 77.26667, 56.13333 ] }, «year»: 1950 }

{ «name»: «Kunashak», «pointed_geolocation»: { «coordinates»: [ 61.36667, 55.78333 ] }, «year»: 1949 }

{ «name»: «Krasnyi Klyuch», «pointed_geolocation»: { «coordinates»: [ 56.08333, 54.33333 ] }, «year»: 1946 }

{ «name»: «Lavrentievka», «pointed_geolocation»: { «coordinates»: [ 51.56667, 52.45 ] }, «year»: 1938 }

{ «name»: «Pavlodar (stone)», «pointed_geolocation»: { «coordinates»: [ 77.03333, 52.3 ] }, «year»: 1938 }

{ «name»: «Kainsaz», «pointed_geolocation»: { «coordinates»: [ 53.25, 55.43333 ] }, «year»: 1937 }

{ «name»: «Ichkala», «pointed_geolocation»: { «coordinates»: [ 82.93333, 58.2 ] }, «year»: 1936 }

{ «name»: «Nikolaevka», «pointed_geolocation»: { «coordinates»: [ 78.63333, 52.45 ] }, «year»: 1935 }

{ «name»: «Brient», «pointed_geolocation»: { «coordinates»: [ 59.31667, 52.13333 ] }, «year»: 1933 }

{ «name»: «Pesyanoe», «pointed_geolocation»: { «coordinates»: [ 66.08333, 55.5 ] }, «year»: 1933 }

{ «name»: «Kuznetzovo», «pointed_geolocation»: { «coordinates»: [ 75.33333, 55.2 ] }, «year»: 1932 }

{ «name»: «Boriskino», «pointed_geolocation»: { «coordinates»: [ 52.48333, 54.23333 ] }, «year»: 1930 }

{ «name»: «Khmelevka», «pointed_geolocation»: { «coordinates»: [ 75.33333, 56.75 ] }, «year»: 1929 }

Type «it» for more

> it

{ «name»: «Mamra Springs», «pointed_geolocation»: { «coordinates»: [ 62.08333, 45.21667 ] }, «year»: 1927 }

{ «name»: «Demina», «pointed_geolocation»: { «coordinates»: [ 84.76667, 51.46667 ] }, «year»: 1911 }

{ «name»: «Krutikha», «pointed_geolocation»: { «coordinates»: [ 77, 56.8 ] }, «year»: 1906 }

{ «name»: «Barnaul», «pointed_geolocation»: { «coordinates»: [ 84.08333, 52.73333 ] }, «year»: 1904 }

{ «name»: «Tyumen», «pointed_geolocation»: { «coordinates»: [ 65.53333, 57.16667 ] }, «year»: 1903 }

{ «name»: «Ochansk», «pointed_geolocation»: { «coordinates»: [ 55.26667, 57.78333 ] }, «year»: 1887 }







Strange, I didn’t know that Chelyabinskiy was not listed in the found category.



Let's aggregate and find how many were found and how many were not:



db.eml45k.aggregate([
{ $match: { 
    "pointed_geolocation.coordinates" : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    [ 47.0 , 33.0  ], 
                    [ 47.0 , 65.0 ], 
                    [ 169.0 , 65.0 ],
                    [ 169.0 ,  33.0 ],
                    [ 47.0 , 33.0 ]
                ]]
            }
        }
    }
} },
    {"$group" : {_id: "$fall", count: { $sum: 1 }}}
])

{ "_id" : "Fell", "count" : 26 }
{ "_id" : "Found", "count" : 63 }




In total, 63 of 89 were found, and __26__ - __not__ were found, so there is a chance :)



Using indexes



Let's remove all indexes in the collection from previous experiments:



db.eml45k.dropIndexes()

{
        "nIndexesWas" : 1,
        "msg" : "non-_id indexes dropped for collection",
        "ok" : 1
}




Let's try to see the estimated execution time of the request:



db.eml45k.find({ 
    "pointed_geolocation.coordinates" : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    [ 47.0 , 33.0  ], 
                    [ 47.0 , 65.0 ], 
                    [ 169.0 , 65.0 ],
                    [ 169.0 ,  33.0 ],
                    [ 47.0 , 33.0 ]
                ]]
            }
        }
    }
}).explain("executionStats").executionStats.executionTimeMillis
...
110
...
110
...
109




The result is approximately 110 seconds on average.



Let's index:


db.eml45k.createIndex( { "pointed_geolocation" : "2dsphere" } )

    {
        "ok" : 0,
        "errmsg" : "Index build failed: 98b9ead2-c156-4312-81af-1adf5896e3c9: Collection otus_003.eml45k ( 6db2d178-61b5-4627-8512-fcf919fe596f ) :: caused by :: Can't extract geo keys: { _id: ObjectId('5f838e30fb89bd9d553ae27f'), name: \"Bulls Run\", id: \"5163\", nametype: \"Valid\", recclass: \"Iron?\", mass: \"2250\", fall: \"Fell\", year: \"1964-01-01T00:00:00.000\", pointed_geolocation: { type: \"Point\", coordinates: [ null, null ] } }  Point must only contain numeric elements",
        "code" : 16755,
        "codeName" : "Location16755"
    }





The error is due to NULL values, I didn't find something straight away how (who knows?) To exclude it from the index during indexing, so I'll delete these keys:



db.eml45k.updateMany(
    { "pointed_geolocation.coordinates" : [ null , null ] },     
    [{         
        $set: { "pointed_geolocation": null }
    }] 
);




Trying the index again



db.eml45k.createIndex( { "pointed_geolocation" : "2dsphere" } )
    
    {
        "ok" : 0,
        "errmsg" : "Index build failed: d33b31d4-4778-4537-a087-58b7bd1968f3: Collection otus_003.eml45k ( 6db2d178-61b5-4627-8512-fcf919fe596f ) :: caused by :: Can't extract geo keys: { _id: ObjectId('5f838e35fb89bd9d553b3b8f'), name: \"Meridiani Planum\", id: \"32789\", nametype: \"Valid\", recclass: \"Iron, IAB complex\", fall: \"Found\", year: \"2005-01-01T00:00:00.000\", reclat: \"-1.946170\", reclong: \"354.473330\", geolocation: { latitude: \"-1.94617\", longitude: \"354.47333\" }, pointed_geolocation: { type: \"Point\", coordinates: [ 354.47333, -1.94617 ] } }  longitude/latitude is out of bounds, lng: 354.473 lat: -1.94617",
        "code" : 16755,
        "codeName" : "Location16755"
    }




Error __longitude / latitude is out of bounds, lng: 354.473 lat: -1.94617__ and in the documentation https://docs.mongodb.com/manual/geospatial-queries/#geospatial-legacy




    Valid longitude values are between -180 and 180, both inclusive.
    Valid latitude values are between -90 and 90, both inclusive.




and 354.47333 is not included in the range from -180 to 180.



Very strange, at first I thought there needs to be an amendment everywhere by minus 180 to make



(`$subtract: [{ $toDouble : "$geolocation.longitude" }, 180.0]`)




, but in the end everything is not so simple.



What longitudes are not in range:



db.eml45k.find({"pointed_geolocation.coordinates.0": {$lt: -180}} ) #  ,    
db.eml45k.find({"pointed_geolocation.coordinates.0": {$lt: 0}} ) #   ,    
db.eml45k.find({"pointed_geolocation.coordinates.0": {$gt: 180}} ) #  ,     


    { "_id" : ObjectId("5f838e35fb89bd9d553b3b8f"), "name" : "Meridiani Planum", "id" : "32789", "nametype" : "Valid", "recclass" : "Iron, IAB complex", "fall" : "Found", "year" : "2005-01-01T00:00:00.000", "reclat" : "-1.946170", "reclong" : "354.473330", "geolocation" : { "latitude" : "-1.94617", "longitude" : "354.47333" }, "pointed_geolocation" : { "type" : "Point", "coordinates" : [ 354.47333, -1.94617 ] } }





As a result, only one meteorite has strange coordinates. After searching, I found out that this Meridiani Planum meteorite was accidentally found by the Opportunity rover in 2005

( http://old.mirf.ru/Articles/art2427_2.htm ). This is a ( WARNING ) Martian meteorite found ( WARNING ) on Mars. Here are NASA jokers.



Let's remove it from the collection.



db.eml45k.remove({"id" : "32789"})
WriteResult({ "nRemoved" : 1 })




Indexing

> db.eml45k.createIndex( { "pointed_geolocation" : "2dsphere" } )
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}




We measure



db.eml45k.find({ 
    "pointed_geolocation.coordinates" : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    [ 47.0 , 33.0  ], 
                    [ 47.0 , 65.0 ], 
                    [ 169.0 , 65.0 ],
                    [ 169.0 ,  33.0 ],
                    [ 47.0 , 33.0 ]
                ]]
            }
        }
    }
}).explain("executionStats").executionStats.executionTimeMillis




As a result, when rechecking 104 ... 107 ... 106 ...



It's kind of strange, not very bright.



Deleted index, checked.



Without an index and with an index - it's the same.



I try it separately for Chelyabinsk:



db.eml45k.find(
    {"pointed_geolocation.coordinates" : [ 61.11667, 54.81667 ]}
).explain("executionStats").executionStats.executionTimeMillis




without an index and with an index - the same.



You need to be more careful, the index is built for the pointed_geolocation field , and pointed_geolocation.coordinates is involved in the query .



As a result, the query



db.eml45k.find({ 
    "pointed_geolocation" : {
        $geoWithin: {
            $geometry: {
                type : "Polygon" ,
                coordinates: [[
                    [ 47.0 , 33.0  ], 
                    [ 47.0 , 65.0 ], 
                    [ 169.0 , 65.0 ],
                    [ 169.0 ,  33.0 ],
                    [ 47.0 , 33.0 ]
                ]]
            }
        }
    }
}).explain("executionStats").executionStats.executionTimeMillis




without the index 125, 123, 119, 123 milliseconds, and with the index - 7, 4, 4, 5.



Everything worked out.



All Articles